Merging and Joining DataFrames in Pandas

Introduction

Pandas provides powerful functions to combine multiple datasets efficiently. In real-world scenarios, data is often stored in separate tables, requiring merging or joining operations to analyze it effectively.

Pandas offers three primary functions for combining DataFrames:

merge(): Used for database-style joins (similar to SQL).
join(): Simplifies merging based on index values.
concat(): Stacks DataFrames vertically or horizontally.

Step 1: Understanding the Need for Merging & Joining

When working with multiple datasets, merging is required to:

Combine related data from different sources.
Integrate data based on common keys (e.g., Customer ID, Product ID).
Perform relational operations like inner joins, outer joins, left joins, and right joins.

Example:
We have two tables:
Orders Table

Order_ID	Customer_ID	Amount
101	C001	250
102	C002	400
103	C003	150

Customers Table

Customer_ID	Name	Country
C001	John	USA
C002	Alice	Canada
C004	Mark	UK

To analyze orders along with customer details, we must merge these tables based on Customer_ID.

Step 2: Using `merge()` for SQL-like Joins

The merge() function is used to join two DataFrames based on common columns or indices.

Types of Joins in Pandas

Join Type	Description
Inner Join	Returns only matching rows from both tables.
Left Join	Returns all rows from the left table and matching rows from the right.
Right Join	Returns all rows from the right table and matching rows from the left.
Outer Join	Returns all rows from both tables, filling missing values with NaN.

Example Dataset Creation

import pandas as pd

# Creating Orders DataFrame
orders = pd.DataFrame({
    'Order_ID': [101, 102, 103],
    'Customer_ID': ['C001', 'C002', 'C003'],
    'Amount': [250, 400, 150]
})

# Creating Customers DataFrame
customers = pd.DataFrame({
    'Customer_ID': ['C001', 'C002', 'C004'],
    'Name': ['John', 'Alice', 'Mark'],
    'Country': ['USA', 'Canada', 'UK']
})

1. Inner Join (`how='inner'`)

Returns only matching records based on the key column.

inner_join = pd.merge(orders, customers, on='Customer_ID', how='inner')
print(inner_join)

Output:

Order_ID	Customer_ID	Amount	Name	Country
101	C001	250	John	USA
102	C002	400	Alice	Canada

✅ Observations:

The row with Customer_ID = C003 (from orders) is removed because it has no match in customers.
The row with Customer_ID = C004 (from customers) is removed because it has no match in orders.

2. Left Join (`how='left'`)

Keeps all records from the left table (orders), and fills missing values from the right table with NaN.

left_join = pd.merge(orders, customers, on='Customer_ID', how='left')
print(left_join)

Output:

Order_ID	Customer_ID	Amount	Name	Country
101	C001	250	John	USA
102	C002	400	Alice	Canada
103	C003	150	NaN	NaN

✅ Observations:

The row with Customer_ID = C003 remains in the output but has NaN in Name and Country columns because it has no match in customers.

3. Right Join (`how='right'`)

Keeps all records from the right table (customers) and fills missing values from the left table.

right_join = pd.merge(orders, customers, on='Customer_ID', how='right')
print(right_join)

Output:

Order_ID	Customer_ID	Amount	Name	Country
101	C001	250	John	USA
102	C002	400	Alice	Canada
NaN	C004	NaN	Mark	UK

✅ Observations:

The row with Customer_ID = C004 remains in the output but has NaN in Order_ID and Amount because it has no match in orders.

4. Outer Join (`how='outer'`)

Returns all records from both tables, filling missing values with NaN.

outer_join = pd.merge(orders, customers, on='Customer_ID', how='outer')
print(outer_join)

Output:

Order_ID	Customer_ID	Amount	Name	Country
101	C001	250	John	USA
102	C002	400	Alice	Canada
103	C003	150	NaN	NaN
NaN	C004	NaN	Mark	UK

✅ Observations:

The output contains all records from both tables.
Missing values are filled with NaN.

Step 3: Using `join()` for Index-Based Merging

If the merge key is in the index, join() provides a more convenient way to combine datasets.

Example: Index-Based Join

customers.set_index('Customer_ID', inplace=True)
orders.set_index('Customer_ID', inplace=True)

joined_df = orders.join(customers, how='left')
print(joined_df)

Output:

Order_ID	Amount	Name	Country
101	250	John	USA
102	400	Alice	Canada
103	150	NaN	NaN

✅ Observations:

join() merges on index values, simplifying the syntax.

Step 4: Using `concat()` for Concatenation

The concat() function stacks DataFrames either vertically (rows) or horizontally (columns).

Concatenating Rows (`axis=0`)

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

concat_rows = pd.concat([df1, df2], axis=0)
print(concat_rows)

Output:

A	B
1	3
2	4
5	7
6	8

Concatenating Columns (`axis=1`)

concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)

Output:

A	B	A	B
1	3	5	7
2	4	6	8

Merging and Joining DataFrames in Pandas

Introduction

Step 1: Understanding the Need for Merging & Joining

Step 2: Using merge() for SQL-like Joins

Types of Joins in Pandas

Example Dataset Creation

1. Inner Join (how='inner')

2. Left Join (how='left')

3. Right Join (how='right')

4. Outer Join (how='outer')

Step 3: Using join() for Index-Based Merging

Example: Index-Based Join

Step 4: Using concat() for Concatenation

Concatenating Rows (axis=0)

Concatenating Columns (axis=1)

Leave a Reply Cancel reply

Step 2: Using `merge()` for SQL-like Joins

1. Inner Join (`how='inner'`)

2. Left Join (`how='left'`)

3. Right Join (`how='right'`)

4. Outer Join (`how='outer'`)

Step 3: Using `join()` for Index-Based Merging

Step 4: Using `concat()` for Concatenation

Concatenating Rows (`axis=0`)

Concatenating Columns (`axis=1`)