Merging and Joining DataFrames in Pandas
Introduction
Pandas provides powerful functions to combine multiple datasets efficiently. In real-world scenarios, data is often stored in separate tables, requiring merging or joining operations to analyze it effectively.
Pandas offers three primary functions for combining DataFrames:
merge()
: Used for database-style joins (similar to SQL).join()
: Simplifies merging based on index values.concat()
: Stacks DataFrames vertically or horizontally.
Step 1: Understanding the Need for Merging & Joining
When working with multiple datasets, merging is required to:
- Combine related data from different sources.
- Integrate data based on common keys (e.g., Customer ID, Product ID).
- Perform relational operations like inner joins, outer joins, left joins, and right joins.
Example:
We have two tables:
Orders Table
Order_ID | Customer_ID | Amount |
---|---|---|
101 | C001 | 250 |
102 | C002 | 400 |
103 | C003 | 150 |
Customers Table
Customer_ID | Name | Country |
---|---|---|
C001 | John | USA |
C002 | Alice | Canada |
C004 | Mark | UK |
To analyze orders along with customer details, we must merge these tables based on Customer_ID
.
Step 2: Using merge()
for SQL-like Joins
The merge()
function is used to join two DataFrames based on common columns or indices.
Types of Joins in Pandas
Join Type | Description |
---|---|
Inner Join | Returns only matching rows from both tables. |
Left Join | Returns all rows from the left table and matching rows from the right. |
Right Join | Returns all rows from the right table and matching rows from the left. |
Outer Join | Returns all rows from both tables, filling missing values with NaN. |
Example Dataset Creation
import pandas as pd
# Creating Orders DataFrame
orders = pd.DataFrame({
'Order_ID': [101, 102, 103],
'Customer_ID': ['C001', 'C002', 'C003'],
'Amount': [250, 400, 150]
})
# Creating Customers DataFrame
customers = pd.DataFrame({
'Customer_ID': ['C001', 'C002', 'C004'],
'Name': ['John', 'Alice', 'Mark'],
'Country': ['USA', 'Canada', 'UK']
})
1. Inner Join (how='inner'
)
Returns only matching records based on the key column.
inner_join = pd.merge(orders, customers, on='Customer_ID', how='inner')
print(inner_join)
Output:
Order_ID | Customer_ID | Amount | Name | Country |
---|---|---|---|---|
101 | C001 | 250 | John | USA |
102 | C002 | 400 | Alice | Canada |
✅ Observations:
- The row with
Customer_ID = C003
(fromorders
) is removed because it has no match incustomers
. - The row with
Customer_ID = C004
(fromcustomers
) is removed because it has no match inorders
.
2. Left Join (how='left'
)
Keeps all records from the left table (orders
), and fills missing values from the right table with NaN.
left_join = pd.merge(orders, customers, on='Customer_ID', how='left')
print(left_join)
Output:
Order_ID | Customer_ID | Amount | Name | Country |
---|---|---|---|---|
101 | C001 | 250 | John | USA |
102 | C002 | 400 | Alice | Canada |
103 | C003 | 150 | NaN | NaN |
✅ Observations:
- The row with
Customer_ID = C003
remains in the output but hasNaN
inName
andCountry
columns because it has no match incustomers
.
3. Right Join (how='right'
)
Keeps all records from the right table (customers
) and fills missing values from the left table.
right_join = pd.merge(orders, customers, on='Customer_ID', how='right')
print(right_join)
Output:
Order_ID | Customer_ID | Amount | Name | Country |
---|---|---|---|---|
101 | C001 | 250 | John | USA |
102 | C002 | 400 | Alice | Canada |
NaN | C004 | NaN | Mark | UK |
✅ Observations:
- The row with
Customer_ID = C004
remains in the output but hasNaN
inOrder_ID
andAmount
because it has no match inorders
.
4. Outer Join (how='outer'
)
Returns all records from both tables, filling missing values with NaN.
outer_join = pd.merge(orders, customers, on='Customer_ID', how='outer')
print(outer_join)
Output:
Order_ID | Customer_ID | Amount | Name | Country |
---|---|---|---|---|
101 | C001 | 250 | John | USA |
102 | C002 | 400 | Alice | Canada |
103 | C003 | 150 | NaN | NaN |
NaN | C004 | NaN | Mark | UK |
✅ Observations:
- The output contains all records from both tables.
- Missing values are filled with
NaN
.
Step 3: Using join()
for Index-Based Merging
If the merge key is in the index, join()
provides a more convenient way to combine datasets.
Example: Index-Based Join
customers.set_index('Customer_ID', inplace=True)
orders.set_index('Customer_ID', inplace=True)
joined_df = orders.join(customers, how='left')
print(joined_df)
Output:
Order_ID | Amount | Name | Country |
---|---|---|---|
101 | 250 | John | USA |
102 | 400 | Alice | Canada |
103 | 150 | NaN | NaN |
✅ Observations:
join()
merges on index values, simplifying the syntax.
Step 4: Using concat()
for Concatenation
The concat()
function stacks DataFrames either vertically (rows) or horizontally (columns).
Concatenating Rows (axis=0
)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concat_rows = pd.concat([df1, df2], axis=0)
print(concat_rows)
Output:
A | B |
---|---|
1 | 3 |
2 | 4 |
5 | 7 |
6 | 8 |
Concatenating Columns (axis=1
)
concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)
Output:
A | B | A | B |
---|---|---|---|
1 | 3 | 5 | 7 |
2 | 4 | 6 | 8 |