![]()
Merging and Joining DataFrames in Pandas
Introduction
Pandas provides powerful functions to combine multiple datasets efficiently. In real-world scenarios, data is often stored in separate tables, requiring merging or joining operations to analyze it effectively.
Pandas offers three primary functions for combining DataFrames:
merge(): Used for database-style joins (similar to SQL).join(): Simplifies merging based on index values.concat(): Stacks DataFrames vertically or horizontally.
Step 1: Understanding the Need for Merging & Joining
When working with multiple datasets, merging is required to:
- Combine related data from different sources.
- Integrate data based on common keys (e.g., Customer ID, Product ID).
- Perform relational operations like inner joins, outer joins, left joins, and right joins.
Example:
We have two tables:
Orders Table
| Order_ID | Customer_ID | Amount |
|---|---|---|
| 101 | C001 | 250 |
| 102 | C002 | 400 |
| 103 | C003 | 150 |
Customers Table
| Customer_ID | Name | Country |
|---|---|---|
| C001 | John | USA |
| C002 | Alice | Canada |
| C004 | Mark | UK |
To analyze orders along with customer details, we must merge these tables based on Customer_ID.
Step 2: Using merge() for SQL-like Joins
The merge() function is used to join two DataFrames based on common columns or indices.
Types of Joins in Pandas
| Join Type | Description |
|---|---|
| Inner Join | Returns only matching rows from both tables. |
| Left Join | Returns all rows from the left table and matching rows from the right. |
| Right Join | Returns all rows from the right table and matching rows from the left. |
| Outer Join | Returns all rows from both tables, filling missing values with NaN. |
Example Dataset Creation
import pandas as pd
# Creating Orders DataFrame
orders = pd.DataFrame({
'Order_ID': [101, 102, 103],
'Customer_ID': ['C001', 'C002', 'C003'],
'Amount': [250, 400, 150]
})
# Creating Customers DataFrame
customers = pd.DataFrame({
'Customer_ID': ['C001', 'C002', 'C004'],
'Name': ['John', 'Alice', 'Mark'],
'Country': ['USA', 'Canada', 'UK']
})
1. Inner Join (how='inner')
Returns only matching records based on the key column.
inner_join = pd.merge(orders, customers, on='Customer_ID', how='inner')
print(inner_join)
Output:
| Order_ID | Customer_ID | Amount | Name | Country |
|---|---|---|---|---|
| 101 | C001 | 250 | John | USA |
| 102 | C002 | 400 | Alice | Canada |
✅ Observations:
- The row with
Customer_ID = C003(fromorders) is removed because it has no match incustomers. - The row with
Customer_ID = C004(fromcustomers) is removed because it has no match inorders.
2. Left Join (how='left')
Keeps all records from the left table (orders), and fills missing values from the right table with NaN.
left_join = pd.merge(orders, customers, on='Customer_ID', how='left')
print(left_join)
Output:
| Order_ID | Customer_ID | Amount | Name | Country |
|---|---|---|---|---|
| 101 | C001 | 250 | John | USA |
| 102 | C002 | 400 | Alice | Canada |
| 103 | C003 | 150 | NaN | NaN |
✅ Observations:
- The row with
Customer_ID = C003remains in the output but hasNaNinNameandCountrycolumns because it has no match incustomers.
3. Right Join (how='right')
Keeps all records from the right table (customers) and fills missing values from the left table.
right_join = pd.merge(orders, customers, on='Customer_ID', how='right')
print(right_join)
Output:
| Order_ID | Customer_ID | Amount | Name | Country |
|---|---|---|---|---|
| 101 | C001 | 250 | John | USA |
| 102 | C002 | 400 | Alice | Canada |
| NaN | C004 | NaN | Mark | UK |
✅ Observations:
- The row with
Customer_ID = C004remains in the output but hasNaNinOrder_IDandAmountbecause it has no match inorders.
4. Outer Join (how='outer')
Returns all records from both tables, filling missing values with NaN.
outer_join = pd.merge(orders, customers, on='Customer_ID', how='outer')
print(outer_join)
Output:
| Order_ID | Customer_ID | Amount | Name | Country |
|---|---|---|---|---|
| 101 | C001 | 250 | John | USA |
| 102 | C002 | 400 | Alice | Canada |
| 103 | C003 | 150 | NaN | NaN |
| NaN | C004 | NaN | Mark | UK |
✅ Observations:
- The output contains all records from both tables.
- Missing values are filled with
NaN.
Step 3: Using join() for Index-Based Merging
If the merge key is in the index, join() provides a more convenient way to combine datasets.
Example: Index-Based Join
customers.set_index('Customer_ID', inplace=True)
orders.set_index('Customer_ID', inplace=True)
joined_df = orders.join(customers, how='left')
print(joined_df)
Output:
| Order_ID | Amount | Name | Country |
|---|---|---|---|
| 101 | 250 | John | USA |
| 102 | 400 | Alice | Canada |
| 103 | 150 | NaN | NaN |
✅ Observations:
join()merges on index values, simplifying the syntax.
Step 4: Using concat() for Concatenation
The concat() function stacks DataFrames either vertically (rows) or horizontally (columns).
Concatenating Rows (axis=0)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
concat_rows = pd.concat([df1, df2], axis=0)
print(concat_rows)
Output:
| A | B |
|---|---|
| 1 | 3 |
| 2 | 4 |
| 5 | 7 |
| 6 | 8 |
Concatenating Columns (axis=1)
concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)
Output:
| A | B | A | B |
|---|---|---|---|
| 1 | 3 | 5 | 7 |
| 2 | 4 | 6 | 8 |
