Merging and Joining DataFrames in Pandas

Loading

Merging and Joining DataFrames in Pandas

Introduction

Pandas provides powerful functions to combine multiple datasets efficiently. In real-world scenarios, data is often stored in separate tables, requiring merging or joining operations to analyze it effectively.

Pandas offers three primary functions for combining DataFrames:

  • merge(): Used for database-style joins (similar to SQL).
  • join(): Simplifies merging based on index values.
  • concat(): Stacks DataFrames vertically or horizontally.

Step 1: Understanding the Need for Merging & Joining

When working with multiple datasets, merging is required to:

  • Combine related data from different sources.
  • Integrate data based on common keys (e.g., Customer ID, Product ID).
  • Perform relational operations like inner joins, outer joins, left joins, and right joins.

Example:
We have two tables:
Orders Table

Order_IDCustomer_IDAmount
101C001250
102C002400
103C003150

Customers Table

Customer_IDNameCountry
C001JohnUSA
C002AliceCanada
C004MarkUK

To analyze orders along with customer details, we must merge these tables based on Customer_ID.


Step 2: Using merge() for SQL-like Joins

The merge() function is used to join two DataFrames based on common columns or indices.

Types of Joins in Pandas

Join TypeDescription
Inner JoinReturns only matching rows from both tables.
Left JoinReturns all rows from the left table and matching rows from the right.
Right JoinReturns all rows from the right table and matching rows from the left.
Outer JoinReturns all rows from both tables, filling missing values with NaN.

Example Dataset Creation

import pandas as pd

# Creating Orders DataFrame
orders = pd.DataFrame({
    'Order_ID': [101, 102, 103],
    'Customer_ID': ['C001', 'C002', 'C003'],
    'Amount': [250, 400, 150]
})

# Creating Customers DataFrame
customers = pd.DataFrame({
    'Customer_ID': ['C001', 'C002', 'C004'],
    'Name': ['John', 'Alice', 'Mark'],
    'Country': ['USA', 'Canada', 'UK']
})

1. Inner Join (how='inner')

Returns only matching records based on the key column.

inner_join = pd.merge(orders, customers, on='Customer_ID', how='inner')
print(inner_join)

Output:

Order_IDCustomer_IDAmountNameCountry
101C001250JohnUSA
102C002400AliceCanada

Observations:

  • The row with Customer_ID = C003 (from orders) is removed because it has no match in customers.
  • The row with Customer_ID = C004 (from customers) is removed because it has no match in orders.

2. Left Join (how='left')

Keeps all records from the left table (orders), and fills missing values from the right table with NaN.

left_join = pd.merge(orders, customers, on='Customer_ID', how='left')
print(left_join)

Output:

Order_IDCustomer_IDAmountNameCountry
101C001250JohnUSA
102C002400AliceCanada
103C003150NaNNaN

Observations:

  • The row with Customer_ID = C003 remains in the output but has NaN in Name and Country columns because it has no match in customers.

3. Right Join (how='right')

Keeps all records from the right table (customers) and fills missing values from the left table.

right_join = pd.merge(orders, customers, on='Customer_ID', how='right')
print(right_join)

Output:

Order_IDCustomer_IDAmountNameCountry
101C001250JohnUSA
102C002400AliceCanada
NaNC004NaNMarkUK

Observations:

  • The row with Customer_ID = C004 remains in the output but has NaN in Order_ID and Amount because it has no match in orders.

4. Outer Join (how='outer')

Returns all records from both tables, filling missing values with NaN.

outer_join = pd.merge(orders, customers, on='Customer_ID', how='outer')
print(outer_join)

Output:

Order_IDCustomer_IDAmountNameCountry
101C001250JohnUSA
102C002400AliceCanada
103C003150NaNNaN
NaNC004NaNMarkUK

Observations:

  • The output contains all records from both tables.
  • Missing values are filled with NaN.

Step 3: Using join() for Index-Based Merging

If the merge key is in the index, join() provides a more convenient way to combine datasets.

Example: Index-Based Join

customers.set_index('Customer_ID', inplace=True)
orders.set_index('Customer_ID', inplace=True)

joined_df = orders.join(customers, how='left')
print(joined_df)

Output:

Order_IDAmountNameCountry
101250JohnUSA
102400AliceCanada
103150NaNNaN

Observations:

  • join() merges on index values, simplifying the syntax.

Step 4: Using concat() for Concatenation

The concat() function stacks DataFrames either vertically (rows) or horizontally (columns).

Concatenating Rows (axis=0)

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

concat_rows = pd.concat([df1, df2], axis=0)
print(concat_rows)

Output:

AB
13
24
57
68

Concatenating Columns (axis=1)

concat_cols = pd.concat([df1, df2], axis=1)
print(concat_cols)

Output:

ABAB
1357
2468

Leave a Reply

Your email address will not be published. Required fields are marked *