Handling Categorical Data in Machine Learning Using Pandas
Introduction
Categorical data represents discrete values that belong to a limited set of categories or labels. It is common in real-world datasets, such as:
- Gender: Male, Female, Other
- Color: Red, Green, Blue
- Education Level: High School, Bachelor’s, Master’s, PhD
- City Names: New York, London, Tokyo
Most machine learning algorithms cannot work with categorical data directly. Instead, categorical values must be transformed into numerical representations. This guide will cover:
- Types of Categorical Data
- Identifying Categorical Data in Pandas
- Converting Categorical Data to Numerical Format
- Encoding Techniques (Label Encoding, One-Hot Encoding, etc.)
- Handling Missing Categorical Values
- Feature Engineering for Categorical Data
- Best Practices for Encoding
Step 1: Understanding Types of Categorical Data
Categorical data can be divided into two main types:
1. Nominal Data (No Order or Ranking)
- Categories do not have an inherent order.
- Example: Color (Red, Blue, Green), Gender (Male, Female).
2. Ordinal Data (Ordered or Ranked)
- Categories have a meaningful order or ranking.
- Example: Education Level (High School < Bachelor’s < Master’s < PhD), Satisfaction Level (Low < Medium < High).
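The distinction can be made explicit in pandas. The following is a minimal sketch, assuming the color and satisfaction values listed above; the variable names are illustrative only. An ordered categorical records the ranking, while an unordered one does not.

```python
import pandas as pd

# Nominal: categories carry no order
colors = pd.Categorical(['Red', 'Blue', 'Green', 'Blue'])

# Ordinal: declare the ranking explicitly, lowest to highest
levels = ['Low', 'Medium', 'High']
satisfaction = pd.Categorical(['Medium', 'Low', 'High', 'Low'],
                              categories=levels, ordered=True)

print(colors.ordered)                # False
print(satisfaction.ordered)          # True
print(satisfaction.codes.tolist())   # [1, 0, 2, 0]
print(satisfaction.min(), satisfaction.max())  # Low High
```

Because the order is stored with the data, the integer codes follow the declared ranking rather than alphabetical order.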
Step 2: Identifying Categorical Data in Pandas
import pandas as pd
# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Education': ['Bachelor’s', 'Master’s', 'PhD', 'High School'],
    'City': ['New York', 'London', 'Tokyo', 'London']
}
df = pd.DataFrame(data)
print(df.dtypes)
Output:
Name object
Gender object
Education object
City object
dtype: object
✅ Observation:
- By default, pandas stores text columns, including categorical ones, with the generic object dtype rather than a dedicated categorical type.
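Not every text column is a useful category. As a rough sketch, one way to shortlist candidates is to select the non-numeric columns and keep those whose values repeat. The Age column and the "fewer unique values than rows" rule are assumptions added for illustration, not part of the dataset above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'City': ['New York', 'London', 'Tokyo', 'London'],
    'Age': [29, 35, 41, 23],  # hypothetical numeric column
})

# Non-numeric columns are candidates for categorical treatment
text_cols = df_demo.select_dtypes(exclude='number').columns.tolist()

# Columns whose values repeat behave like categories; a column where
# every value is unique (such as Name) is more likely an identifier
repeated = [c for c in text_cols if df_demo[c].nunique() < len(df_demo)]

print(text_cols)  # ['Name', 'Gender', 'City']
print(repeated)   # ['Gender', 'City']
```

On real data a fixed threshold on the number of unique values may be a better heuristic than comparing against the row count.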
Convert object columns to category type
df['Gender'] = df['Gender'].astype('category')
df['Education'] = df['Education'].astype('category')
print(df.dtypes)
Output:
Name object
Gender category
Education category
City object
dtype: object
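✅ Why Convert to category?
- Saves memory when a column has few distinct values relative to its number of rows, because each value is stored as a small integer code.
- Enables efficient categorical operations, such as access to the categories and their codes through the .cat accessor.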
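The memory claim can be checked directly. This is a small sketch on a hypothetical column of three city names repeated many times; exact byte counts depend on the pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical column: three city names repeated 10,000 times each
cities = pd.Series(['New York', 'London', 'Tokyo'] * 10_000)
as_category = cities.astype('category')

# deep=True includes the memory used by the string values themselves
bytes_before = cities.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)

print(bytes_before, bytes_after)
print(as_category.cat.categories.tolist())  # ['London', 'New York', 'Tokyo']
```

The saving shrinks as the number of distinct values approaches the number of rows, so converting a column of mostly unique strings is unlikely to help.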
Step 3: Converting Categorical Data into Numeric Format
Machine learning models require numerical values. Let’s explore various encoding techniques.
Step 4: Encoding Techniques
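1. Label Encoding (Ordinal Encoding)
Assigns a unique integer to each category. Suitable for ordinal data, provided the integers follow the category order. Note that LabelEncoder assigns integers in sorted (alphabetical) order, so the codes in the output below do not follow the education ranking, and scikit-learn intends it for target labels rather than input features.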
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Education_LabelEncoded'] = le.fit_transform(df['Education'])
print(df[['Education', 'Education_LabelEncoded']])
Output:
     Education  Education_LabelEncoded
0   Bachelor’s                       0
1     Master’s                       2
2          PhD                       3
3  High School                       1
✅ Drawback:
- Can introduce artificial ranking in nominal data.
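For ordinal data, the integer codes should follow the real ranking. The following is a minimal pandas-only sketch, assuming the education order given in Step 1 (High School < Bachelor’s < Master’s < PhD); the variable names are illustrative.

```python
import pandas as pd

education = pd.Series(['Bachelor’s', 'Master’s', 'PhD', 'High School'])

# Explicit ranking, lowest to highest
order = ['High School', 'Bachelor’s', 'Master’s', 'PhD']

# An ordered categorical assigns codes by position in `order`
ordered = pd.Categorical(education, categories=order, ordered=True)
education_encoded = pd.Series(ordered.codes, index=education.index)

print(education_encoded.tolist())  # [1, 2, 3, 0]
```

Values that do not appear in the order list receive the code -1, so it is worth checking for them before training. scikit-learn's OrdinalEncoder accepts an explicit category order through its categories parameter and may be a better fit inside a pipeline.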
2. One-Hot Encoding (OHE)
Creates binary columns for each category. Suitable for nominal data.
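# Encode only the original columns; dtype=int yields 1/0 (recent pandas versions default to True/False)
df_one_hot = pd.get_dummies(df[['Name', 'Education', 'Gender', 'City']], columns=['Gender', 'City'], dtype=int)
print(df_one_hot)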
Output:
      Name    Education  Gender_Female  Gender_Male  City_London  City_New York  City_Tokyo
0    Alice   Bachelor’s              1            0            0              1           0
1      Bob     Master’s              0            1            1              0           0
2  Charlie          PhD              0            1            0              0           1
3    David  High School              1            0            1              0           0
✅ Advantages:
- No artificial ranking.
- Works well for nominal categorical features.
✅ Disadvantages:
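- Curse of dimensionality (too many columns for high-cardinality data).
✅ Solution:
- Use drop_first=True to drop one redundant column per feature. This avoids perfectly correlated dummy columns, which matters for linear models, but it does not solve the high-cardinality problem; for that, consider frequency or target encoding.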
df_one_hot = pd.get_dummies(df, columns=['Gender', 'City'], drop_first=True)
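To see what drop_first=True removes, the sketch below encodes the City values from the sample dataset with and without it. The dtype=int argument is added here only so the columns print as 1/0.

```python
import pandas as pd

city = pd.DataFrame({'City': ['New York', 'London', 'Tokyo', 'London']})

full = pd.get_dummies(city, columns=['City'], dtype=int)
reduced = pd.get_dummies(city, columns=['City'], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['City_London', 'City_New York', 'City_Tokyo']
print(reduced.columns.tolist())  # ['City_New York', 'City_Tokyo']

# No information is lost: London is the row where all remaining columns are 0
is_london = reduced.sum(axis=1) == 0
print(is_london.tolist())  # [False, True, False, True]
```

The first category in sorted order (London here) becomes the implicit baseline.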
3. Frequency Encoding
Replaces categories with their frequency.
df['City_Frequency'] = df['City'].map(df['City'].value_counts())
print(df[['City', 'City_Frequency']])
Output:
       City  City_Frequency
0  New York               1
1    London               2
2     Tokyo               1
3    London               2
✅ Useful for:
- High-cardinality features with repeated values.
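Raw counts depend on the dataset size. A common variant, sketched here on the same City values, uses relative frequencies instead by passing normalize=True to value_counts.

```python
import pandas as pd

city = pd.Series(['New York', 'London', 'Tokyo', 'London'], name='City')

# Share of rows per category instead of raw counts
city_share = city.map(city.value_counts(normalize=True))

print(city_share.tolist())  # [0.25, 0.5, 0.25, 0.5]
```

Either way, categories that occur equally often (New York and Tokyo here) receive the same value and can no longer be told apart by the model.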
4. Target Encoding (Mean Encoding)
Replaces categories with the mean of the target variable.
Example: If predicting salary based on city, we replace city with the average salary in that city.
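# Illustrative per-city average salaries, hard-coded for this example;
# in practice, compute them from the training data
target = {'New York': 70000, 'London': 65000, 'Tokyo': 60000}
df['City_TargetEncoded'] = df['City'].map(target)
print(df[['City', 'City_TargetEncoded']])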
Output:
       City  City_TargetEncoded
0  New York               70000
1    London               65000
2     Tokyo               60000
3    London               65000
✅ Used in:
- High-cardinality categorical features.
- Tree-based models (like XGBoost, Random Forest).
🚨 Warning:
- Can lead to data leakage if the means are computed on the full dataset. Split the data first and compute the means from the training set only.
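One way to avoid the leakage is to learn the per-category means from the training rows and only look them up for other rows. The sketch below assumes a hypothetical train/test split with made-up salaries, and falls back to the overall training mean for cities not seen in training; that fallback is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical training and test rows; the salaries are made up for illustration
train = pd.DataFrame({
    'City': ['New York', 'London', 'Tokyo', 'London'],
    'Salary': [80000, 64000, 60000, 66000],
})
test = pd.DataFrame({'City': ['London', 'Paris']})

# Learn the per-city mean from the training rows only
city_means = train.groupby('City')['Salary'].mean()
global_mean = train['Salary'].mean()

train['City_TargetEncoded'] = train['City'].map(city_means)
# Cities unseen in training (Paris) fall back to the overall training mean
test['City_TargetEncoded'] = test['City'].map(city_means).fillna(global_mean)

print(train['City_TargetEncoded'].tolist())  # [80000.0, 65000.0, 60000.0, 65000.0]
print(test['City_TargetEncoded'].tolist())   # [65000.0, 67500.0]
```

With few rows per category the means are noisy, so smoothing or out-of-fold encoding is often applied on top of this basic version.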
Step 5: Handling Missing Categorical Values
1. Fill Missing Values with Mode (Most Frequent Category)
df['City'] = df['City'].fillna(df['City'].mode()[0])
2. Assign a New Category (Unknown)
df['City'] = df['City'].fillna('Unknown')
3. Use Frequency Encoding Before Filling Missing Values
df['City_Frequency'] = df['City'].map(df['City'].value_counts())
df['City_Frequency'] = df['City_Frequency'].fillna(0)
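The sample dataset has no missing values, so the three options are easier to compare on a column that does. This sketch assumes a City column with one missing entry and assigns the results back instead of modifying the column in place.

```python
import numpy as np
import pandas as pd

city = pd.Series(['New York', 'London', np.nan, 'London'], name='City')

# Option 1: most frequent category
filled_mode = city.fillna(city.mode()[0])

# Option 2: explicit placeholder category
filled_unknown = city.fillna('Unknown')

# Option 3: frequency-encode first; missing rows have no count, so they become 0
city_frequency = city.map(city.value_counts()).fillna(0).astype(int)

print(filled_mode.tolist())     # ['New York', 'London', 'London', 'London']
print(filled_unknown.tolist())  # ['New York', 'London', 'Unknown', 'London']
print(city_frequency.tolist())  # [1, 2, 0, 2]
```

If the column has already been converted to the category dtype, the placeholder must be registered first, for example with cat.add_categories('Unknown'), before it can be used as a fill value.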
Step 6: Best Practices for Encoding Categorical Data
Encoding Method | Suitable For | Pros | Cons |
---|---|---|---|
Label Encoding | Ordinal Data | Simple, No extra columns | Adds ranking to categories |
One-Hot Encoding | Nominal Data | No ranking issue | Increases dimensions |
Frequency Encoding | High-Cardinality Data | Keeps original feature count | May lose interpretability |
Target Encoding | High-Cardinality Data (Target-Based) | Useful for tree models | Can cause data leakage |