Handling Categorical Data

Handling Categorical Data in Machine Learning Using Pandas

Introduction

Categorical data represents discrete values that belong to a limited set of categories or labels. It is common in real-world datasets, such as:

  • Gender: Male, Female, Other
  • Color: Red, Green, Blue
  • Education Level: High School, Bachelor’s, Master’s, PhD
  • City Names: New York, London, Tokyo

Most machine learning algorithms cannot work with categorical data directly. Instead, categorical values must be transformed into numerical representations. This guide will cover:

  1. Types of Categorical Data
  2. Identifying Categorical Data in Pandas
  3. Converting Categorical Data to Numerical Format
  4. Encoding Techniques (Label Encoding, One-Hot Encoding, etc.)
  5. Handling Missing Categorical Values
  6. Best Practices for Encoding

Step 1: Understanding Types of Categorical Data

Categorical data can be divided into two main types:

1. Nominal Data (No Order or Ranking)

  • Categories do not have an inherent order.
  • Example: Color (Red, Blue, Green), Gender (Male, Female).

2. Ordinal Data (Ordered or Ranked)

  • Categories have a meaningful order or ranking.
  • Example: Education Level (High School < Bachelor’s < Master’s < PhD), Satisfaction Level (Low < Medium < High).
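The nominal/ordinal distinction can be expressed directly in Pandas. The snippet below is a minimal sketch that uses pd.Categorical; the variable names and sample values are made up for illustration.

```python
import pandas as pd

# Nominal: no order is declared, so the categories cannot be ranked.
color = pd.Categorical(['Red', 'Green', 'Blue', 'Green'])

# Ordinal: the order is passed explicitly, lowest first.
satisfaction = pd.Categorical(
    ['Low', 'High', 'Medium', 'Low'],
    categories=['Low', 'Medium', 'High'],
    ordered=True,
)

print(color.ordered)          # False
print(satisfaction.ordered)   # True

# Ordered categoricals support min/max and comparisons.
print(satisfaction.min(), satisfaction.max())
print(satisfaction < 'High')
```

Declaring the order up front keeps later steps, such as sorting or integer encoding, consistent with the real ranking.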

Step 2: Identifying Categorical Data in Pandas

import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Education': ['Bachelor’s', 'Master’s', 'PhD', 'High School'],
    'City': ['New York', 'London', 'Tokyo', 'London']
}

df = pd.DataFrame(data)
print(df.dtypes)

Output:

Name         object
Gender       object
Education    object
City         object
dtype: object

Observation:

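  • By default, Pandas stores text columns as the generic object type (recent versions may report a dedicated string type instead). It does not mark them as categorical automatically.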
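Because the dtype alone does not say which columns are truly categorical, counting distinct values can help. The sketch below is one possible heuristic, not a fixed rule; df_demo, the Age column, and the "fewer distinct values than rows" threshold are assumptions made for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'City': ['New York', 'London', 'Tokyo', 'London'],
    'Age': [29, 35, 41, 23],
})

# Every non-numeric column is a candidate.
candidates = df_demo.select_dtypes(exclude='number').columns

# Count distinct values per candidate column.
cardinality = df_demo[candidates].nunique()
print(cardinality)

# Hypothetical rule of thumb: a column whose values repeat is likely
# categorical, while an all-unique column (like Name) is an identifier.
likely_categorical = cardinality[cardinality < len(df_demo)].index.tolist()
print(likely_categorical)
```

On a real dataset the threshold would need tuning, since a small sample can make any column look unique.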

Convert object columns to category type

df['Gender'] = df['Gender'].astype('category')
df['Education'] = df['Education'].astype('category')

print(df.dtypes)

Output:

Name           object
Gender       category
Education    category
City          object
dtype: object

Why Convert to category?

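  • Saves memory when a column has few distinct values relative to its length, because each value is stored as a small integer code.
  • Enables efficient categorical operations, such as grouping and ordered comparisons.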
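The memory saving can be checked with memory_usage. This is a small sketch on synthetic data; the exact byte counts depend on the Pandas version, so none are shown.

```python
import pandas as pd

# 30,000 rows but only three distinct city names.
cities = pd.Series(['New York', 'London', 'Tokyo'] * 10_000)
as_category = cities.astype('category')

# deep=True includes the string contents, not just the references.
print(cities.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# Internally, each value is an integer code into the category list.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head(3).tolist())
```

The gain shrinks as the number of distinct values approaches the number of rows.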

Step 3: Converting Categorical Data into Numeric Format

Most machine learning models require numerical input, so categorical columns must be encoded first. Let’s explore various encoding techniques.


Step 4: Encoding Techniques

1. Label Encoding (Ordinal Encoding)

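Assigns a unique integer to each category. Suitable for ordinal data only when the integers follow the real category order. Note that LabelEncoder assigns codes in sorted (alphabetical) order and is designed for target labels, so the codes in the output below do not follow the education ranking.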

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Education_LabelEncoded'] = le.fit_transform(df['Education'])
print(df[['Education', 'Education_LabelEncoded']])

Output:

    Education  Education_LabelEncoded
0  Bachelor’s                      0
1     Master’s                      2
2         PhD                      3
3  High School                      1

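Drawbacks:

  • Can introduce artificial ranking in nominal data.
  • The alphabetical codes may contradict the real order in ordinal data (here High School gets 1, above Bachelor’s at 0).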
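When the order matters, one option is to state it explicitly instead of relying on alphabetical codes. The following is a minimal sketch using an ordered pd.Categorical; the Education_Ordinal column name is an assumption for illustration.

```python
import pandas as pd

education = pd.Series(['Bachelor’s', 'Master’s', 'PhD', 'High School'])

# Lowest to highest, matching the ordinal example from Step 1.
order = ['High School', 'Bachelor’s', 'Master’s', 'PhD']

ordered = pd.Categorical(education, categories=order, ordered=True)

# .codes gives the position of each value in `order`
# (a value missing from `order` would get -1).
codes = pd.Series(ordered.codes, index=education.index)

print(pd.DataFrame({'Education': education, 'Education_Ordinal': codes}))
```

Here High School maps to 0 and PhD to 3, so the integers agree with the ranking.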

2. One-Hot Encoding (OHE)

Creates binary columns for each category. Suitable for nominal data.

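# dtype=int gives 0/1 columns (recent Pandas versions default to True/False)
encode_me = df.drop(columns=['Education_LabelEncoded'], errors='ignore')
df_one_hot = pd.get_dummies(encode_me, columns=['Gender', 'City'], dtype=int)
print(df_one_hot)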

Output:

     Name    Education  Gender_Female  Gender_Male  City_London  City_New York  City_Tokyo
0   Alice  Bachelor’s              1            0            0              1           0
1     Bob     Master’s              0            1            1              0           0
2  Charlie         PhD              0            1            0              0           1
3   David  High School              1            0            1              0           0

Advantages:

  • No artificial ranking.
  • Works well for nominal categorical features.

Disadvantages:

  • Curse of dimensionality (Too many columns for high-cardinality data).

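Solution:

  • Use drop_first=True to drop one redundant column per feature. This removes only a single column per feature, so for high-cardinality data consider frequency or target encoding (described below) instead.

df_one_hot = pd.get_dummies(df, columns=['Gender', 'City'], drop_first=True)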
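A practical issue with pd.get_dummies is that new data may contain categories the training data never had, which changes the set of columns. The sketch below shows one way to keep the columns aligned; the train and new_data frames are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'City': ['New York', 'London', 'Tokyo']})
new_data = pd.DataFrame({'City': ['London', 'Paris']})  # 'Paris' was never seen

train_encoded = pd.get_dummies(train, columns=['City'], dtype=int)

# Encode the new rows, then force the same columns in the same order.
# Unknown categories are dropped and missing columns are filled with 0.
new_encoded = pd.get_dummies(new_data, columns=['City'], dtype=int)
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(new_encoded)
```

The unseen city ends up as a row of zeros, which a model trained on the original columns can still consume.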

3. Frequency Encoding

Replaces categories with their frequency.

df['City_Frequency'] = df['City'].map(df['City'].value_counts())
print(df[['City', 'City_Frequency']])

Output:

    City  City_Frequency
0  New York              1
1    London              2
2    Tokyo              1
3    London              2

Useful for:

  • High-cardinality features with repeated values.
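A common variation uses relative frequencies instead of raw counts and learns them from the training data only. The sketch below assumes two hypothetical series, train_city and new_city, and maps unseen categories to 0.

```python
import pandas as pd

train_city = pd.Series(['New York', 'London', 'Tokyo', 'London'])
new_city = pd.Series(['London', 'Paris'])  # 'Paris' is not in the training data

# Share of training rows per city.
freq = train_city.value_counts(normalize=True)

train_freq = train_city.map(freq)
new_freq = new_city.map(freq).fillna(0.0)  # unseen category -> 0

print(train_freq.tolist())  # [0.25, 0.5, 0.25, 0.5]
print(new_freq.tolist())    # [0.5, 0.0]
```

Note that two categories with the same frequency become indistinguishable after this encoding.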

4. Target Encoding (Mean Encoding)

Replaces categories with the mean of the target variable.

Example: If predicting salary based on city, we replace city with the average salary in that city.

target = {'New York': 70000, 'London': 65000, 'Tokyo': 60000}
df['City_TargetEncoded'] = df['City'].map(target)
print(df[['City', 'City_TargetEncoded']])

Output:

    City  City_TargetEncoded
0  New York             70000
1    London             65000
2    Tokyo             60000
3    London             65000

Used in:

  • High-cardinality categorical features.
  • Tree-based models (like XGBoost, Random Forest).

🚨 Warning:

  • Can lead to data leakage if the category means are computed on the full dataset. Split the data first, then compute the means from the training set only.
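To illustrate the leakage warning, the sketch below computes the category means from a training frame only and reuses them on test data. The salary figures and the fallback to the global training mean for unseen cities are assumptions for illustration, not the only option.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame({
    'City': ['New York', 'London', 'Tokyo', 'London'],
    'Salary': [72000, 66000, 60000, 64000],
})
test = pd.DataFrame({'City': ['London', 'Paris']})

# Learn the encoding from the training rows only.
city_means = train.groupby('City')['Salary'].mean()
global_mean = train['Salary'].mean()

train['City_TargetEncoded'] = train['City'].map(city_means)

# Unseen cities fall back to the overall training mean.
test['City_TargetEncoded'] = test['City'].map(city_means).fillna(global_mean)

print(test)
```

Training rows here are still encoded with means that include their own target value; computing the means out-of-fold is a common refinement that this sketch does not show.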

Step 5: Handling Missing Categorical Values

1. Fill Missing Values with Mode (Most Frequent Category)

# Assign the result back; inplace=True on a selected column is unreliable in recent Pandas
df['City'] = df['City'].fillna(df['City'].mode()[0])

2. Assign a New Category (Unknown)

df['City'] = df['City'].fillna('Unknown')

3. Use Frequency Encoding Before Filling Missing Values

# value_counts() ignores missing values, so they map to NaN and are then set to 0
df['City_Frequency'] = df['City'].map(df['City'].value_counts())
df['City_Frequency'] = df['City_Frequency'].fillna(0)
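The sample dataset in this guide has no missing values, so the snippets above leave it unchanged. The sketch below applies the first two strategies to a small hypothetical series with one gap, and adds a missing-value indicator as an assumed extra that is often kept alongside the filled column.

```python
import pandas as pd

city = pd.Series(['New York', 'London', None, 'London'])

# Strategy 1: fill with the most frequent category.
filled_mode = city.fillna(city.mode()[0])

# Strategy 2: treat "missing" as its own category.
filled_unknown = city.fillna('Unknown')

# Optional indicator: 1 where the original value was missing.
was_missing = city.isna().astype(int)

print(filled_mode.tolist())
print(filled_unknown.tolist())
print(was_missing.tolist())
```

Filling with the mode hides the fact that a value was missing, while an explicit 'Unknown' category or an indicator column preserves it.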

Step 6: Best Practices for Encoding Categorical Data

Encoding Method    | Suitable For                         | Pros                         | Cons
-------------------|--------------------------------------|------------------------------|---------------------------
Label Encoding     | Ordinal Data                         | Simple, No extra columns     | Adds ranking to categories
One-Hot Encoding   | Nominal Data                         | No ranking issue             | Increases dimensions
Frequency Encoding | High-Cardinality Data                | Keeps original feature count | May lose interpretability
Target Encoding    | High-Cardinality Data (Target-Based) | Useful for tree models       | Can cause data leakage
