Data Encoding Techniques (One-Hot Encoding, Label Encoding)

Introduction to Data Encoding

Data encoding is a crucial preprocessing step in machine learning, where categorical data is converted into a numerical format that models can understand. Most machine learning algorithms require numerical input, so categorical data needs to be transformed accordingly.

There are multiple encoding techniques, but the two most commonly used are:

  1. Label Encoding
  2. One-Hot Encoding

Each method has its use case, advantages, and disadvantages. Below, I will discuss each in detail with step-by-step explanations.


1. Label Encoding

Label Encoding is a technique used to convert categorical values into numerical values by assigning a unique integer to each category.

Step-by-Step Process of Label Encoding

Step 1: Identify Categorical Data

Categorical data consists of non-numeric values such as color names, city names, or product types.

Example: Let’s assume we have a dataset with a categorical column “Color”:

Color
Red
Blue
Green
Blue
Red

Step 2: Assign Unique Numeric Values

Each unique category is assigned an integer value:

  • Red → 0
  • Blue → 1
  • Green → 2

Step 3: Replace Categorical Values with Numbers

The categorical column is now replaced with its corresponding numerical values:

Color    Encoded Color
Red      0
Blue     1
Green    2
Blue     1
Red      0
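
Steps 2 and 3 can also be done by hand, which makes clear that the integers are just arbitrary labels. The following is a minimal sketch in plain Python, assuming codes are handed out in order of first appearance (the variable names are illustrative, not part of any library):

```python
# Same sample column as in the example above
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Step 2: give each new category the next free integer,
# in order of first appearance (an assumption of this sketch)
color_to_int = {}
for color in colors:
    if color not in color_to_int:
        color_to_int[color] = len(color_to_int)

# Step 3: replace every value with its integer code
encoded = [color_to_int[color] for color in colors]

print(color_to_int)
print(encoded)
```

Any other one-to-one assignment of integers would be an equally valid label encoding; only the consistency of the mapping matters.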

Step 4: Apply Label Encoding Using Python

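In Python, we can use LabelEncoder from the sklearn.preprocessing module. Note that scikit-learn documents LabelEncoder as a tool for target labels (y); for input features, OrdinalEncoder from the same module does the equivalent job.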

from sklearn.preprocessing import LabelEncoder

# Sample data
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_colors = label_encoder.fit_transform(colors)

print(encoded_colors)

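Output:

[2 0 1 0 2]

LabelEncoder sorts the classes before numbering them (Blue → 0, Green → 1, Red → 2), so the result differs from the hand-assigned mapping in Step 2.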
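
To check which integer each category received, one option is to inspect the fitted encoder's classes_ attribute, and inverse_transform maps the integers back to the original strings. A minimal sketch reusing the sample data:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
encoded_colors = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order;
# the position of each category is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original category names
decoded = [str(label) for label in label_encoder.inverse_transform(encoded_colors)]
print(decoded)
```

Keeping the fitted encoder around is what allows predictions to be translated back into readable category names later.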

Advantages of Label Encoding

  • Simple and easy to implement.
  • Works well with ordinal data (where order matters, e.g., low, medium, high), provided the integers are assigned in the same order as the categories.
  • Requires less memory compared to one-hot encoding.
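
For ordinal data, the integer order has to match the real order of the categories. LabelEncoder numbers categories alphabetically, which would place "high" before "low". One way around this is scikit-learn's OrdinalEncoder with an explicit category order. The sketch below uses a made-up "sizes" column purely for illustration:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (one column, so each row is a one-item list)
sizes = [['low'], ['high'], ['medium'], ['low']]

# Spell out the intended order so that low < medium < high
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())
```

With the order stated explicitly, low becomes 0, medium 1, and high 2, so comparisons between the encoded values are meaningful.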

Disadvantages of Label Encoding

  • May introduce an unintended ordinal relationship between categories.
  • The model may mistakenly assume a numerical relationship (e.g., Red < Blue < Green), leading to incorrect predictions.
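
A small numeric check makes the problem concrete. In this sketch (plain Python, using the Red = 0, Blue = 1, Green = 2 mapping from Step 2), label encoding makes Red look twice as far from Green as from Blue, while the one-hot vectors covered in the next section keep every pair of colors equally far apart:

```python
# Label codes from Step 2
label = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot vectors for the same three categories
one_hot = {"Red": (1, 0, 0), "Blue": (0, 1, 0), "Green": (0, 0, 1)}

def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label[a] - label[b])

def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return sum((x - y) ** 2 for x, y in zip(one_hot[a], one_hot[b])) ** 0.5

print(label_distance("Red", "Blue"), label_distance("Red", "Green"))
print(one_hot_distance("Red", "Blue"), one_hot_distance("Red", "Green"))
```

Distance-based and linear models act on exactly these differences, which is why an arbitrary integer order can distort their predictions.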

2. One-Hot Encoding

One-Hot Encoding is an alternative technique where each category is represented as a separate binary column (0 or 1).

Step-by-Step Process of One-Hot Encoding

Step 1: Identify Categorical Data

Using the same “Color” example:

Color
Red
Blue
Green
Blue
Red

Step 2: Create Binary Columns

A new column is created for each unique category, and values are assigned as follows:

Color    Red    Blue    Green
Red      1      0       0
Blue     0      1       0
Green    0      0       1
Blue     0      1       0
Red      1      0       0

Each category now has its own column, and every row contains exactly one 1, placed in the column that matches its original value.
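
The table can be reproduced by hand with a comparison per category. This is a minimal plain-Python sketch, with the column order (Red, Blue, Green) chosen to match the table rather than taken from any library:

```python
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Column order used in the table above
categories = ["Red", "Blue", "Green"]

# For each row, write 1 in the column that matches its value and 0 elsewhere
rows = [[1 if color == category else 0 for category in categories]
        for color in colors]

for color, row in zip(colors, rows):
    print(color, row)
```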

Step 3: Apply One-Hot Encoding Using Python

We can use OneHotEncoder from sklearn.preprocessing or pd.get_dummies from pandas.

Using pandas

import pandas as pd

# Sample data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Apply One-Hot Encoding (dtype=int gives 0/1; pandas 2.0+ defaults to True/False)
encoded_df = pd.get_dummies(df, columns=['Color'], dtype=int)

print(encoded_df)

Output:

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1

Using OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# Sample data
colors = [['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']]

# Initialize OneHotEncoder (sparse_output replaced the older sparse argument in scikit-learn 1.2)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_colors = one_hot_encoder.fit_transform(colors)

print(encoded_colors)

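Output:

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

The columns follow the sorted category order (Blue, Green, Red), the same order pd.get_dummies produced above.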
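
In practice the encoder is fitted on training data and then reused on new data, which may contain categories it has never seen. The sketch below shows one way to handle that with the handle_unknown='ignore' option; the "Purple" value and the variable names are made up for illustration:

```python
from sklearn.preprocessing import OneHotEncoder

train_colors = [['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']]
new_colors = [['Green'], ['Purple']]  # 'Purple' was not seen during fit

# With handle_unknown='ignore', unseen categories become an all-zero row
# instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_colors)

# categories_ lists the learned categories in column order
columns = [str(category) for category in encoder.categories_[0]]
print(columns)

# The default output is a sparse matrix; toarray() converts it to a dense array
encoded_new = encoder.transform(new_colors).toarray()
print(encoded_new)
```

Whether an all-zero row is acceptable depends on the model; the default behaviour (handle_unknown='error') is the safer choice when unseen categories indicate a data problem.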

Advantages of One-Hot Encoding

  • Avoids the ordinal relationship problem in label encoding.
  • Works well for categorical data where no intrinsic order exists.

Disadvantages of One-Hot Encoding

  • Can create too many columns if there are many unique categories (high dimensionality).
  • Requires more memory and computational power.
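
The growth in dimensionality is easy to see on a synthetic column. This sketch assumes a hypothetical "City" feature with 1,000 distinct values and compares the one-hot result with a single column of integer codes:

```python
import pandas as pd

# Hypothetical column with 1,000 distinct category values
df = pd.DataFrame({'City': [f'city_{i}' for i in range(1000)]})

# One-hot encoding creates one column per distinct value
one_hot = pd.get_dummies(df, columns=['City'])

# Integer codes keep everything in a single column
codes = df['City'].astype('category').cat.codes

print(one_hot.shape)
print(codes.shape)
```

For columns like this, sparse output (the default for OneHotEncoder) keeps the memory cost manageable, since almost every entry is zero.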

When to Use Label Encoding vs. One-Hot Encoding

Criteria                Label Encoding                                       One-Hot Encoding
Data Type               Ordinal (e.g., Low, Medium, High)                    Nominal (e.g., Color, City)
Number of Categories    Handles many unique values                           Best with few unique values
Computational Cost      Low                                                  High (if too many categories)
Memory Usage            Efficient                                            Can be high if many categories
Model Type              Tree-based models (Decision Trees, Random Forest)    Neural Networks, Linear Models
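
Real datasets often mix both kinds of columns, so the two techniques can be applied side by side. The following is a rough sketch using scikit-learn's ColumnTransformer; the "Size" and "Color" columns and the transformer names are assumptions made for this example:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical dataset with one ordinal and one nominal column
df = pd.DataFrame({
    'Size': ['Low', 'High', 'Medium', 'Low'],    # ordinal
    'Color': ['Red', 'Blue', 'Green', 'Blue'],   # nominal
})

preprocessor = ColumnTransformer(
    transformers=[
        # Integer codes that respect the order Low < Medium < High
        ('size', OrdinalEncoder(categories=[['Low', 'Medium', 'High']]), ['Size']),
        # One binary column per color
        ('color', OneHotEncoder(), ['Color']),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(df)
print(encoded)
```

The first output column holds the ordinal code for Size, followed by one column each for Blue, Green, and Red.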
