Data Encoding Techniques: One-Hot Encoding & Label Encoding
Introduction to Data Encoding
Data encoding is a crucial preprocessing step in machine learning, where categorical data is converted into a numerical format that models can understand. Most machine learning algorithms require numerical input, so categorical data needs to be transformed accordingly.
There are multiple encoding techniques, but the two most commonly used are:
- Label Encoding
- One-Hot Encoding
Each method has its use case, advantages, and disadvantages. Below, I will discuss each in detail with step-by-step explanations.
1. Label Encoding
Label Encoding is a technique used to convert categorical values into numerical values by assigning a unique integer to each category.
Step-by-Step Process of Label Encoding
Step 1: Identify Categorical Data
Categorical data consists of non-numeric values such as color names, city names, or product types.
Example: Let’s assume we have a dataset with a categorical column “Color”:
Color |
---|
Red |
Blue |
Green |
Blue |
Red |
Step 2: Assign Unique Numeric Values
Each unique category is assigned an integer value. Note that scikit-learn's LabelEncoder sorts the categories alphabetically before assigning codes, so the mapping is:
- Blue → 0
- Green → 1
- Red → 2
Step 3: Replace Categorical Values with Numbers
The categorical column is now replaced with its corresponding numerical values:
Color | Encoded Color |
---|---|
Red | 2 |
Blue | 0 |
Green | 1 |
Blue | 0 |
Red | 2 |
Step 4: Apply Label Encoding Using Python
In Python, we can use `LabelEncoder` from the `sklearn.preprocessing` module.
```python
from sklearn.preprocessing import LabelEncoder

# Sample data
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_colors = label_encoder.fit_transform(colors)
print(encoded_colors)
```
Output:
```
[2 0 1 0 2]
```
Because LabelEncoder assigns codes alphabetically (Blue → 0, Green → 1, Red → 2), the codes follow the sorted order of the categories, not their order of appearance.
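The fitted encoder also stores the mapping it learned, so integer codes can be mapped back to the original labels with `inverse_transform` — useful when interpreting model output. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ holds the learned categories in sorted order;
# each category's position in this array is its integer code
print(label_encoder.classes_)   # ['Blue' 'Green' 'Red']

# Map the integer codes back to the original labels
decoded = label_encoder.inverse_transform(encoded)
print(decoded)                  # ['Red' 'Blue' 'Green' 'Blue' 'Red']
```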
Advantages of Label Encoding
- Simple and easy to implement.
- Works well with ordinal data (where order matters, e.g., low, medium, high).
- Requires less memory compared to one-hot encoding.
Disadvantages of Label Encoding
- May introduce an unintended ordinal relationship between categories.
- The model may mistakenly assume a numerical relationship (e.g., Red < Blue < Green), leading to incorrect predictions.
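When the data really is ordinal, the integer codes should reflect the intended order rather than alphabetical order. scikit-learn's `OrdinalEncoder` accepts an explicit `categories` list for this purpose — a sketch using a hypothetical "Size" column:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column where Low < Medium < High matters
sizes = [['Low'], ['High'], ['Medium'], ['Low']]

# Pass the categories in their meaningful order so that
# Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())   # [0. 2. 1. 0.]
```

Unlike `LabelEncoder` (which is intended for target labels and takes a 1-D array), `OrdinalEncoder` works on 2-D feature matrices and can encode several columns at once.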
2. One-Hot Encoding
One-Hot Encoding is an alternative technique where each category is represented as a separate binary column (0 or 1).
Step-by-Step Process of One-Hot Encoding
Step 1: Identify Categorical Data
Using the same “Color” example:
Color |
---|
Red |
Blue |
Green |
Blue |
Red |
Step 2: Create Binary Columns
A new column is created for each unique category, and values are assigned as follows:
Color | Red | Blue | Green |
---|---|---|---|
Red | 1 | 0 | 0 |
Blue | 0 | 1 | 0 |
Green | 0 | 0 | 1 |
Blue | 0 | 1 | 0 |
Red | 1 | 0 | 0 |
Each original category is now transformed into multiple columns with binary values.
Step 3: Apply One-Hot Encoding Using Python
We can use `OneHotEncoder` from `sklearn.preprocessing` or `pd.get_dummies` from pandas.
Using pandas
```python
import pandas as pd

# Sample data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Apply One-Hot Encoding; dtype=int gives 0/1 columns instead of
# the boolean columns newer pandas versions return by default
encoded_df = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded_df)
```
Output:
```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1
```
Using OneHotEncoder
```python
from sklearn.preprocessing import OneHotEncoder

# Sample data (2-D: one column, one row per sample)
colors = [['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']]

# Initialize OneHotEncoder; sparse_output=False returns a dense
# array (scikit-learn >= 1.2; older versions use sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_colors = one_hot_encoder.fit_transform(colors)
print(encoded_colors)
```
Output (columns follow the alphabetical category order Blue, Green, Red):
```
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
```
Advantages of One-Hot Encoding
- Avoids the ordinal relationship problem in label encoding.
- Works well for categorical data where no intrinsic order exists.
Disadvantages of One-Hot Encoding
- Can create too many columns if there are many unique categories (high dimensionality).
- Requires more memory and computational power.
When to Use Label Encoding vs. One-Hot Encoding
Criteria | Label Encoding | One-Hot Encoding |
---|---|---|
Data Type | Ordinal (e.g., Low, Medium, High) | Nominal (e.g., Color, City) |
Number of Categories | Can handle many unique values | Best with few unique values |
Computational Cost | Low | High (if too many categories) |
Memory Usage | Efficient | Can be high if many categories |
Model Type | Tree-based models (Decision Trees, Random Forest) | Neural Networks, Linear Models |