Data Encoding Techniques: One-Hot Encoding & Label Encoding
Introduction to Data Encoding
Data encoding is a crucial preprocessing step in machine learning, where categorical data is converted into a numerical format that models can understand. Most machine learning algorithms require numerical input, so categorical data needs to be transformed accordingly.
There are multiple encoding techniques, but the two most commonly used are:
- Label Encoding
- One-Hot Encoding
Each method has its use case, advantages, and disadvantages. Below, I will discuss each in detail with step-by-step explanations.
1. Label Encoding
Label Encoding is a technique used to convert categorical values into numerical values by assigning a unique integer to each category.
Step-by-Step Process of Label Encoding
Step 1: Identify Categorical Data
Categorical data consists of non-numeric values such as color names, city names, or product types.
Example: Let’s assume we have a dataset with a categorical column “Color”:
Color |
---|
Red |
Blue |
Green |
Blue |
Red |
Step 2: Assign Unique Numeric Values
Each unique category is assigned an integer value. Note that scikit-learn's LabelEncoder sorts the categories alphabetically before assigning codes, so the mapping is:
- Blue → 0
- Green → 1
- Red → 2
Step 3: Replace Categorical Values with Numbers
The categorical column is now replaced with its corresponding numerical values:
Color | Encoded Color |
---|---|
Red | 2 |
Blue | 0 |
Green | 1 |
Blue | 0 |
Red | 2 |
Step 4: Apply Label Encoding Using Python
In Python, we can use `LabelEncoder` from the `sklearn.preprocessing` module.
```python
from sklearn.preprocessing import LabelEncoder

# Sample data
colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_colors = label_encoder.fit_transform(colors)
print(encoded_colors)
```
Output:
```
[2 0 1 0 2]
```
Because LabelEncoder assigns codes alphabetically (Blue → 0, Green → 1, Red → 2), the codes follow the sorted order of the categories, not their order of appearance.
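The fitted encoder also stores the mapping it learned, so integer codes can be mapped back to the original labels with `inverse_transform` — useful when interpreting model output. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ holds the learned categories in sorted order;
# each category's position in this array is its integer code
print(label_encoder.classes_)   # ['Blue' 'Green' 'Red']

# Map the integer codes back to the original labels
decoded = label_encoder.inverse_transform(encoded)
print(decoded)                  # ['Red' 'Blue' 'Green' 'Blue' 'Red']
```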
Advantages of Label Encoding
- Simple and easy to implement.
- Works well with ordinal data (where order matters, e.g., low, medium, high).
- Requires less memory compared to one-hot encoding.
Disadvantages of Label Encoding
- May introduce an unintended ordinal relationship between categories.
- The model may mistakenly assume a numerical relationship (e.g., Red < Blue < Green), leading to incorrect predictions.
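When the data really is ordinal, the integer codes should reflect the intended order rather than alphabetical order. scikit-learn's `OrdinalEncoder` accepts an explicit `categories` list for this purpose — a sketch using a hypothetical "Size" column:

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column where Low < Medium < High matters
sizes = [['Low'], ['High'], ['Medium'], ['Low']]

# Pass the categories in their meaningful order so that
# Low -> 0, Medium -> 1, High -> 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())   # [0. 2. 1. 0.]
```

Unlike `LabelEncoder` (which is intended for target labels and takes a 1-D array), `OrdinalEncoder` works on 2-D feature matrices and can encode several columns at once.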
2. One-Hot Encoding
One-Hot Encoding is an alternative technique where each category is represented as a separate binary column (0 or 1).
Step-by-Step Process of One-Hot Encoding
Step 1: Identify Categorical Data
Using the same “Color” example:
Color |
---|
Red |
Blue |
Green |
Blue |
Red |
Step 2: Create Binary Columns
A new column is created for each unique category, and values are assigned as follows:
Color | Red | Blue | Green |
---|---|---|---|
Red | 1 | 0 | 0 |
Blue | 0 | 1 | 0 |
Green | 0 | 0 | 1 |
Blue | 0 | 1 | 0 |
Red | 1 | 0 | 0 |
Each original category is now transformed into multiple columns with binary values.
Step 3: Apply One-Hot Encoding Using Python
We can use `OneHotEncoder` from `sklearn.preprocessing` or `pd.get_dummies` from pandas.
Using pandas
```python
import pandas as pd

# Sample data
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Apply One-Hot Encoding; dtype=int gives 0/1 columns instead of
# the boolean columns newer pandas versions return by default
encoded_df = pd.get_dummies(df, columns=['Color'], dtype=int)
print(encoded_df)
```
Output:
```
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1
```
Using OneHotEncoder
```python
from sklearn.preprocessing import OneHotEncoder

# Sample data (2-D: one column, one row per sample)
colors = [['Red'], ['Blue'], ['Green'], ['Blue'], ['Red']]

# Initialize OneHotEncoder; sparse_output=False returns a dense
# array (scikit-learn >= 1.2; older versions use sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_colors = one_hot_encoder.fit_transform(colors)
print(encoded_colors)
```
Output (columns follow the alphabetical category order Blue, Green, Red):
```
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
```
Advantages of One-Hot Encoding
- Avoids the ordinal relationship problem in label encoding.
- Works well for categorical data where no intrinsic order exists.
Disadvantages of One-Hot Encoding
- Can create too many columns if there are many unique categories (high dimensionality).
- Requires more memory and computational power.
When to Use Label Encoding vs. One-Hot Encoding
Criteria | Label Encoding | One-Hot Encoding |
---|---|---|
Data Type | Ordinal (e.g., Low, Medium, High) | Nominal (e.g., Color, City) |
Number of Categories | Can handle many unique values | Best with few unique values |
Computational Cost | Low | High (if too many categories) |
Memory Usage | Efficient | Can be high if many categories |
Model Type | Tree-based models (Decision Trees, Random Forest) | Neural Networks, Linear Models |