Not normalizing data before training ML models

Data normalization is a crucial preprocessing step in machine learning. Skipping it can hurt model performance, particularly for algorithms that are sensitive to feature scale. In this guide, I will explain why normalization is important, how it affects different machine learning models, and how to normalize data correctly.

Step 1: Understanding Data Normalization

What is Data Normalization?

Normalization is the process of rescaling numerical features to a common range, typically 0 to 1 or -1 to 1. It keeps a feature from dominating the model’s learning process simply because it is measured in larger units.
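
To make the definition concrete, here is a minimal NumPy sketch of the two most common rescalings, computed by hand. The toy matrix and the variable names (X_minmax, X_zscore) are illustrative assumptions, not part of any library.

```python
import numpy as np

# Toy feature matrix (illustrative values): two columns on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 3000.0],
              [4.0, 5000.0],
              [5.0, 10000.0]])

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Z-score standardization: (x - mean) / std, computed per column.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

print("Min-max range per column:", X_minmax.min(axis=0), X_minmax.max(axis=0))
print("Z-score mean and std per column:", X_zscore.mean(axis=0), X_zscore.std(axis=0))
```

After either transform, both columns live on comparable scales even though the raw values differ by three orders of magnitude.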

Why is Normalization Important?

Prevents Large Values from Dominating – Features with larger magnitudes can overshadow smaller ones, leading to biased predictions.
Improves Convergence Speed – Many optimization algorithms, such as gradient descent, work better when features are on a similar scale.
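Enhances Model Performance – Scale-sensitive models (distance-based methods, regularized linear models, neural networks) usually perform better when features have comparable ranges. Note that rescaling does not change the shape of a distribution, so it does not make data normally distributed.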

Step 2: What Happens When You Don’t Normalize Data?

1. Poor Performance in Distance-Based Algorithms

Many machine learning models rely on distance calculations, such as:

K-Nearest Neighbors (KNN)
Support Vector Machines (SVM)
K-Means Clustering

If features have different scales, distances are dominated by the features with the largest numeric ranges, and the remaining features are effectively ignored.

Example of KNN Without Normalization

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Creating a dataset with different scales
X = np.array([[1, 1000], [2, 1500], [3, 3000], [4, 5000], [5, 10000]])
y = np.array([0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training KNN without normalization
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))

Issue:

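The feature with values in the thousands (second column) dominates the smaller one.
Distance calculations become skewed, so neighbors are chosen almost entirely by the second column.
With only five rows, the test set holds a single sample, so the printed accuracy is either 0.0 or 1.0; treat this as an illustration, not a benchmark.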
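
The dominance can be checked directly. The sketch below takes two rows from the dataset above and measures how much each feature contributes to their squared Euclidean distance; the names sq_diff and share are illustrative.

```python
import numpy as np

# Two rows from the toy dataset above.
a = np.array([1.0, 1000.0])
b = np.array([2.0, 1500.0])

# Per-feature contribution to the squared Euclidean distance.
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()

print("Squared differences per feature:", sq_diff)
print("Share of the distance per feature:", share)
```

Nearly all of the distance comes from the second column, so the first column has almost no say in which neighbors are selected.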

2. Slower and Inefficient Gradient Descent in Neural Networks

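Deep learning models use gradient descent to minimize loss functions. If feature values differ widely, the gradients for weights attached to large-scale features are much larger than the rest, so no single learning rate suits every weight and updates become inefficient.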

Effect on Training

Without normalization: Large-scale features cause huge updates, making learning unstable.
With normalization: Model trains faster with stable updates.

Example of Normalization Impact on Neural Networks

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

# Sample dataset with unnormalized data
X = np.array([[1, 1000], [2, 1500], [3, 3000], [4, 5000], [5, 10000]], dtype=np.float32)
y = np.array([0, 1, 0, 1, 0], dtype=np.float32)

# Defining a simple neural network
model = Sequential([
    Dense(10, activation="relu", input_shape=(2,)),
    Dense(1, activation="sigmoid")
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training without normalization
model.fit(X, y, epochs=5, verbose=1)

Issues Observed Without Normalization:

Loss may fluctuate or fail to decrease steadily.
Training may need more epochs to converge than it would on scaled inputs.
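
One way to make the slowdown concrete, assuming a plain linear model with squared-error loss rather than the Keras network above: the Hessian of that loss is proportional to X^T X, and its condition number indicates how stretched the loss surface is. A large value means gradient descent needs a small learning rate and many steps. The helper name hessian_condition_number is made up for this sketch.

```python
import numpy as np

# Same toy features as in the neural network example.
X = np.array([[1, 1000], [2, 1500], [3, 3000], [4, 5000], [5, 10000]], dtype=np.float64)

# Standardize each column to mean 0 and standard deviation 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

def hessian_condition_number(features):
    # For a linear model with squared-error loss, the Hessian is proportional to X^T X.
    return np.linalg.cond(features.T @ features)

cond_raw = hessian_condition_number(X)
cond_std = hessian_condition_number(X_std)

print("Condition number, raw features:", cond_raw)
print("Condition number, standardized features:", cond_std)
```

The raw features give a condition number several orders of magnitude larger than the standardized ones, which is the geometric reason unscaled inputs train slowly or unstably.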

3. Incorrect Weight Assignments in Linear Models

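Linear models like Linear Regression and Logistic Regression learn one coefficient per feature, and each coefficient is expressed in the units of its feature. If features are on different scales:

Coefficient magnitudes are not comparable: a feature measured in large units tends to receive a numerically small coefficient, and vice versa.
Regularized variants (Ridge, Lasso, and scikit-learn's LogisticRegression, which applies L2 regularization by default) penalize coefficients unevenly, and gradient-based solvers converge more slowly.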

Example: Linear Regression Without Normalization

from sklearn.linear_model import LinearRegression
import numpy as np

# Example dataset
X = np.array([[1, 100], [2, 200], [3, 500], [4, 800], [5, 1000]])
y = np.array([10, 20, 30, 40, 50])

# Training Linear Regression
model = LinearRegression()
model.fit(X, y)

print("Coefficients:", model.coef_)

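Issue:

The coefficients are tied to each feature's units, so their magnitudes cannot be used to rank feature importance. In this dataset y is exactly 10 times the first column, so the fit puts nearly all the weight on column 1 and almost none on the larger-scale column 2.
For unregularized least squares, the predictions themselves do not change when features are rescaled; the harm appears with regularization, gradient-based solvers, and coefficient interpretation.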
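
A short check of this behavior, as a sketch on the same toy data: fit ordinary least squares on raw and standardized features and compare coefficients and predictions. The names raw_model and scaled_model are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Same toy data as the example above.
X = np.array([[1, 100], [2, 200], [3, 500], [4, 800], [5, 1000]], dtype=float)
y = np.array([10, 20, 30, 40, 50], dtype=float)

X_scaled = StandardScaler().fit_transform(X)

raw_model = LinearRegression().fit(X, y)
scaled_model = LinearRegression().fit(X_scaled, y)

# Coefficients change with the units of each feature...
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)

# ...but unregularized least-squares predictions stay the same.
same_predictions = np.allclose(raw_model.predict(X), scaled_model.predict(X_scaled))
print("Same predictions:", same_predictions)
```

Scaling matters more once a penalty term is added, because the penalty acts on the coefficient values, which depend on each feature's units.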

Step 3: How to Normalize Data Properly?

1. Min-Max Scaling (0 to 1)

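Rescales each feature to the range 0 to 1. In practice, fit the scaler on the training split only and reuse it to transform the test split, as in Step 5: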

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

Best for: Neural networks and models requiring bounded inputs. It is sensitive to outliers, because a single extreme value sets the range.

2. Standardization (Z-Score Normalization)

Transforms data to have a mean of 0 and standard deviation of 1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Best for: Linear regression, logistic regression, and SVM.

3. Robust Scaling (Handles Outliers)

Uses median and interquartile range to scale:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

Best for: Datasets with outliers.
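
To compare the three scalers side by side, the sketch below applies each one to a single made-up column that contains one extreme value. The data and the results dictionary are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One feature with five typical values and one outlier (made-up data).
values = np.array([[10.0], [12.0], [11.0], [13.0], [12.0], [500.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(values).ravel()
    print(name, np.round(results[name], 3))

# Spread of the five typical values after each transform.
for name, scaled in results.items():
    inliers = scaled[:5]
    print(name, "inlier spread:", round(float(inliers.max() - inliers.min()), 4))
```

With min-max scaling the outlier squeezes the typical values into a tiny slice near 0, while RobustScaler keeps them well separated because the median and interquartile range are barely affected by the extreme value.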

Step 4: When Not to Normalize?

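Tree-Based Models (Decision Trees, Random Forest, XGBoost)
- These models split on one feature at a time using thresholds, so they are not sensitive to feature scaling.
- Normalization typically provides no benefit.
Categorical Features
- One-hot encoded variables are already 0 or 1 and usually should not be normalized.
Data with Meaningful Scales
- Example: Age and income have natural units; if you need to interpret results in those units and the model is not scale-sensitive, keep the original values.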
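
The tree-based case can be checked with a small experiment. This sketch uses synthetic data from make_classification, with one column multiplied by 1000 to create a scale mismatch (an arbitrary choice for illustration), and compares a decision tree trained on raw features against one trained on standardized features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the first column is blown up to create a scale mismatch.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training split only.
scaler = StandardScaler().fit(X_train)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_train), y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))

# Fraction of test rows where both trees give the same label.
agreement = (pred_raw == pred_scaled).mean()
print("Prediction agreement:", agreement)
```

Because a threshold split depends only on the ordering of values within a feature, the two trees should agree on essentially every test row, apart from possible floating-point ties.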

Step 5: Real-World Example (Impact of Normalization on KNN)

Without Normalization (reusing X_train, X_test, y_train, and y_test from the KNN example in Step 2)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy without normalization:", accuracy_score(y_test, y_pred))

With Normalization

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn.fit(X_train_scaled, y_train)
y_pred_scaled = knn.predict(X_test_scaled)
print("Accuracy with normalization:", accuracy_score(y_test, y_pred_scaled))

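Expected Outcome:

On realistic datasets with features on different scales, KNN accuracy usually improves after normalization, because every feature contributes to the distance.
On this five-row toy dataset the test set is a single sample, so each run reports either 0.0 or 1.0 and the two may not differ; use a larger dataset and cross-validation for a meaningful comparison.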
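
For a comparison on more than five rows, the sketch below uses scikit-learn's bundled wine dataset, whose features span very different ranges, and 5-fold cross-validation. The choice of dataset, n_neighbors=5, and cv=5 are assumptions made for illustration. Wrapping the scaler and the classifier in a pipeline ensures the scaler is fitted only on the training folds.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Wine dataset: 13 numeric features with very different ranges.
X, y = load_wine(return_X_y=True)

knn_raw = KNeighborsClassifier(n_neighbors=5)
# The pipeline refits the scaler inside each fold, avoiding test-data leakage.
knn_scaled = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

acc_raw = cross_val_score(knn_raw, X, y, cv=5).mean()
acc_scaled = cross_val_score(knn_scaled, X, y, cv=5).mean()

print("Mean CV accuracy without normalization:", round(acc_raw, 3))
print("Mean CV accuracy with normalization:", round(acc_scaled, 3))
```

On this dataset the scaled pipeline scores noticeably higher, which is the effect the toy example is meant to illustrate.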