Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates during development but poor generalization on new data. One of the most common causes of data leakage is incorrect train-test splitting.
Step 1: Understanding Incorrect Train-Test Splitting
A proper train-test split ensures that:
The test set remains completely unseen during model training.
There is no overlap between the training and test sets (a quick check for this is sketched at the end of this step).
Time-dependent features (e.g., stock prices, time series data) are split chronologically to prevent look-ahead bias.
No target-related information leaks into the training data.
Incorrect Train-Test Splitting Can Lead To:
Overestimated model accuracy.
Poor real-world performance.
Overfitting due to memorization instead of learning general patterns.
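Once you have created X_train and X_test with any of the splits shown below, a quick sanity check can confirm the no-overlap requirement. This is a minimal sketch, assuming X_train and X_test are pandas objects that keep their original row index:
# Sanity check: no row should appear in both the training and test sets
overlap = set(X_train.index) & set(X_test.index)
assert not overlap, f"{len(overlap)} rows appear in both the training and test sets"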
Step 2: Common Mistakes That Cause Data Leakage
1. Splitting the Data After Preprocessing
Issue: If you fit a scaler or normalizer on the entire dataset before splitting, its statistics (e.g., mean and standard deviation) are computed from the test rows as well, so information about the test set leaks into the training process.
Incorrect Way
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Separate features and target (dataset is assumed to be an already loaded pandas DataFrame)
X, y = dataset.drop(columns=['target']), dataset['target']
# Apply scaling before train-test split (WRONG!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leakage occurs here
# Splitting after scaling
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Correct Way
# Splitting before scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply scaling only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to test data
X_test_scaled = scaler.transform(X_test)
The scaler now learns its parameters from the training data alone, and the test set is transformed with those same parameters, preventing leakage.
2. Data Leakage in Time-Series Data
Issue: If you randomly shuffle time-series data, future information may leak into training.
Incorrect Way (Random Split in Time-Series Data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
Correct Way (Chronological Split in Time-Series Data)
split_index = int(0.8 * len(X)) # Use 80% for training, 20% for testing
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
This ensures that all training samples come before the test samples in time, preventing look-ahead bias.
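If you also need cross-validation on time-series data, scikit-learn's TimeSeriesSplit applies the same chronological idea across several folds. A minimal sketch, assuming X and y are NumPy arrays (use .iloc for pandas objects):
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # within each fold, every training index precedes every test index
    X_train_fold, X_test_fold = X[train_idx], X[test_idx]
    y_train_fold, y_test_fold = y[train_idx], y[test_idx]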
3. Leakage Through Target Encoding
Issue: If target encoding is fitted on the full dataset before splitting, the encoded values are computed using target labels from what will become the test set, leaking that information into the training features.
Incorrect Way (Encoding Before Split)
from category_encoders import TargetEncoder
encoder = TargetEncoder()
X_encoded = encoder.fit_transform(X, y) # Leakage occurs here
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
Correct Way (Encoding After Split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
encoder = TargetEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
Ensures that target encoding is only learned from training data.
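Because TargetEncoder follows the scikit-learn transformer interface, it can also be placed inside a Pipeline (covered in the next step), so the encoding is refit on training data only, even during cross-validation. A sketch, assuming a classification target:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from category_encoders import TargetEncoder

# The encoder is fitted only when pipeline.fit is called on the training data
encoding_pipeline = Pipeline([
    ('encode', TargetEncoder()),
    ('model', LogisticRegression())
])
encoding_pipeline.fit(X_train, y_train)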
Step 3: Preventing Data Leakage with Pipelines
To avoid these manual mistakes, use Pipeline from sklearn.pipeline, which fits every transformation only on the data passed to fit, i.e., the training data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Creating pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
# Fitting only on training data
pipeline.fit(X_train, y_train)
# Evaluating on test data
test_accuracy = pipeline.score(X_test, y_test)
print("Test Accuracy:", test_accuracy)
Ensures that scaling is only learned from training data and prevents leakage.
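The pipeline pays off even more with cross-validation, because the scaler is refit from scratch on each training fold rather than once on all the data. A short sketch using the pipeline defined above:
from sklearn.model_selection import cross_val_score

# Each fold refits the scaler on that fold's training portion only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Cross-validation accuracy:", cv_scores.mean())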
Step 4: How to Detect Data Leakage?
Signs of Data Leakage:
- Training accuracy is much higher than test accuracy (e.g., 99% vs 60%).
- Test accuracy is suspiciously high compared to expected benchmarks.
- Feature importance analysis reveals “impossible” features, such as an ID number highly correlated with the target.
- Validation performance drops significantly on real-world data.
Check if train-test performance is consistent:
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(f"Training Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")
If training accuracy is far higher than test accuracy, data leakage or plain overfitting is likely!
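To follow up on the "impossible feature" sign above, a quick correlation scan can surface features that track the target suspiciously closely. A minimal sketch, assuming X is a pandas DataFrame of numeric features and y a pandas Series:
# Features with near-perfect correlation to the target deserve scrutiny
suspicious = X.corrwith(y).abs().sort_values(ascending=False)
print(suspicious.head(10))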
Step 5: Summary of Best Practices
| Common Mistake | Fix |
|---|---|
| Scaling before train-test split | Scale after splitting |
| Random splitting of time-series data | Use chronological split |
| Applying target encoding before split | Encode after splitting |
| Manually transforming features | Use Pipelines to prevent leakage |
