Evaluating Time Series Models
1. Introduction to Time Series Model Evaluation
Time series forecasting models predict future values based on historical data. However, before deploying a model, it is crucial to evaluate its performance using proper metrics and validation techniques.
Why is Model Evaluation Important?
✅ Ensures model accuracy and reliability
✅ Helps identify overfitting or underfitting
✅ Enables comparison between different models
✅ Assists in hyperparameter tuning
✅ Ensures model robustness for real-world applications
2. Data Preparation for Evaluation
Step 1: Load Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2: Load and Split the Data
Before evaluating, split the dataset into training and testing sets.
df = pd.read_csv("time_series_data.csv")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Train-Test Split (80% Train, 20% Test)
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]
✅ Training set: Used to train the model
✅ Test set: Used to evaluate the model’s performance
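The metric snippets in the next section assume predicted_values holds a model's forecasts for the test period. As a minimal stand-in, here is a naive baseline that predicts each point as the previous observation (it assumes the series column is named 'Value', as in the snippets below); substitute your own model's forecasts:
# Naive one-step baseline: each prediction is the previous observed value.
# Replace with real model forecasts in practice.
predicted_values = df['Value'].shift(1).iloc[train_size:]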
3. Evaluation Metrics for Time Series Models
Time series data is sequential, so standard regression metrics such as R² alone may not be enough. Instead, we use error-based metrics:
1️⃣ Mean Absolute Error (MAE)
The average absolute difference between actual and predicted values:
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
def calculate_mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)
mae = calculate_mae(test['Value'], predicted_values)
print(f"MAE: {mae}")
✅ Lower MAE = Better model performance
2️⃣ Mean Squared Error (MSE)
The average squared difference between actual and predicted values:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
def calculate_mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)
mse = calculate_mse(test['Value'], predicted_values)
print(f"MSE: {mse}")
✅ Lower MSE = Fewer large prediction errors
3️⃣ Root Mean Squared Error (RMSE)
The square root of MSE, expressed in the same units as the original data:
RMSE = \sqrt{MSE}
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
rmse = calculate_rmse(test['Value'], predicted_values)
print(f"RMSE: {rmse}")
✅ Lower RMSE = Fewer large deviations
4️⃣ Mean Absolute Percentage Error (MAPE)
Measures the average percentage error of predictions (undefined when any actual value is zero):
MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
def calculate_mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = calculate_mape(test['Value'], predicted_values)
print(f"MAPE: {mape}%")
✅ Lower MAPE = Better model accuracy
5️⃣ Symmetric Mean Absolute Percentage Error (SMAPE)
Bounds the percentage error symmetrically, so it behaves better than MAPE when actual values are small or near zero:
SMAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}
def calculate_smape(y_true, y_pred):
    return 100 * np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))
smape = calculate_smape(test['Value'], predicted_values)
print(f"SMAPE: {smape}%")
✅ Lower SMAPE = More balanced evaluation
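Since the five metrics are usually reported together, the functions above can be wrapped into one helper. A small convenience sketch, using only the functions already defined:
def evaluate_forecast(y_true, y_pred):
    # Collect all five metrics in one dictionary for easy comparison.
    return {
        'MAE': calculate_mae(y_true, y_pred),
        'MSE': calculate_mse(y_true, y_pred),
        'RMSE': calculate_rmse(y_true, y_pred),
        'MAPE': calculate_mape(y_true, y_pred),
        'SMAPE': calculate_smape(y_true, y_pred),
    }
print(evaluate_forecast(test['Value'], predicted_values))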
4. Model Validation Techniques for Time Series
Unlike regular datasets, time series data is ordered, so validation techniques must respect that order.
1️⃣ Time-Based Train-Test Split
Splitting data sequentially while keeping order intact.
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]
✅ Avoids data leakage
2️⃣ Rolling Forecast Origin (Walk-Forward Validation)
Instead of a single static test set, the model is repeatedly retrained on an expanding window of data.
Example:
- Train model on first 100 days, predict day 101.
- Train on first 101 days, predict day 102.
- Repeat until the end of the dataset.
scikit-learn's TimeSeriesSplit produces expanding-window splits in this spirit (a hand-written one-step loop is sketched below):
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(df):
    train, test = df.iloc[train_index], df.iloc[test_index]
✅ Best for real-time forecasting scenarios
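The literal one-step scheme from the example above can also be written by hand. A minimal sketch, using the naive last-value forecast as a stand-in for a real model that would be refit at each step:
history = list(df['Value'].iloc[:train_size])
walk_forward_preds = []
for t in range(train_size, len(df)):
    # A real model would be retrained on `history` here.
    walk_forward_preds.append(history[-1])   # naive one-step-ahead forecast
    history.append(df['Value'].iloc[t])      # reveal the true value, advance the origin
print(f"Walk-forward RMSE: {calculate_rmse(df['Value'].iloc[train_size:], walk_forward_preds)}")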
3️⃣ K-Fold Cross-Validation (Time Series Aware)
Unlike standard k-fold, TimeSeriesSplit never shuffles: each fold trains only on observations that precede its test window (a per-fold evaluation is sketched below).
tscv = TimeSeriesSplit(n_splits=5)
✅ Reduces overfitting risk
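To turn the splitter into an actual cross-validation score, evaluate a forecast on each fold and average the results. A sketch, again using a naive last-value forecast in place of a real model:
fold_rmses = []
for fold_train_idx, fold_test_idx in tscv.split(df):
    fold_train = df['Value'].iloc[fold_train_idx]
    fold_test = df['Value'].iloc[fold_test_idx]
    # Naive forecast: repeat the last training value across the fold's horizon.
    fold_pred = np.repeat(fold_train.iloc[-1], len(fold_test))
    fold_rmses.append(calculate_rmse(fold_test, fold_pred))
print(f"Mean RMSE across folds: {np.mean(fold_rmses)}")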
5. Visualizing Model Performance
1️⃣ Actual vs. Predicted Plot
plt.figure(figsize=(12,6))
plt.plot(test.index, test['Value'], label="Actual", color='blue')
plt.plot(test.index, predicted_values, label="Predicted", color='red')
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Actual vs Predicted")
plt.legend()
plt.show()
✅ Identifies if model follows trends correctly
2️⃣ Residual Plot
Residuals are the differences between actual and predicted values; a histogram centered on zero with no heavy skew suggests unbiased predictions.
residuals = test['Value'] - predicted_values
plt.figure(figsize=(12,6))
plt.hist(residuals, bins=30, edgecolor='black')
plt.title("Residual Histogram")
plt.show()
✅ Identifies systematic errors in predictions
6. Model Selection & Optimization
After evaluating multiple models, choose the best one based on the criteria below (a comparison sketch follows the list):
✔ Lowest MAPE or RMSE
✔ Best fit on validation data
✔ Least overfitting
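As a sketch of the comparison step, assuming each candidate model's test-set forecasts have been collected in a dictionary (the entries here are hypothetical placeholders):
# Hypothetical forecasts from competing models; replace with real outputs.
candidate_forecasts = {
    'naive': predicted_values,
    # 'arima': arima_preds,
    # 'xgboost': xgb_preds,
}
scores = {name: calculate_rmse(test['Value'], preds)
          for name, preds in candidate_forecasts.items()}
best = min(scores, key=scores.get)
print(f"Best model by RMSE: {best} ({scores[best]})")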
Hyperparameter Tuning
For models such as ARIMA, LSTMs, or XGBoost, use grid search or Bayesian optimization. For example, with XGBoost:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor
param_grid = {'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [100, 200, 300]}
# X_train, y_train: lagged feature matrix and target built from the training split
grid = GridSearchCV(XGBRegressor(), param_grid, cv=TimeSeriesSplit(n_splits=5))
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
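After fitting, grid.best_params_ reports the winning hyperparameter combination; re-evaluate best_model on the held-out test set with the metrics from Section 3 before deploying.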