Evaluating Time Series Models
1. Introduction to Time Series Model Evaluation
Time series forecasting models predict future values based on historical data. However, before deploying a model, it is crucial to evaluate its performance using proper metrics and validation techniques.
Why is Model Evaluation Important?
✅ Ensures model accuracy and reliability
✅ Helps identify overfitting or underfitting
✅ Enables comparison between different models
✅ Assists in hyperparameter tuning
✅ Ensures model robustness for real-world applications
2. Data Preparation for Evaluation
Step 1: Load Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2: Load and Split the Data
Before evaluating, split the dataset into training and testing sets.
df = pd.read_csv("time_series_data.csv")
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Train-Test Split (80% Train, 20% Test)
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]
✅ Training set: Used to train the model
✅ Test set: Used to evaluate the model’s performance
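The metric snippets in the next section assume predicted_values holds a model's forecasts for the test period. As a minimal stand-in, here is a naive baseline that predicts each point as the previous observation (it assumes the series column is named 'Value', as in the snippets below); substitute your own model's forecasts:
# Naive one-step baseline: each prediction is the previous observed value.
# Replace with real model forecasts in practice.
predicted_values = df['Value'].shift(1).iloc[train_size:]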
3. Evaluation Metrics for Time Series Models
Time series data is sequential, so standard regression metrics such as R² alone may not be enough. Instead, we use error-based metrics:
1️⃣ Mean Absolute Error (MAE)
The average absolute difference between actual and predicted values:
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
def calculate_mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)
mae = calculate_mae(test['Value'], predicted_values)
print(f"MAE: {mae}")
✅ Lower MAE = Better model performance
2️⃣ Mean Squared Error (MSE)
The average squared difference between actual and predicted values:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
def calculate_mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)
mse = calculate_mse(test['Value'], predicted_values)
print(f"MSE: {mse}")
✅ Lower MSE = Fewer large prediction errors
3️⃣ Root Mean Squared Error (RMSE)
The square root of MSE, expressed in the same units as the original data:
RMSE = \sqrt{MSE}
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
rmse = calculate_rmse(test['Value'], predicted_values)
print(f"RMSE: {rmse}")
✅ Lower RMSE = Fewer large deviations
4️⃣ Mean Absolute Percentage Error (MAPE)
Measures the average percentage error of predictions (undefined when any actual value is zero):
MAPE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
def calculate_mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
mape = calculate_mape(test['Value'], predicted_values)
print(f"MAPE: {mape}%")
✅ Lower MAPE = Better model accuracy
5️⃣ Symmetric Mean Absolute Percentage Error (SMAPE)
Bounds the percentage error symmetrically, so it behaves better than MAPE when actual values are small or near zero:
SMAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2}
def calculate_smape(y_true, y_pred):
    return 100 * np.mean(2 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))
smape = calculate_smape(test['Value'], predicted_values)
print(f"SMAPE: {smape}%")
✅ Lower SMAPE = More balanced evaluation
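Since the five metrics are usually reported together, the functions above can be wrapped into one helper. A small convenience sketch, using only the functions already defined:
def evaluate_forecast(y_true, y_pred):
    # Collect all five metrics in one dictionary for easy comparison.
    return {
        'MAE': calculate_mae(y_true, y_pred),
        'MSE': calculate_mse(y_true, y_pred),
        'RMSE': calculate_rmse(y_true, y_pred),
        'MAPE': calculate_mape(y_true, y_pred),
        'SMAPE': calculate_smape(y_true, y_pred),
    }
print(evaluate_forecast(test['Value'], predicted_values))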
4. Model Validation Techniques for Time Series
Unlike regular datasets, time series data is ordered, so validation techniques must respect that order.
1️⃣ Time-Based Train-Test Split
Splitting data sequentially while keeping order intact.
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]
✅ Avoids data leakage
2️⃣ Rolling Forecast Origin (Walk-Forward Validation)
Instead of a single static test set, the model is repeatedly retrained on an expanding window of data.
Example:
- Train model on first 100 days, predict day 101.
- Train on first 101 days, predict day 102.
- Repeat until the end of the dataset.
scikit-learn's TimeSeriesSplit produces expanding-window splits in this spirit (a hand-written one-step loop is sketched below):
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(df):
    train, test = df.iloc[train_index], df.iloc[test_index]
✅ Best for real-time forecasting scenarios
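The literal one-step scheme from the example above can also be written by hand. A minimal sketch, using the naive last-value forecast as a stand-in for a real model that would be refit at each step:
history = list(df['Value'].iloc[:train_size])
walk_forward_preds = []
for t in range(train_size, len(df)):
    # A real model would be retrained on `history` here.
    walk_forward_preds.append(history[-1])   # naive one-step-ahead forecast
    history.append(df['Value'].iloc[t])      # reveal the true value, advance the origin
print(f"Walk-forward RMSE: {calculate_rmse(df['Value'].iloc[train_size:], walk_forward_preds)}")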
3️⃣ K-Fold Cross-Validation (Time Series Aware)
Unlike standard k-fold, TimeSeriesSplit never shuffles: each fold trains only on observations that precede its test window (a per-fold evaluation is sketched below).
tscv = TimeSeriesSplit(n_splits=5)
✅ Reduces overfitting risk
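To turn the splitter into an actual cross-validation score, evaluate a forecast on each fold and average the results. A sketch, again using a naive last-value forecast in place of a real model:
fold_rmses = []
for fold_train_idx, fold_test_idx in tscv.split(df):
    fold_train = df['Value'].iloc[fold_train_idx]
    fold_test = df['Value'].iloc[fold_test_idx]
    # Naive forecast: repeat the last training value across the fold's horizon.
    fold_pred = np.repeat(fold_train.iloc[-1], len(fold_test))
    fold_rmses.append(calculate_rmse(fold_test, fold_pred))
print(f"Mean RMSE across folds: {np.mean(fold_rmses)}")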
5. Visualizing Model Performance
1️⃣ Actual vs. Predicted Plot
plt.figure(figsize=(12,6))
plt.plot(test.index, test['Value'], label="Actual", color='blue')
plt.plot(test.index, predicted_values, label="Predicted", color='red')
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Actual vs Predicted")
plt.legend()
plt.show()
✅ Identifies if model follows trends correctly
2️⃣ Residual Plot
Residuals are the differences between actual and predicted values; a histogram centered on zero with no heavy skew suggests unbiased predictions.
residuals = test['Value'] - predicted_values
plt.figure(figsize=(12,6))
plt.hist(residuals, bins=30, edgecolor='black')
plt.title("Residual Histogram")
plt.show()
✅ Identifies systematic errors in predictions
6. Model Selection & Optimization
After evaluating multiple models, choose the best one based on the criteria below (a comparison sketch follows the list):
✔ Lowest MAPE or RMSE
✔ Best fit on validation data
✔ Least overfitting
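As a sketch of the comparison step, assuming each candidate model's test-set forecasts have been collected in a dictionary (the entries here are hypothetical placeholders):
# Hypothetical forecasts from competing models; replace with real outputs.
candidate_forecasts = {
    'naive': predicted_values,
    # 'arima': arima_preds,
    # 'xgboost': xgb_preds,
}
scores = {name: calculate_rmse(test['Value'], preds)
          for name, preds in candidate_forecasts.items()}
best = min(scores, key=scores.get)
print(f"Best model by RMSE: {best} ({scores[best]})")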
Hyperparameter Tuning
For models such as ARIMA, LSTMs, or XGBoost, use grid search or Bayesian optimization. For example, with XGBoost:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor
param_grid = {'learning_rate': [0.01, 0.1, 0.2], 'n_estimators': [100, 200, 300]}
# X_train, y_train: lagged feature matrix and target built from the training split
grid = GridSearchCV(XGBRegressor(), param_grid, cv=TimeSeriesSplit(n_splits=5))
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
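After fitting, grid.best_params_ reports the winning hyperparameter combination; re-evaluate best_model on the held-out test set with the metrics from Section 3 before deploying.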