Underfitting vs Overfitting

Underfitting vs Overfitting in Machine Learning

Introduction

One of the biggest challenges in machine learning is building a model that can generalize well to unseen data. The two common problems that arise while training machine learning models are:

  • Underfitting – The model is too simple and fails to learn from the training data.
  • Overfitting – The model is too complex and memorizes the training data instead of learning general patterns.

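Balancing these two issues is crucial to developing a robust machine learning model. This balance is often referred to as the bias-variance tradeoff: underfitting corresponds to high bias (the model's assumptions are too rigid to represent the data), and overfitting corresponds to high variance (the model is too sensitive to the particular training sample it saw).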
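The tradeoff can be written down for squared-error regression. The notation here is an assumption introduced for illustration and does not come from the article: the data follow y = f(x) + ε with zero-mean noise of variance σ², and f̂ is the model fitted on a randomly drawn training set. Under those assumptions, the expected error at a point x splits into three parts:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

An underfit model is dominated by the first term, an overfit model by the second. No choice of model can remove the third.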


What is Underfitting?

Definition

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the dataset. As a result, the model performs poorly on both training data and test data.

Causes of Underfitting

  1. Using a very simple model that cannot learn complex relationships (e.g., using linear regression for highly non-linear data).
  2. Too few features – Not enough information is provided to the model.
  3. Excessive regularization – Too much L1 or L2 regularization can restrict the model from learning effectively.
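  4. Insufficient or unrepresentative training data, or too little training – If the data does not cover the patterns of interest, or training stops before the model has converged, the model cannot learn them. (A small dataset on its own more often leads to overfitting, as described in the next section.)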

Symptoms of Underfitting

  • High error on the training set.
  • High error on the test set.
  • The model is unable to capture the complexity of the data.

Example of Underfitting

Imagine trying to predict house prices using only the number of bedrooms. While this is an important feature, it ignores other key factors like location, square footage, and condition. The model is too simple and does not provide accurate predictions.

Visual Representation of Underfitting

If we try to fit a linear model to a dataset with a complex non-linear pattern, the result will be underfitting:

📉 Underfitting (High Bias)

True relationship:       ----=====----=====----
Model prediction:        ----------------------

The model does not capture the real pattern in the data, leading to poor performance on both training and test data.
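The same effect can be reproduced numerically. The following is a minimal sketch, assuming scikit-learn and NumPy are installed. The sine-shaped dataset, the noise level, and all variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): one sine period plus noise.
rng = np.random.default_rng(0)
noise_std = 0.1
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=noise_std, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight line cannot follow the curve, so the model underfits.
linear_model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, linear_model.predict(X_train))
test_mse = mean_squared_error(y_test, linear_model.predict(X_test))

print(f"Noise variance: {noise_std ** 2:.3f}")
print(f"Train MSE:      {train_mse:.3f}")
print(f"Test MSE:       {test_mse:.3f}")
```

Both errors should come out far above the noise variance, which is the signature of underfitting: the model is wrong in the same way on data it has seen and data it has not.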


What is Overfitting?

Definition

Overfitting occurs when a model learns the training data too well, including noise and random fluctuations. It performs well on training data but fails to generalize to new, unseen data.

Causes of Overfitting

  1. Using a very complex model – Too many parameters or features cause the model to memorize the data instead of generalizing.
  2. Too few training examples – If the dataset is too small, the model can learn noise instead of meaningful patterns.
  3. Insufficient regularization – Without techniques like L1/L2 regularization, the model may become too complex.
  4. Too many features – If unnecessary or irrelevant features are included, the model may pick up noise.

Symptoms of Overfitting

  • Very low error on the training set.
  • High error on the test set.
  • The model performs well on known data but poorly on new data.

Example of Overfitting

Imagine training a deep neural network to recognize handwritten digits. If the model is too complex, it might memorize specific handwriting styles in the training set instead of learning general patterns, leading to poor performance on new handwriting samples.

Visual Representation of Overfitting

If we try to fit a high-degree polynomial to a dataset, we might get a model that fits every data point perfectly but does not generalize well:

📈 Overfitting (High Variance)

True relationship:       ----=====----=====----
Model prediction:        ~~~==~~===~==~~==~~

The model memorizes noise instead of learning the actual pattern.
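A small experiment makes the gap between training and test error visible. This is a sketch under assumed conditions, not a benchmark: the data are synthetic, and the sample sizes and polynomial degree are arbitrary choices that give the model almost as many parameters as training points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

def make_data(n_samples):
    """Draw noisy samples from one sine period (synthetic, for illustration)."""
    X = rng.uniform(0.0, 1.0, size=(n_samples, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=n_samples)
    return X, y

X_train, y_train = make_data(15)   # deliberately tiny training set
X_test, y_test = make_data(200)

# Degree 12 gives 13 parameters (with the intercept) for 15 points.
overfit_model = make_pipeline(
    PolynomialFeatures(degree=12, include_bias=False),
    LinearRegression(),
)
overfit_model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, overfit_model.predict(X_train))
test_mse = mean_squared_error(y_test, overfit_model.predict(X_test))
print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```

The training error should be very small while the test error is much larger, matching the symptoms listed above.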


Comparison: Underfitting vs Overfitting

Feature                 | Underfitting                                     | Overfitting
------------------------|--------------------------------------------------|--------------------------------------------------
Definition              | Model is too simple and cannot capture patterns. | Model is too complex and memorizes training data.
Training Error          | High                                             | Low
Test Error              | High                                             | High
Model Type              | Too simple                                       | Too complex
Cause                   | High bias                                        | High variance
Performance on New Data | Poor                                             | Poor
Example                 | Linear regression on complex data                | Deep neural network with too many layers

How to Avoid Underfitting and Overfitting?

1. Preventing Underfitting

  • Increase Model Complexity – Use a more complex model (e.g., switching from linear regression to polynomial regression or neural networks).
  • Feature Engineering – Add more relevant features to improve predictive power.
  • Reduce Regularization – Too much L1/L2 regularization can prevent the model from learning effectively.
  • Train Longer – Increase training time or use a different optimization algorithm.
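As an illustration of the first remedy, the sketch below fits a plain linear model and a degree-5 polynomial pipeline to the same synthetic sine-shaped data. The dataset, the degree, and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Too simple: a straight line.
linear_model = LinearRegression().fit(X_train, y_train)

# More capacity: polynomial features feeding the same linear learner.
poly_model = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

linear_test_mse = mean_squared_error(y_test, linear_model.predict(X_test))
poly_test_mse = mean_squared_error(y_test, poly_model.predict(X_test))
print(f"Linear test MSE:   {linear_test_mse:.3f}")
print(f"Degree-5 test MSE: {poly_test_mse:.3f}")
```

The extra capacity should reduce the test error substantially here because the added complexity matches real structure in the data, not noise.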

2. Preventing Overfitting

Use Regularization – L1 (Lasso) and L2 (Ridge) regularization help reduce model complexity.

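from sklearn.linear_model import Ridge

# Apply L2 regularization; alpha sets the penalty strength (larger = stronger).
# X_train and y_train are assumed to be an existing training split.
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train, y_train)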
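To see what alpha does, the following self-contained sketch fits Ridge at three penalty strengths on synthetic data and records the size of the learned coefficient vector. The data and the alpha values are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 10))
true_coef = rng.normal(size=10)
y_train = X_train @ true_coef + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1.0, 100.0]
coef_norms = []
for alpha in alphas:
    ridge_model = Ridge(alpha=alpha).fit(X_train, y_train)
    # The L2 norm of the coefficients shrinks as the penalty grows.
    coef_norms.append(float(np.linalg.norm(ridge_model.coef_)))
    print(f"alpha={alpha:>6}: ||coef|| = {coef_norms[-1]:.3f}")
```

A larger alpha pulls the coefficients toward zero. That helps against overfitting, but pushed too far it produces the excessive regularization listed earlier as a cause of underfitting.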

Cross-Validation – Implement k-fold cross-validation to check how the model performs on different subsets of data.

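from sklearn.model_selection import cross_val_score

# `model` is any scikit-learn estimator; X and y are the full features and targets.
# With the default scoring, classifiers report accuracy and regressors report R^2.
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Mean CV score: {cv_scores.mean():.2f}")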
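A runnable variant of the same idea is sketched below. It compares the accuracy on the data the model was fitted on with the 5-fold cross-validated accuracy. The synthetic dataset and the choice of an unconstrained decision tree are assumptions made to produce a visible gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data with 10% label noise (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)

# An unconstrained tree can keep splitting until it fits every training point.
model = DecisionTreeClassifier(random_state=0)
train_accuracy = model.fit(X, y).score(X, y)

# cross_val_score clones the estimator and refits it on each fold.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"Training accuracy:        {train_accuracy:.2f}")
print(f"Mean 5-fold CV accuracy:  {cv_scores.mean():.2f}")
```

A large gap between the two numbers is a practical warning sign of overfitting, because only the cross-validated score is measured on data the model did not see during fitting.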

Pruning in Decision Trees – Prevent deep decision trees from memorizing data by pruning unnecessary branches.
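One way to prune in scikit-learn is cost-complexity pruning via the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below compares an unpruned and a pruned tree. The dataset and the value `ccp_alpha=0.01` are assumptions for illustration, and in practice the value would be tuned, for example by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unpruned tree: grows until the training data is fitted.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruned tree: branches whose benefit does not justify their cost are removed.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
pruned_tree.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(
        f"{name:>6}: leaves={tree.get_n_leaves():3d} "
        f"train={tree.score(X_train, y_train):.2f} "
        f"test={tree.score(X_test, y_test):.2f}"
    )
```

The pruned tree gives up some training accuracy in exchange for a much smaller structure. Whether test accuracy improves depends on the data and on the chosen `ccp_alpha`.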

Dropout in Neural Networks – Randomly deactivate neurons during training to prevent memorization.
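The mechanism can be sketched in a few lines of NumPy. This is the common "inverted dropout" formulation, written as a standalone helper. The function name and signature are hypothetical, and deep learning frameworks provide their own dropout layers.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout (illustrative helper, not a framework API).

    During training, each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob) so the expected activation
    is unchanged. At inference time the input is returned as-is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))          # stand-in for a layer's activations
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
at_inference = dropout(hidden, drop_prob=0.5, rng=rng, training=False)

print(f"Fraction zeroed during training: {(dropped == 0).mean():.2f}")
print(f"Mean activation after dropout:   {dropped.mean():.2f}")
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single unit to memorize specific examples.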

Increase Training Data – More data helps the model learn general patterns instead of memorizing specific examples.
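Whether more data is likely to help can be estimated with a learning curve. The sketch below uses scikit-learn's `learning_curve` on synthetic data. The dataset, the model, and the three training-set fractions are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Refit the model on 10%, 50% and 100% of each training fold.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[0.1, 0.5, 1.0],
    cv=5,
)

# Each row of the score arrays holds one score per cross-validation fold.
for size, train_acc, val_acc in zip(
    train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"n={size:3d}  train={train_acc:.2f}  validation={val_acc:.2f}")
```

If the gap between training and validation scores narrows as the training set grows, collecting more data is a reasonable next step. If both scores are already close and low, the model is more likely underfitting and more data alone will not fix it.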

Reduce Model Complexity – Use simpler models (e.g., reducing polynomial degree in polynomial regression).
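One way to choose the degree is to compare candidates by cross-validated error. The following is a minimal sketch on synthetic data. The candidate degrees, the noise level, and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic dataset: one sine period plus noise (illustrative only).
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(40, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

cv_mse = {}
for degree in (1, 3, 12):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    # scikit-learn reports negated MSE so that higher is always better.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[degree] = -scores.mean()
    print(f"degree={degree:2d}  CV MSE={cv_mse[degree]:.3f}")

best_degree = min(cv_mse, key=cv_mse.get)
print(f"Lowest cross-validated error: degree {best_degree}")
```

On data like this, the straight line would be expected to score worst because it underfits, while a very high degree risks a poor cross-validated score from overfitting. The selection is driven by held-out error and not by training error.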


Real-World Examples

1. Spam Email Detection

  • Underfitting: A simple model that looks for the word “free” may miss advanced spam tactics.
  • Overfitting: A model trained on a small dataset memorizes certain spam emails and fails on new spam emails.

2. Stock Price Prediction

  • Underfitting: Using only the closing price to predict future prices.
  • Overfitting: A deep learning model that memorizes past stock prices but fails to predict future trends.
