Decision Trees in Machine Learning

1. Introduction to Decision Trees

A Decision Tree is a Supervised Learning algorithm used for both classification and regression problems. It mimics human decision-making by splitting data into branches to make predictions.

🌳 Key Features of Decision Trees:

✔ Easy to understand and interpret
✔ Works with both categorical & numerical data
✔ Can model non-linear relationships
✔ Requires minimal data preprocessing

📌 Real-world Applications:
✅ Spam Detection (Spam or Not Spam)
✅ Credit Risk Analysis (Loan Default or Not)
✅ Medical Diagnosis (Disease Present or Not)
✅ Customer Segmentation (High-Value vs Low-Value Customers)


2. How Do Decision Trees Work?

A Decision Tree follows a hierarchical tree-like structure, consisting of:

  • Root Node → The starting point (entire dataset)
  • Decision Nodes → Intermediate nodes where data is split
  • Leaf Nodes → Final nodes with class labels (output)

📌 Example:
Imagine you need to classify whether a customer will buy a product.
1️⃣ Start at the root node: “Is income > $50K?”
2️⃣ If Yes, move to the next decision: “Is age > 30?”
3️⃣ If No, predict: “Will not buy”
4️⃣ Keep splitting data until a final decision (leaf node) is reached, as in the sketch below.
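
A trained tree is simply a chain of such if/else tests. Here is a minimal sketch of this decision path in Python (the thresholds and the will_buy helper are illustrative, not learned from data):

def will_buy(income, age):
    # Root node: split on income first
    if income > 50_000:
        # Decision node: split on age next
        if age > 30:
            return "Will buy"      # leaf node
        return "Will not buy"      # leaf node
    return "Will not buy"          # leaf node

print(will_buy(income=60_000, age=35))  # Will buy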


3. Splitting Criteria in Decision Trees

To determine the best feature for splitting, Decision Trees use impurity measures like:

1️⃣ Gini Impurity

Gini = 1 - \sum (p_i)^2

  • Measures impurity in a dataset
  • Lower Gini = Better purity
  • Default criterion in Scikit-Learn (see the quick check below)
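
As a quick numeric check of the formula, here is a minimal NumPy sketch (the gini helper is illustrative):

import numpy as np

def gini(labels):
    # Gini = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini([0, 0, 1, 1]))  # 0.5 -> maximally impure for two classes
print(gini([1, 1, 1, 1]))  # 0.0 -> pure node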

2️⃣ Entropy & Information Gain

Entropy = -\sum p_i \log_2 p_i

Information\ Gain = Entropy_{parent} - \sum \text{(weighted child entropy)}

  • Entropy measures uncertainty in data
  • Information Gain selects the feature with the highest entropy reduction (illustrated below)
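
A minimal sketch of both quantities (the entropy and information_gain helpers are illustrative):

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child splits
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0 -> a perfect split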

3️⃣ Mean Squared Error (MSE) for Regression Trees

MSE = \frac{1}{n} \sum (y_i - \hat{y})^2

  • Used for continuous output predictions (sketched below)
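
A minimal sketch of the node-level MSE, where the prediction is the mean of the samples in the node (the mse helper is illustrative):

import numpy as np

def mse(values):
    # MSE of predicting the node mean for every sample in the node
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

print(mse([200.0, 250.0, 300.0]))  # spread around the node mean of 250.0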

📌 Gini vs Entropy
✔ Gini is faster (computationally less expensive)
✔ Entropy is more informative but requires more computation


4. Overfitting and Pruning in Decision Trees

A deep Decision Tree may lead to overfitting (high accuracy on training data, poor generalization).

✅ Solution: Pruning (reducing tree complexity)

  • Pre-Pruning (Stop growing tree early)
  • Post-Pruning (Trim branches after full growth)

✔ Techniques: setting maximum depth, minimum samples per leaf, pruning weak branches (see the example below)
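
A short Scikit-Learn sketch of both approaches, using the iris dataset as a stand-in (pre-pruning via max_depth and min_samples_leaf, post-pruning via cost-complexity pruning with ccp_alpha):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early with depth and leaf-size limits
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X, y)

# Post-pruning: grow fully, then trim weak branches with cost-complexity
# pruning (larger ccp_alpha = more aggressive pruning)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=42)
post_pruned.fit(X, y)

print('Pre-pruned depth: ', pre_pruned.get_depth())
print('Post-pruned depth:', post_pruned.get_depth())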


5. Advantages & Disadvantages of Decision Trees

✅ Advantages

✔ Simple & easy to understand
✔ No need for feature scaling
✔ Handles categorical & numerical data
✔ Can handle missing values (in some implementations)

❌ Disadvantages

❌ Prone to overfitting
❌ Biased towards dominant classes
❌ Unstable (small data changes can change tree structure)
❌ Greedy algorithm (locally optimal splits may not be globally best)


6. Implementing Decision Trees in Python (Sklearn)

Let’s build a Decision Tree using Scikit-Learn!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import tree

# Sample Dataset
data = {'Age': [25, 30, 35, 40, 45, 50, 55, 60],
        'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000],
        'Buys_Product': [0, 0, 1, 1, 1, 1, 0, 0]}

df = pd.DataFrame(data)

# Features & Target
X = df[['Age', 'Salary']]
y = df['Buys_Product']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
model = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(report)

# Visualize the Decision Tree
plt.figure(figsize=(10, 6))
tree.plot_tree(model, feature_names=['Age', 'Salary'], class_names=['No', 'Yes'], filled=True)
plt.show()

7. Hyperparameters of Decision Trees

🔹 max_depth → Limits depth to prevent overfitting
🔹 min_samples_split → Minimum samples needed to split
🔹 min_samples_leaf → Minimum samples per leaf
🔹 criterion → Choose “gini” or “entropy”
🔹 max_features → Number of features to consider for the best split

📌 Hyperparameter tuning (e.g., a grid search, sketched below) helps improve model performance and reduce overfitting!
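
One common way to tune these is a cross-validated grid search. A minimal GridSearchCV sketch, again using iris as a stand-in dataset (the grid values are illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the key hyperparameters
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
}

# 5-fold cross-validated search over all combinations
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X, y)

print('Best parameters:', grid.best_params_)
print('Best CV accuracy:', grid.best_score_)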


8. Decision Trees for Regression (Regression Trees)

Decision Trees can also predict continuous values (e.g., house prices). Instead of classification criteria like Gini or Entropy, they use Mean Squared Error (MSE) to choose splits.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Reuse the earlier DataFrame, but predict a continuous target: Salary from Age
X_reg = df[['Age']]
y_reg = df['Salary']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.25, random_state=42)

# Train Regression Tree
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)
regressor.fit(Xr_train, yr_train)

# Predict and Evaluate
y_pred = regressor.predict(Xr_test)
print(f'MSE: {mean_squared_error(yr_test, y_pred):.2f}')

✔ Works well with non-linear data
✔ Captures interactions between features


9. Decision Trees vs Other Algorithms

Algorithm       | Strengths                                    | Weaknesses
----------------|----------------------------------------------|-------------------------------
Decision Trees  | Easy to interpret, no feature scaling needed | Prone to overfitting
Random Forest   | More accurate, reduces overfitting           | Computationally expensive
SVM             | Works well with high-dimensional data        | Needs proper tuning
Neural Networks | Handles complex patterns                     | Requires large data and tuning

📌 Ensemble Methods like Random Forest and Gradient Boosting improve Decision Tree performance, as the quick comparison below illustrates!
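
As a rough illustration (iris as a stand-in; results vary by dataset), comparing a single tree against a Random Forest under cross-validation:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold CV accuracy: one tree vs. an ensemble of 100 trees
tree_score = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5).mean()

print(f'Decision Tree: {tree_score:.3f}')
print(f'Random Forest: {forest_score:.3f}')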


10. Summary

✔ Decision Trees split data based on conditions to make predictions.
✔ Use Gini Impurity, Entropy, or MSE for splitting.
✔ Can be used for classification & regression tasks.
✔ Pruning & hyperparameter tuning prevent overfitting.
✔ Used in Finance, Healthcare, Marketing, and many other fields.

Mastering Decision Trees is crucial for building robust Machine Learning models!
