Random Forests in Machine Learning
1. Introduction to Random Forests
Random Forest is a Supervised Machine Learning algorithm that is used for both Classification and Regression tasks. It is an ensemble learning method that combines multiple Decision Trees to improve accuracy and reduce overfitting.
🌳 Why Use Random Forests?
✔ Higher Accuracy than a single Decision Tree
✔ Reduces Overfitting
✔ Handles Missing Data Well
✔ Works with Categorical & Numerical Data
✔ Can Handle Large Datasets
📌 Real-world Applications:
✅ Fraud Detection (Banking & Finance)
✅ Medical Diagnosis (Disease Prediction)
✅ Customer Segmentation (Marketing & E-commerce)
✅ Stock Market Prediction
✅ Image Recognition
2. How Does Random Forest Work?
A Random Forest consists of multiple Decision Trees. Instead of relying on a single tree, it aggregates the predictions from many trees to make a more robust and accurate prediction.
🌲 Steps to Build a Random Forest
🔹 Step 1: Select random subsets of the dataset (Bootstrap Sampling)
🔹 Step 2: Build multiple Decision Trees independently
🔹 Step 3: Use Random Feature Selection at each node to make the trees diverse
🔹 Step 4: Aggregate the predictions
- 🟢 For Classification: Uses Majority Voting (mode of all trees)
- 🔴 For Regression: Uses Averaging (mean of all trees)
📌 Example:
- If 7 out of 10 trees predict “Spam” and 3 predict “Not Spam”, the final result is “Spam”.
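In code, the majority vote is just a count over the individual tree predictions. A minimal sketch (the labels and counts below simply mirror the example above):
from collections import Counter
# Hypothetical predictions from 10 individual trees, mirroring the example above
tree_predictions = ['Spam'] * 7 + ['Not Spam'] * 3
# Majority voting: the most frequent class wins
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)  # -> Spam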
3. Key Concepts in Random Forests
📌 1️⃣ Bootstrap Aggregation (Bagging)
- Each tree is trained on a different random subset of data (sampling with replacement).
- This increases diversity among trees and reduces overfitting.
📌 2️⃣ Random Feature Selection
- Instead of using all features, only a subset of features is considered when splitting each node.
- This makes trees less correlated and improves generalization.
📌 3️⃣ Majority Voting (For Classification)
- Each tree predicts a class, and the most frequent class becomes the final prediction.
📌 4️⃣ Averaging (For Regression)
- Each tree predicts a numerical value, and the average of all trees is the final output.
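Putting these four ideas together, the sketch below hand-rolls a tiny forest (purely illustrative, not how scikit-learn implements it internally): bootstrap samples, per-split random feature selection via max_features, and majority voting. The make_classification toy dataset is an assumption for demonstration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data: 200 samples, 8 features, binary labels
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bagging: sample rows with replacement (bootstrap sample)
    idx = rng.integers(0, len(X), size=len(X))
    # Random feature selection: max_features='sqrt' considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features='sqrt')
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority voting: each tree predicts, and the most frequent class (0 or 1) wins
all_preds = np.array([t.predict(X) for t in trees])          # shape: (n_trees, n_samples)
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)  # works because labels are 0/1
print('Training accuracy of the hand-rolled ensemble:', (majority_vote == y).mean())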
4. Advantages & Disadvantages of Random Forests
✅ Advantages
✔ Higher Accuracy than a single Decision Tree
✔ Prevents Overfitting with randomization
✔ Works Well with Missing Data
✔ Reduces Variance (Bias-Variance Tradeoff)
✔ Can be used for both Classification & Regression
✔ Feature Importance Ranking
❌ Disadvantages
❌ Computationally Expensive (slower to train and predict than a single Decision Tree)
❌ Harder to Interpret than a single Decision Tree
❌ Requires More Memory (Many trees need storage)
5. Implementing Random Forest in Python (Sklearn)
Let’s build a Random Forest Classifier using the Scikit-Learn library.
📌 Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
📌 Load Data
# Sample Dataset
data = {'Age': [25, 30, 35, 40, 45, 50, 55, 60],
        'Salary': [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000],
        'Buys_Product': [0, 0, 1, 1, 1, 1, 0, 0]}
df = pd.DataFrame(data)
# Features & Target
X = df[['Age', 'Salary']]
y = df['Buys_Product']
📌 Split Data into Training & Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
📌 Train a Random Forest Model
# Initialize Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# Train the model
rf_model.fit(X_train, y_train)
📌 Make Predictions & Evaluate
# Predict on test data
y_pred = rf_model.predict(X_test)
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(report)
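Once trained, the model can also score new, unseen examples. The customer below is made up purely for illustration:
# Predict for a hypothetical new customer (Age 33, Salary 52000)
new_customer = pd.DataFrame({'Age': [33], 'Salary': [52000]})
print(rf_model.predict(new_customer))        # predicted class (0 = won't buy, 1 = will buy)
print(rf_model.predict_proba(new_customer))  # class probabilities averaged over all trees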
6. Hyperparameters of Random Forests
📌 Important Hyperparameters
🔹 n_estimators
→ Number of trees (default: 100)
🔹 max_depth
→ Maximum depth of trees
🔹 min_samples_split
→ Minimum samples needed to split a node
🔹 min_samples_leaf
→ Minimum samples per leaf
🔹 max_features
→ Number of features considered at each split
🔹 bootstrap
→ Whether to use Bootstrap Sampling (default: True)
📌 Tuning these hyperparameters improves accuracy and reduces overfitting!
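One common way to tune them is a cross-validated grid search. The sketch below is only a starting point: the parameter grid is an illustrative assumption, and the 8-row toy dataset above is too small for meaningful cross-validation, so in practice you would run this on a real training set.
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; adjust the ranges to your dataset
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2'],
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42),
                           param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)  # assumes a reasonably sized training set
print('Best parameters:', grid_search.best_params_)
print('Best CV accuracy:', grid_search.best_score_)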
7. Feature Importance in Random Forests
One major advantage of Random Forests is that they can determine which features are most important in making predictions.
📌 Find Important Features
# Get feature importance scores
feature_importances = rf_model.feature_importances_
# Display feature importance
for feature, importance in zip(X.columns, feature_importances):
    print(f'{feature}: {importance:.4f}')
✔ This helps in Feature Selection, reducing unnecessary features to improve model efficiency.
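Since matplotlib was already imported above, the same scores can also be shown as a quick bar chart:
# Plot the importance scores as a bar chart
plt.bar(X.columns, feature_importances)
plt.ylabel('Importance')
plt.title('Random Forest Feature Importance')
plt.show()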
8. Random Forest vs Decision Tree
| Feature | Decision Tree | Random Forest |
| --- | --- | --- |
| Accuracy | Moderate | High |
| Overfitting | High | Low |
| Interpretability | Easy | Hard |
| Computational Speed | Fast | Slower |
| Handling Missing Data | Moderate | Better |
| Feature Importance | Available, but less stable | More reliable (averaged over many trees) |
📌 Random Forest is more accurate & robust but requires more computation!
9. Random Forest for Regression
Random Forest can also be used for regression tasks (predicting continuous values).
from sklearn.ensemble import RandomForestRegressor
# Regression needs a continuous target, so here we reuse the DataFrame above and predict Salary from Age
X_reg = df[['Age']]
y_reg = df['Salary']
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
# Train Regression Model
rf_regressor = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
rf_regressor.fit(X_reg_train, y_reg_train)
# Predict
y_pred_reg = rf_regressor.predict(X_reg_test)
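Regression quality can then be checked with standard error metrics. On the tiny toy split above the numbers are not meaningful, but the calls are the same for real data:
from sklearn.metrics import mean_absolute_error, r2_score
# Compare predictions against the held-out targets
print('MAE:', mean_absolute_error(y_reg_test, y_pred_reg))
print('R^2:', r2_score(y_reg_test, y_pred_reg))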
✔ Works well for house price prediction, stock price forecasting, sales forecasting, etc.
10. Summary
✔ Random Forest is an ensemble method combining multiple Decision Trees.
✔ Uses Bootstrapping, Random Feature Selection, and Majority Voting (or Averaging for regression).
✔ Reduces Overfitting compared to a single Decision Tree.
✔ Used for both Classification & Regression.
✔ Feature Importance helps in selecting the best features.
Mastering Random Forests is key to building powerful predictive models!