Box Plots and Histograms: A Comprehensive Guide
Introduction
Box plots and histograms are essential tools in Exploratory Data Analysis (EDA). They help visualize the distribution, spread, central tendency, and outliers in a dataset.
- Histogram: Shows the frequency distribution of a numerical variable.
- Box Plot: Summarizes data distribution using quartiles and highlights outliers.
I. Understanding Histograms
What is a Histogram?
A histogram is a graphical representation of a numerical variable where:
- The X-axis represents the data range (bins).
- The Y-axis represents the frequency (count of values in each bin).
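To make the idea of bins and counts concrete, here is a minimal sketch that computes what a histogram draws, without plotting anything. The `ages` array is a small invented sample (not the Titanic data), and the choice of four bins over the range 0 to 80 is an assumption made only for illustration.

```python
import numpy as np

# Hypothetical sample of ages, invented for illustration only.
ages = np.array([2, 5, 18, 22, 24, 27, 29, 31, 35, 38, 41, 47, 52, 60, 71])

# Four equal-width bins spanning 0 to 80. Each bin includes its left edge;
# only the last bin also includes its right edge.
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Each (left, right, count) triple corresponds to one bar of the histogram.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {n}")
```

The bar heights in a histogram are these counts, and the bar boundaries are these edges.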
Why Use a Histogram?
✔ Shows data distribution (Normal, Skewed, Bimodal, etc.).
✔ Helps detect outliers and skewness.
✔ Visualizes density and spread of values.
Step 1: Importing Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Loading the Dataset
We use the Titanic dataset for demonstration.
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
Step 3: Creating a Histogram
Let’s visualize the Age column distribution.
plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=30, kde=True)
plt.title("Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
✅ Interpretation:
- The bars represent the count of passengers within age groups (bins).
- The KDE curve (Kernel Density Estimation) shows the smoothed density of values.
- The overall shape shows whether the distribution is roughly symmetric or skewed, and whether it has one peak or several.
Step 4: Customizing the Histogram
We can adjust bins, color, and KDE for better insights.
plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=20, kde=True, color="purple", edgecolor="black", alpha=0.7)
plt.title("Customized Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
✅ Customizations:
- bins=20 → Controls granularity (fewer bins give wider bars and a smoother picture).
- edgecolor="black" → Draws an outline around each bar.
- alpha=0.7 → Adjusts transparency.
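Picking the number of bins by hand is a judgment call. As one possible starting point, NumPy can suggest bin edges using standard rules of thumb. The sketch below uses randomly generated values (an assumption, standing in for a real column) and compares three of the rule names accepted by `np.histogram_bin_edges`.

```python
import numpy as np

# Synthetic data standing in for a numeric column; the parameters are arbitrary.
rng = np.random.default_rng(0)
values = rng.normal(loc=30, scale=14, size=700)

bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    # Each rule derives a bin width from the data and returns the bin edges.
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins, width {edges[1] - edges[0]:.2f}")
```

The suggested number can then be passed as `bins=` to the plotting call and adjusted by eye.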
Step 5: Histogram for Skewed Data (Fare Column)
plt.figure(figsize=(8,5))
sns.histplot(df["Fare"], bins=50, kde=True, color="green")
plt.title("Histogram of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()
✅ Observations:
- Right-Skewed Distribution (Most values are low, with few high fares).
- Outliers exist at higher fare values.
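Right skew can also be checked numerically rather than by eye. The sketch below uses a handful of invented fare values (not read from the Titanic file) and compares the mean with the median, then computes the sample skewness with pandas.

```python
import pandas as pd

# Invented fares for illustration only, not taken from the Titanic file.
fares = pd.Series([7.25, 7.9, 8.05, 8.05, 13.0, 13.0, 26.0, 31.0, 71.3, 263.0])

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```

A mean well above the median, together with positive skewness, is the numeric counterpart of the long right tail seen in the histogram.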
Step 6: Handling Skewed Data (Log Transformation)
df["LogFare"] = np.log1p(df["Fare"])  # log1p(x) = log(1 + x), so a fare of 0 maps to 0 instead of -inf
plt.figure(figsize=(8,5))
sns.histplot(df["LogFare"], bins=30, kde=True, color="blue")
plt.title("Log-Transformed Histogram of Fare")
plt.xlabel("Log Fare")
plt.ylabel("Frequency")
plt.show()
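✅ Log transformation compresses the long right tail, which makes the Fare distribution more symmetric and easier to analyze. It reduces skew but does not guarantee a normal distribution.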
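The effect of the transformation can be measured by comparing skewness before and after. This is a minimal sketch on invented fare values (including one zero fare, to show why `log1p` is used); the numbers are assumptions and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Invented fares, including a zero, for illustration only.
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 13.0, 26.0, 31.0, 71.3, 263.0])

log_fares = np.log1p(fares)      # log(1 + x); the zero fare maps to 0.0
raw_skew = fares.skew()
log_skew = log_fares.skew()

# expm1 is the inverse of log1p, so the original scale can be recovered.
restored = np.expm1(log_fares)

print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

Because `np.expm1` undoes the transformation, results computed on the log scale can be reported back in the original units.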
II. Understanding Box Plots
What is a Box Plot?
A box plot (or box-and-whisker plot) is a graphical representation of data distribution using a five-number summary. In the plot, the whisker ends stand in for the minimum and maximum:
- Minimum (Lowest non-outlier value, the end of the lower whisker)
- First Quartile (Q1) – 25th percentile
- Median (Q2) – 50th percentile
- Third Quartile (Q3) – 75th percentile
- Maximum (Highest non-outlier value, the end of the upper whisker)
🚀 Bonus: Outliers are shown as individual points beyond the whiskers!
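The five numbers can be computed directly. The sketch below uses a small invented `ages` array (an assumption, not the Titanic column) and NumPy's default linear-interpolation percentiles.

```python
import numpy as np

# Hypothetical ages, invented for illustration; 98 is deliberately extreme.
ages = np.array([2, 5, 18, 22, 24, 27, 29, 31, 35, 38, 41, 47, 52, 60, 98])

# The 0th and 100th percentiles are the true minimum and maximum of the data.
minimum, q1, median, q3, maximum = np.percentile(ages, [0, 25, 50, 75, 100])

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
```

Note that `np.percentile` returns the true minimum and maximum, whereas a box plot's whiskers may stop short of them when outliers are present. For a column with missing values, drop them first (for example with `dropna()`), since NaN values propagate through `np.percentile`.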
Step 7: Creating a Box Plot
plt.figure(figsize=(6,5))
sns.boxplot(y=df["Age"])
plt.title("Box Plot of Age")
plt.ylabel("Age")
plt.show()
✅ Interpretation:
- Box represents the middle 50% of the data (IQR = Q3 – Q1).
- Line inside the box = Median (Q2).
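- Whiskers extend to the most extreme data points that lie within 1.5 × IQR of the box edges (no lower than Q1 – 1.5 × IQR, no higher than Q3 + 1.5 × IQR).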
- Dots beyond whiskers = Outliers.
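The whisker and outlier rule described above can be sketched in a few lines. The `ages` array is invented for illustration, and the 1.5 multiplier is the conventional default; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical ages, invented for illustration; 98 is deliberately extreme.
ages = np.array([2, 5, 18, 22, 24, 27, 29, 31, 35, 38, 41, 47, 52, 60, 98])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Fences mark the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences.
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(f"whiskers=({whisker_low}, {whisker_high}), outliers={outliers.tolist()}")
```

The whiskers stop at actual data values, so they are usually shorter than the fences themselves.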
Step 8: Box Plot with Categories (Survived vs Age)
plt.figure(figsize=(8,5))
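# Note: seaborn 0.13+ warns when palette is set without hue; there, also pass hue="Survived", legend=False.
sns.boxplot(data=df, x="Survived", y="Age", palette="coolwarm")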
plt.title("Box Plot of Age by Survival")
plt.xlabel("Survived (0 = No, 1 = Yes)")
plt.ylabel("Age")
plt.show()
✅ Observations:
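- The plot compares the age distribution of each group; it does not show survival rates directly.
- The two medians sit close together in this dataset, while the survivors' box reaches slightly lower, reflecting the young children among them.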
- Outliers exist in older age groups.
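A grouped box plot compares distributions, so it is worth backing the visual reading with numbers. The sketch below uses a tiny invented table (the column names mirror the Titanic file, but the values are made up and do not reproduce its figures) to compute the median age per group and the survival rate per age band, which the box plot alone cannot show. The age bands are an arbitrary assumption.

```python
import pandas as pd

# Toy data, invented for illustration; not the Titanic values.
toy = pd.DataFrame({
    "Age": [4, 9, 22, 28, 36, 45, 20, 30, 40, 58, 63, 71],
    "Survived": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Median age per group: the line inside each box.
medians = toy.groupby("Survived")["Age"].median()

# Survival rate per age band (bands are right-inclusive: (0, 18], (18, 40], (40, 80]).
bands = pd.cut(toy["Age"], bins=[0, 18, 40, 80], labels=["child", "adult", "senior"])
rates = toy.groupby(bands, observed=True)["Survived"].mean()

print(medians)
print(rates)
```

Running the same two `groupby` calls on the real `df` (after handling missing ages) gives the figures needed to confirm or reject what the plot suggests.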
Step 9: Box Plot for Skewed Data (Fare Column)
plt.figure(figsize=(8,5))
sns.boxplot(y=df["Fare"])
plt.title("Box Plot of Fare")
plt.ylabel("Fare")
plt.show()
✅ Observations:
- Fare is highly skewed, with extreme outliers.
- Most fares are low, with few passengers paying high fares.
Step 10: Comparing Multiple Categories
plt.figure(figsize=(8,5))
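# Note: seaborn 0.13+ warns when palette is set without hue; there, also pass hue="Pclass", legend=False.
sns.boxplot(data=df, x="Pclass", y="Fare", palette="Set2")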
plt.title("Box Plot: Fare Distribution by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()
✅ Insights:
- First-class passengers (Pclass = 1) paid the highest fares, and the median fare falls with each lower class.
- Outliers exist in all classes.
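The per-class medians and outlier counts can be tabulated alongside the plot. This is a minimal sketch on invented fares (the column names mirror the Titanic file; the values do not), with `count_iqr_outliers` as a hypothetical helper that applies the 1.5 × IQR rule within each class.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (hypothetical helper)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(mask.sum())

# Toy data, invented for illustration; not the Titanic values.
toy = pd.DataFrame({
    "Pclass": [1] * 6 + [2] * 6 + [3] * 6,
    "Fare": [50, 55, 60, 65, 70, 300,
             12, 13, 13, 14, 15, 16,
             7, 7.5, 8, 8, 8.5, 40],
})

# One row per class: the median line of each box and the number of outlier dots.
summary = toy.groupby("Pclass")["Fare"].agg(median="median", outliers=count_iqr_outliers)
print(summary)
```

Because the fences are computed per group, a fare that is ordinary in one class can be an outlier in another.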
Key Differences: Histogram vs Box Plot
Feature | Histogram | Box Plot |
---|---|---|
Purpose | Shows frequency distribution | Shows summary statistics |
Displays | Bins (intervals) | Min, Q1, Median, Q3, Max |
Outliers | Not explicitly shown | Clearly marked |
Shape of Data | Visualizes skewness & modality | Highlights spread & outliers |
Best For | Understanding overall data distribution | Identifying outliers & variability |
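One practical consequence of the table is that a box plot can hide modality that a histogram reveals. The sketch below builds a small invented two-cluster sample (an assumption, chosen to make the point) and shows that the median falls in a region containing no data at all, which only the bin counts expose.

```python
import numpy as np

# Invented bimodal sample: one cluster near 12, another near 32.
values = np.array([10, 11, 12, 12, 13, 14, 30, 31, 32, 32, 33, 34])

# Box-plot view: quartiles only.
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Histogram view: five equal-width bins from 10 to 35.
counts, edges = np.histogram(values, bins=5, range=(10, 35))

print(f"Q1={q1}, median={median}, Q3={q3}")
print(f"bin counts={counts.tolist()}")
```

The quartiles describe a wide box, while the empty middle bins make the two separate groups obvious.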
Key Takeaways
✔ Histograms show data distribution & frequency.
✔ Box plots highlight central tendency, dispersion, and outliers.
✔ Log transformation reduces right skew, though it does not guarantee a normal distribution.
✔ Both methods are essential in Exploratory Data Analysis (EDA).