Box Plots and Histograms: A Comprehensive Guide

Introduction

Box plots and histograms are essential tools in Exploratory Data Analysis (EDA). They help visualize the distribution, spread, central tendency, and outliers in a dataset.

Histogram: Shows the frequency distribution of a numerical variable.
Box Plot: Summarizes data distribution using quartiles and highlights outliers.

I. Understanding Histograms

What is a Histogram?

A histogram is a graphical representation of the distribution of a numerical variable, where:

The X-axis is divided into consecutive intervals (bins) that cover the data range.
The Y-axis shows the frequency (count of values that fall in each bin).
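
To make the bin and frequency idea concrete, here is a minimal sketch using NumPy's np.histogram on a small made-up array (the values are illustrative, not taken from the Titanic data). It returns the count per bin and the bin edges, which are the numbers a histogram draws as bars.

```python
import numpy as np

# Illustrative values only (not from the Titanic dataset)
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Four equal-width bins spanning the data range 1..9, so each bin is 2 units wide
counts, edges = np.histogram(values, bins=4)

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which also includes the right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

The counts always add up to the number of values, so changing the number of bins only changes how those values are grouped.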

Why Use a Histogram?

✔ Shows data distribution (Normal, Skewed, Bimodal, etc.).
✔ Helps detect outliers and skewness.
✔ Visualizes density and spread of values.

Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

This guide uses the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
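
Because the Age histogram later drops missing values with dropna(), it can be useful to check how many values are missing before plotting. The sketch below uses a tiny made-up DataFrame (named toy here as an assumption) so it runs without a download; the same two calls should work on df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 0.0],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```

A large share of missing values means the histogram describes only part of the passengers, which is worth keeping in mind when interpreting it.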

Step 3: Creating a Histogram

Let’s visualize the Age column distribution.

plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=30, kde=True)
plt.title("Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

✅ Interpretation:

The bars represent the count of passengers within age groups (bins).
The KDE curve (Kernel Density Estimation) shows the smoothed density of values.
The overall shape shows whether the distribution is roughly symmetric (close to normal) or skewed to one side.

Step 4: Customizing the Histogram

We can adjust bins, color, and KDE for better insights.

plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=20, kde=True, color="purple", edgecolor="black", alpha=0.7)
plt.title("Customized Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

✅ Customizations:

bins=20 → Controls granularity (fewer bins give a smoother, coarser picture).
edgecolor="black" → Draws an outline around each bar.
alpha=0.7 → Adjusts transparency (0 = fully transparent, 1 = opaque).
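
Choosing the number of bins by hand is a judgment call. As a rough sketch, NumPy's np.histogram_bin_edges can suggest bin edges from standard rules such as "sturges", "fd" (Freedman-Diaconis), or "auto". The sample below is synthetic and only stands in for an age-like column.

```python
import numpy as np

# Synthetic, age-like sample (an assumption for illustration, not real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=30, scale=14, size=500)

# Number of bins each rule suggests for this sample
bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1

print(bin_counts)
```

The seaborn documentation describes the bins argument of histplot as accepting such a rule name as well as an integer, so bins="fd" may be worth trying in place of bins=20.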

Step 5: Histogram for Skewed Data (Fare Column)

plt.figure(figsize=(8,5))
sns.histplot(df["Fare"], bins=50, kde=True, color="green")
plt.title("Histogram of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()

✅ Observations:

Right-Skewed Distribution (Most values are low, with few high fares).
Outliers exist at higher fare values.
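
Right skew can also be checked numerically: the mean is pulled above the median and the skewness coefficient is positive. The sketch below uses a synthetic log-normal sample as a stand-in for a fare-like column (an assumption; on the real data you would use df["Fare"]).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a fare-like column
rng = np.random.default_rng(42)
fares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()   # positive for right-skewed data

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```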

Step 6: Handling Skewed Data (Log Transformation)

df["LogFare"] = np.log1p(df["Fare"])  # log1p handles zero values
plt.figure(figsize=(8,5))
sns.histplot(df["LogFare"], bins=30, kde=True, color="blue")
plt.title("Log-Transformed Histogram of Fare")
plt.xlabel("Log Fare")
plt.ylabel("Frequency")
plt.show()

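✅ A log transformation compresses the long right tail, which makes a right-skewed distribution closer to symmetric and easier to analyze. It does not guarantee a normal distribution.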
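
As a small check of what the transformation does, the sketch below compares skewness before and after np.log1p and shows that np.expm1 reverses it. The data is a synthetic right-skewed sample with one zero added (an assumption for illustration), since zero values are the reason log1p is used instead of a plain log.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample with one zero value
rng = np.random.default_rng(42)
fares = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))
fares.iloc[0] = 0.0

log_fares = np.log1p(fares)       # log(1 + x), defined at x = 0

skew_before = fares.skew()
skew_after = log_fares.skew()

recovered = np.expm1(log_fares)   # exp(x) - 1, the inverse of log1p

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Because the transform is invertible, results computed on the log scale can be mapped back to the original units with np.expm1.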

II. Understanding Box Plots

What is a Box Plot?

A box plot (or box-and-whisker plot) is a graphical summary of a data distribution based on the five-number summary:

Minimum (on the plot, the lowest non-outlier value, where the lower whisker ends)
First Quartile (Q1) – 25th percentile
Median (Q2) – 50th percentile
Third Quartile (Q3) – 75th percentile
Maximum (on the plot, the highest non-outlier value, where the upper whisker ends)

🚀 Bonus: Outliers are shown as individual points beyond whiskers!
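
The five numbers can be computed directly with np.percentile. This is a minimal sketch on a made-up array; here no value is an outlier, so the minimum and maximum coincide with the whisker ends.

```python
import numpy as np

# Illustrative values only
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 0th, 25th, 50th, 75th and 100th percentiles
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```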

Step 7: Creating a Box Plot

plt.figure(figsize=(6,5))
sns.boxplot(y=df["Age"])
plt.title("Box Plot of Age")
plt.ylabel("Age")
plt.show()

✅ Interpretation:

Box represents the middle 50% of the data (IQR = Q3 – Q1).
Line inside the box = Median (Q2).
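Whiskers extend to the most extreme data points that lie within 1.5 * IQR of the box edges (no lower than Q1 - 1.5 * IQR, no higher than Q3 + 1.5 * IQR).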
Dots beyond whiskers = Outliers.
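
The whisker and outlier rule can be reproduced by hand. The sketch below applies the 1.5 * IQR rule to a small made-up array; the variable names (lower_fence, upper_fence) are chosen here for illustration.

```python
import numpy as np

# Illustrative values with one obviously extreme point
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 30])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits beyond which a point is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Note that the whiskers stop at actual data points, not at the fences themselves.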

Step 8: Box Plot with Categories (Survived vs Age)

plt.figure(figsize=(8,5))
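# seaborn 0.13+ warns when palette is set without hue; in that case also pass hue=df["Survived"] and legend=False
sns.boxplot(x=df["Survived"], y=df["Age"], palette="coolwarm")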
plt.title("Box Plot of Age by Survival")
plt.xlabel("Survived (0 = No, 1 = Yes)")
plt.ylabel("Age")
plt.show()

✅ Observations:

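Survivors include more very young passengers, which pulls the lower part of their age distribution down.
The median lines of the two groups sit close together, so the difference in typical age is small.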
Outliers exist in older age groups.
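
Reading medians and quartiles off a plot is imprecise, so it can help to compute them per group. The sketch below uses groupby with describe() on a tiny made-up DataFrame (toy is a hypothetical stand-in); on the real data the same call would be df.groupby("Survived")["Age"].describe().

```python
import numpy as np
import pandas as pd

# Toy stand-in with made-up ages; NaN mimics a missing age
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1, 1],
    "Age": [10.0, 20.0, 30.0, 20.0, 40.0, 60.0, np.nan],
})

# One row per group: count, mean, std, min, quartiles, max (NaN is skipped)
age_summary = toy.groupby("Survived")["Age"].describe()

print(age_summary[["count", "25%", "50%", "75%"]])
```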

Step 9: Box Plot for Skewed Data (Fare Column)

plt.figure(figsize=(8,5))
sns.boxplot(y=df["Fare"])
plt.title("Box Plot of Fare")
plt.ylabel("Fare")
plt.show()

✅ Observations:

Fare is highly skewed, with extreme outliers.
Most fares are low, with few passengers paying high fares.

Step 10: Comparing Multiple Categories

plt.figure(figsize=(8,5))
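# seaborn 0.13+ warns when palette is set without hue; in that case also pass hue=df["Pclass"] and legend=False
sns.boxplot(x=df["Pclass"], y=df["Fare"], palette="Set2")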
plt.title("Box Plot: Fare Distribution by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

✅ Insights:

First-class passengers (Pclass = 1) paid the highest fares, and the median fare drops with each lower class.
Outliers exist in all classes.
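
To back up a statement about outliers per category, the 1.5 * IQR rule can be applied within each group. This is a rough sketch with a hypothetical helper, count_outliers, and a made-up DataFrame; it is not part of seaborn or pandas.

```python
import pandas as pd

def count_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(mask.sum())

# Toy stand-in with made-up fares
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Fare": [50, 60, 70, 80, 500, 6, 7, 8, 9, 10],
})

# Apply the rule separately to each passenger class
outliers_per_class = toy.groupby("Pclass")["Fare"].apply(count_outliers)

print(outliers_per_class)
```

On the real data, replacing toy with df gives the number of points drawn beyond the whiskers for each class.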

Key Differences: Histogram vs Box Plot

Feature	Histogram	Box Plot
Purpose	Shows frequency distribution	Shows summary statistics
Displays	Bins (intervals)	Min, Q1, Median, Q3, Max
Outliers	Not explicitly shown	Clearly marked
Shape of Data	Visualizes skewness & modality	Highlights spread & outliers
Best For	Understanding overall data distribution	Identifying outliers & variability
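
The difference between the two views shows up when two samples share the same five-number summary but are distributed differently inside it. The sketch below uses two small made-up arrays: their summaries match, yet their histogram counts differ.

```python
import numpy as np

# Two illustrative samples built to share min, Q1, median, Q3 and max
sample_a = np.array([0, 0, 4, 4, 5, 6, 6, 10, 10])
sample_b = np.array([0, 4, 4, 5, 5, 5, 6, 6, 10])

def five_number_summary(values):
    """Return min, Q1, median, Q3, max as an array."""
    return np.percentile(values, [0, 25, 50, 75, 100])

summary_a = five_number_summary(sample_a)
summary_b = five_number_summary(sample_b)

# Same bins for both samples so the counts are comparable
counts_a, _ = np.histogram(sample_a, bins=5, range=(0, 10))
counts_b, _ = np.histogram(sample_b, bins=5, range=(0, 10))

print("summaries equal:", np.array_equal(summary_a, summary_b))
print("histogram A:", counts_a)
print("histogram B:", counts_b)
```

This is why the two plots are usually read together: the box plot gives the summary, and the histogram shows how values are spread within it.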

Key Takeaways

✔ Histograms show data distribution & frequency.
✔ Box plots highlight central tendency, dispersion, and outliers.
✔ A log transformation reduces right skew by compressing large values.
✔ Both methods are essential in Exploratory Data Analysis (EDA).
