Box Plots and Histograms: A Comprehensive Guide

Introduction

Box plots and histograms are essential tools in Exploratory Data Analysis (EDA). They help visualize the distribution, spread, central tendency, and outliers in a dataset.

  • Histogram: Shows the frequency distribution of a numerical variable.
  • Box Plot: Summarizes data distribution using quartiles and highlights outliers.

I. Understanding Histograms

What is a Histogram?

A histogram is a graphical representation of a numerical variable where:

  • The X-axis represents the data range (bins).
  • The Y-axis represents the frequency (count of values in each bin).
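
To make the bin and frequency idea concrete, here is a minimal sketch that counts values per bin with np.histogram, without drawing anything. The ages array is made up for illustration and is not taken from the Titanic data.

```python
import numpy as np

# Hypothetical ages, made up for illustration (not from the Titanic data).
ages = np.array([2, 5, 18, 22, 24, 27, 29, 31, 35, 38, 41, 47, 52, 64, 71])

# Four equal-width bins spanning 0 to 80. Each bin includes its left edge;
# only the last bin also includes its right edge.
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Each (left edge, right edge, count) triple corresponds to one bar.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {count}")
```

The counts are the bar heights a histogram would draw, and they always add up to the number of values.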

Why Use a Histogram?

✔ Shows data distribution (Normal, Skewed, Bimodal, etc.).
✔ Helps detect outliers and skewness.
✔ Visualizes density and spread of values.


Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

We use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())

Step 3: Creating a Histogram

Let’s visualize the Age column distribution.

plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=30, kde=True)
plt.title("Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Interpretation:

  • The bars represent the count of passengers within age groups (bins).
  • The KDE curve (Kernel Density Estimation) shows the smoothed density of values.
  • The overall shape shows whether the distribution is roughly symmetric or skewed.
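
To show what a KDE curve is doing, here is a minimal hand-rolled Gaussian KDE in NumPy. It is a sketch only: the synthetic samples and the bandwidth rule (Scott's rule of thumb) are assumptions for illustration, and the result may not match seaborn's curve exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for an age-like column (assumption, not real data).
samples = rng.normal(loc=30, scale=12, size=300)

# Scott's rule of thumb for 1-D data: h = std * n^(-1/5).
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

# Evaluate the density on an evenly spaced grid that covers the data.
grid = np.linspace(samples.min() - 4 * bandwidth,
                   samples.max() + 4 * bandwidth, 400)

# Place one Gaussian bump on every sample and average them.
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (
    len(samples) * bandwidth * np.sqrt(2 * np.pi)
)

# A density curve should enclose an area of about 1.
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth = {bandwidth:.2f}, area under curve = {area:.3f}")
```

A smaller bandwidth gives a wigglier curve and a larger one gives a smoother curve, much as the bin count does for the bars.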

Step 4: Customizing the Histogram

We can adjust bins, color, and KDE for better insights.

plt.figure(figsize=(8,5))
sns.histplot(df["Age"].dropna(), bins=20, kde=True, color="purple", edgecolor="black", alpha=0.7)
plt.title("Customized Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

Customizations:

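  • bins=20 → Controls granularity (fewer, wider bins than the 30 used earlier).
  • color="purple" → Sets the bar fill color.
  • edgecolor="black" → Outlines each bar.
  • alpha=0.7 → Sets bar transparency (0 is fully transparent, 1 is opaque).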
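
If you are unsure how many bins to use, NumPy can suggest a count from standard rules of thumb. The sketch below compares a few of them on synthetic data (an assumption for illustration, not the Titanic column).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for an age-like column (assumption, not real data).
values = rng.normal(loc=30, scale=14, size=700)

# Ask NumPy for bin edges under different reference rules.
bin_counts = {}
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges define n - 1 bins

print(bin_counts)
```

The seaborn documentation describes the bins argument of histplot as accepting a rule name that is passed on to numpy.histogram_bin_edges, so something like bins="fd" should work there as well.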

Step 5: Histogram for Skewed Data (Fare Column)

plt.figure(figsize=(8,5))
sns.histplot(df["Fare"], bins=50, kde=True, color="green")
plt.title("Histogram of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()

Observations:

  • Right-skewed distribution (most values are low, with a few high fares).
  • Outliers exist at higher fare values.
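
Skewness can also be checked numerically instead of by eye. This sketch uses a synthetic lognormal sample as a stand-in for a fare-like column (an assumption, not the real data); on the actual dataset you would call the same methods on df["Fare"].

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "fares" (assumption: lognormal stand-in).
fares = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # sample skewness; positive means right-skewed

print(f"mean={mean_fare:.2f}  median={median_fare:.2f}  skew={skewness:.2f}")
```

For right-skewed data the mean is pulled above the median by the long tail, and the skewness comes out positive.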

Step 6: Handling Skewed Data (Log Transformation)

df["LogFare"] = np.log1p(df["Fare"])  # log1p handles zero values
plt.figure(figsize=(8,5))
sns.histplot(df["LogFare"], bins=30, kde=True, color="blue")
plt.title("Log-Transformed Histogram of Fare")
plt.xlabel("Log Fare")
plt.ylabel("Frequency")
plt.show()

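A log transformation compresses the long right tail, which makes a right-skewed variable such as Fare more symmetric and easier to analyze. It does not guarantee a normal distribution.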
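
The effect of np.log1p can be measured by comparing skewness before and after. The sketch below uses synthetic data with a few zeros mixed in (an assumption for illustration) and also shows that np.expm1 reverses the transformation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed "fares" (assumption: lognormal stand-in).
fares = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))
fares.iloc[:10] = 0.0  # a few zero values, which a plain log could not handle

log_fares = np.log1p(fares)      # log(1 + x), defined at x = 0
restored = np.expm1(log_fares)   # exp(x) - 1, the inverse of log1p

skew_before = fares.skew()
skew_after = log_fares.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Keeping the inverse in mind matters when reporting results: statistics computed on the log scale need np.expm1 to be read back in the original units.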


II. Understanding Box Plots

What is a Box Plot?

A box plot (or box-and-whisker plot) is a graphical representation of a data distribution based on a five-number summary:

  1. Minimum (Lowest non-outlier value)
  2. First Quartile (Q1) – 25th percentile
  3. Median (Q2) – 50th percentile
  4. Third Quartile (Q3) – 75th percentile
  5. Maximum (Highest non-outlier value)

🚀 Bonus: Outliers are shown as individual points beyond whiskers!
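
The quartiles can be computed directly with np.percentile. This is a small sketch on made-up numbers; note that percentiles 0 and 100 give the raw extremes, which differ from the whisker ends whenever outliers are present.

```python
import numpy as np

# Made-up values for illustration; 50 is a deliberate extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Percentiles 0, 25, 50, 75, 100 (NumPy's default linear interpolation).
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

print(f"min={minimum}, Q1={q1}, median={median}, Q3={q3}, max={maximum}")
```

Different libraries may interpolate quartiles slightly differently, so small discrepancies between tools are expected on small samples.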


Step 7: Creating a Box Plot

plt.figure(figsize=(6,5))
sns.boxplot(y=df["Age"])
plt.title("Box Plot of Age")
plt.ylabel("Age")
plt.show()

Interpretation:

  • Box represents the middle 50% of the data (IQR = Q3 – Q1).
  • Line inside the box = Median (Q2).
  • Whiskers extend to the most extreme data points that lie within 1.5 × IQR of the box edges (below Q1 and above Q3).
  • Dots beyond whiskers = Outliers.
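
The whisker and outlier rule can be reproduced by hand. The following sketch applies the 1.5 × IQR rule to made-up numbers; it illustrates the rule itself rather than the exact internals of seaborn.

```python
import numpy as np

# Made-up values for illustration; 50 is a deliberate extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print(f"whiskers=({whisker_low}, {whisker_high}), outliers={outliers.tolist()}")
```

Here the upper whisker ends at 9, not at the fence of 14.5, because whiskers always land on an actual data point.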

Step 8: Box Plot with Categories (Survived vs Age)

plt.figure(figsize=(8,5))
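# Note: seaborn >= 0.13 warns when palette is passed without hue;
# on those versions, add hue=df["Survived"] and legend=False.
sns.boxplot(x=df["Survived"], y=df["Age"], palette="coolwarm")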
plt.title("Box Plot of Age by Survival")
plt.xlabel("Survived (0 = No, 1 = Yes)")
plt.ylabel("Age")
plt.show()

Observations:

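  • A lower box or median for survivors would suggest that survivors tended to be younger. The plot compares age distributions and does not show survival rates directly.
  • If the two median lines look close, confirm the difference numerically before drawing conclusions.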
  • Outliers exist in older age groups.
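
Reading medians off a box plot is approximate, so a grouped summary is a useful cross-check. This sketch uses a tiny hypothetical sample with the same column names as the Titanic data; on the real dataset you would replace sample with df.

```python
import pandas as pd

# Hypothetical mini-sample (made up), using the Titanic column names.
sample = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [22.0, 35.0, 54.0, None, 4.0, 26.0, 38.0, 27.0],
})

# Missing ages are skipped automatically by count, median, and mean.
summary = sample.groupby("Survived")["Age"].agg(["count", "median", "mean"])
print(summary)
```

The count column is worth checking too, since a group with many missing ages is summarized from fewer passengers than it contains.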

Step 9: Box Plot for Skewed Data (Fare Column)

plt.figure(figsize=(8,5))
sns.boxplot(y=df["Fare"])
plt.title("Box Plot of Fare")
plt.ylabel("Fare")
plt.show()

Observations:

  • Fare is highly skewed, with extreme outliers.
  • Most fares are low, with few passengers paying high fares.

Step 10: Comparing Multiple Categories

plt.figure(figsize=(8,5))
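# Note: seaborn >= 0.13 warns when palette is passed without hue;
# on those versions, add hue=df["Pclass"] and legend=False.
sns.boxplot(x=df["Pclass"], y=df["Fare"], palette="Set2")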
plt.title("Box Plot: Fare Distribution by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Fare")
plt.show()

Insights:

  • First-class passengers (Pclass = 1) generally paid the highest fares, and third-class passengers the lowest.
  • Outliers exist in all classes.
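
To count the outliers in each class rather than eyeball them, the 1.5 × IQR rule can be applied per group. The helper count_iqr_outliers and the sample frame below are hypothetical, written only for this illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())

# Hypothetical mini-sample (made up), using the Titanic column names.
sample = pd.DataFrame({
    "Pclass": [1] * 6 + [3] * 6,
    "Fare": [45, 50, 52, 55, 60, 500, 7, 7.5, 8, 8, 8.5, 9],
})

# Apply the rule separately within each passenger class.
outlier_counts = sample.groupby("Pclass")["Fare"].apply(count_iqr_outliers)
print(outlier_counts)
```

Because the fences are computed within each group, a fare that is ordinary in first class could still count as an outlier in third class.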

Key Differences: Histogram vs Box Plot

Feature       | Histogram                               | Box Plot
Purpose       | Shows frequency distribution            | Shows summary statistics
Displays      | Bins (intervals)                        | Min, Q1, Median, Q3, Max
Outliers      | Not explicitly shown                    | Clearly marked
Shape of Data | Visualizes skewness & modality          | Highlights spread & outliers
Best For      | Understanding overall data distribution | Identifying outliers & variability

Key Takeaways

  • Histograms show data distribution and frequency.
  • Box plots highlight central tendency, dispersion, and outliers.
  • A log transformation reduces right skew, though it does not guarantee normality.
  • Both methods are essential in Exploratory Data Analysis (EDA).

