Scatter Plots and Pair Plots: A Comprehensive Guide
Introduction
Scatter plots and pair plots are essential tools in Exploratory Data Analysis (EDA). They help visualize relationships between numerical variables and detect trends, correlations, and patterns in datasets.
- Scatter Plot: Shows the relationship between two numerical variables.
- Pair Plot: Displays scatter plots for multiple numerical variables, comparing them in pairs.
I. Scatter Plots
What is a Scatter Plot?
A scatter plot represents data points using Cartesian coordinates, where:
- The X-axis represents one numerical variable.
- The Y-axis represents another numerical variable.
- Each dot (point) represents a data observation.
Why Use Scatter Plots?
✔ Helps identify correlations (positive, negative, or none).
✔ Reveals trends, clusters, and outliers.
✔ Useful for spotting both linear and non-linear relationships.
Step 1: Importing Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Loading the Dataset
We’ll use the Titanic dataset for demonstration.
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
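Before plotting, it can help to check for missing values: the Titanic Age column is known to contain some, and rows without both coordinates cannot appear as points. The sketch below uses a small hand-made frame called `sample` (a hypothetical stand-in, with made-up values) to show the check:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; values are made up for illustration.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 35.0, np.nan],
    "Fare": [7.25, 8.05, 71.28, 53.10, 8.46],
})

# Count missing entries per column.
missing_counts = sample[["Age", "Fare"]].isna().sum()
print(missing_counts)

# Rows that have both coordinates and can therefore appear as points.
plottable = sample.dropna(subset=["Age", "Fare"])
print(len(plottable), "of", len(sample), "rows have both Age and Fare")
```

Running the same two calls on `df` should show how many passengers would be left out of an Age vs Fare plot.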
Step 3: Creating a Basic Scatter Plot
Let’s visualize the relationship between Age and Fare.
plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"])
plt.title("Scatter Plot of Age vs Fare")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()
✅ Interpretation:
- Each point represents a passenger.
- The plot helps check whether older passengers tended to pay more or less for their tickets.
- If the points form an upward trend, there’s a positive correlation; a downward trend suggests a negative one.
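A visual impression of a trend can be backed up with a number. As a rough sketch, the Pearson correlation coefficient can be computed with pandas; the data here is synthetic (generated so that fare loosely rises with age) and the name `demo` is only illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the Titanic values): fare rises loosely with age.
age = rng.uniform(1, 80, size=200)
fare = 10 + 0.5 * age + rng.normal(0, 5, size=200)
demo = pd.DataFrame({"Age": age, "Fare": fare})

# Pearson correlation coefficient; pandas skips rows with missing values.
r = demo["Age"].corr(demo["Fare"])
print(f"Pearson r = {r:.2f}")
```

Values near +1 or -1 point to a strong linear association, while values near 0 point to little linear association, although a non-linear pattern may still be present.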
Step 4: Adding Customizations to the Scatter Plot
plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"], alpha=0.6, edgecolor="black", color="blue")
plt.title("Customized Scatter Plot: Age vs Fare")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.grid(True)
plt.show()
✅ Customizations:
- alpha=0.6 → Makes points semi-transparent so that dense, overlapping regions remain visible.
- edgecolor="black" → Outlines each point so individual markers are easier to distinguish.
- plt.grid(True) → Adds a background grid for better readability.
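The same customizations can also be written against an explicit Axes object, which makes it clearer which settings belong to the scatter call and which belong to matplotlib. This is a sketch on synthetic data (the `demo` frame is an assumption, not the Titanic values), using the `data=`, `x=`, `y=` form of `sns.scatterplot`:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)

# Synthetic stand-in data, not the Titanic values.
demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.gamma(2.0, 15.0, size=100),
})

fig, ax = plt.subplots(figsize=(8, 5))

# alpha and edgecolor are passed to the scatter call...
sns.scatterplot(data=demo, x="Age", y="Fare", alpha=0.6, edgecolor="black", ax=ax)

# ...while the grid and title are set on the Axes object.
ax.grid(True)
ax.set_title("Customized Scatter Plot: Age vs Fare")

# Call plt.show() in an interactive session to display the figure.
```

With column names passed as strings, the axis labels are taken from the column names, so separate xlabel and ylabel calls become optional.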
Step 5: Scatter Plot with Categories (Color Encoding)
We can use hue to differentiate groups (e.g., Survival status).
plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"], hue=df["Survived"], palette="coolwarm")
plt.title("Scatter Plot: Age vs Fare (Survival Status)")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()
✅ Observations:
- Different colors represent Survived (1) vs Not Survived (0).
- If high-fare passengers survived more often, the survivor color will dominate the upper part of the plot.
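A color pattern like this can be cross-checked with a grouped summary. The sketch below assumes a tiny hand-made frame (`sample`, with illustrative values only) and compares the median fare in each survival group:

```python
import pandas as pd

# Hypothetical mini-sample; the values are illustrative only.
sample = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Fare": [7.25, 8.05, 13.00, 26.55, 71.28, 53.10],
})

# Median fare within each survival group.
median_fare = sample.groupby("Survived")["Fare"].median()
print(median_fare)
```

Applying the same `groupby` call to `df` would give the corresponding medians for the real data.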
Step 6: Scatter Plot with Regression Line (Trend Line)
We can add a regression line to check linear relationships.
plt.figure(figsize=(8,5))
sns.regplot(x=df["Age"], y=df["Fare"], scatter_kws={"alpha":0.5}, line_kws={"color":"red"})
plt.title("Scatter Plot with Regression Line")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()
✅ Regression Line:
- The red line shows the fitted linear trend.
- The sign of the slope gives the direction of the correlation (positive or negative); it does not measure its strength.
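To see how the slope relates to the correlation, a straight line can be fitted by least squares with `np.polyfit`. This is a stand-in illustration on synthetic data with a known downward trend, not a reproduction of seaborn's internals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in data with a known downward trend (true slope = -0.3).
x = rng.uniform(0, 80, size=300)
y = 50 - 0.3 * x + rng.normal(0, 4, size=300)

# Least-squares straight line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation of the same data.
r = np.corrcoef(x, y)[0, 1]

print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, r = {r:.2f}")
```

The slope and the correlation coefficient share the same sign, but the slope depends on the units of the two variables, whereas the coefficient always lies between -1 and 1.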
II. Pair Plots
What is a Pair Plot?
A pair plot (also called a scatterplot matrix) shows scatter plots for every pair of numerical variables in the dataset.
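The size of the grid follows directly from the number of variables: n variables give an n x n grid, with n diagonal panels and n(n-1)/2 distinct pairs, each of which appears twice with the axes swapped. A small sketch of that arithmetic, using the column names from this guide:

```python
from itertools import combinations

features = ["Age", "Fare", "Pclass", "Survived"]

n = len(features)
unique_pairs = list(combinations(features, 2))

print(f"grid size: {n} x {n} = {n * n} panels")
print(f"diagonal panels: {n}")
print(f"distinct variable pairs: {len(unique_pairs)}")
for a, b in unique_pairs:
    print(f"  {a} vs {b}")
```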
Why Use Pair Plots?
✔ Compares all numerical variables at once.
✔ Helps detect relationships, clusters, and trends.
✔ Shows each variable's own distribution on the diagonal (histograms by default).
Step 7: Creating a Pair Plot
sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]])
plt.show()
✅ Interpretation:
- Each subplot is a scatter plot of two features.
- The diagonal shows histograms (distribution of each feature).
- Helps identify correlations & patterns.
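A correlation matrix is a natural numeric companion to a pair plot, with one coefficient per panel. The sketch below builds a synthetic frame (`demo`, made-up values in which fare drops as the class number rises) and calls `DataFrame.corr()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Synthetic stand-in columns (not the Titanic values).
pclass = rng.integers(1, 4, size=n)                 # classes 1, 2, 3
fare = 90 - 25 * pclass + rng.normal(0, 8, size=n)  # higher class number, lower fare
age = rng.uniform(1, 80, size=n)
demo = pd.DataFrame({"Age": age, "Fare": fare, "Pclass": pclass})

# Pairwise Pearson correlations between all numeric columns.
corr = demo.corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, mirroring the layout of the pair plot grid.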
Step 8: Pair Plot with Categories (Hue Parameter)
We can color-code the points based on a categorical feature (e.g., Survived).
sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived", palette="coolwarm")
plt.show()
✅ Observations:
- Different colors show how survival is distributed across features.
- If survival depends on Fare or Age, it will be visible.
Step 9: Pair Plot with KDE for Distribution Analysis
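We can set diag_kind="kde" to draw KDE plots on the diagonal for smooth density estimation. Recent seaborn versions already default to KDE on the diagonal when hue is set, so here the argument mainly makes that choice explicit.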
sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived", diag_kind="kde")
plt.show()
✅ Why KDE?
- KDE gives a smooth density curve, unlike a histogram.
- Helps understand underlying distributions better.
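To make the idea of a smooth density curve concrete, here is a hand-rolled Gaussian KDE in NumPy. It is only a sketch: the sample is synthetic, the bandwidth uses a common rule of thumb (Scott's rule), and seaborn's own bandwidth choice and implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic sample standing in for one numeric column.
sample = rng.normal(loc=30.0, scale=12.0, size=150)

# Rule-of-thumb bandwidth (Scott's rule in one dimension): sigma * n^(-1/5).
bandwidth = sample.std(ddof=1) * len(sample) ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data range.
grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 500)

# Average of Gaussian bumps, one centered on each observation.
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 over the grid.
dx = grid[1] - grid[0]
area = density.sum() * dx
print(f"bandwidth = {bandwidth:.2f}, area under curve = {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one smooths away detail, much like choosing narrow or wide histogram bins.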
Step 10: Selecting Specific Features in Pair Plot
If the dataset has too many numerical columns, we can select specific ones.
features = ["Age", "Fare", "Pclass", "SibSp", "Survived"]
sns.pairplot(df[features], hue="Survived", palette="coolwarm")
plt.show()
✅ Benefits:
- Avoids clutter in large datasets.
- Focuses on the most relevant variables.
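When it is not obvious which columns are numeric, `select_dtypes` can list them before choosing a subset. The frame below is a hypothetical mini-example with made-up values:

```python
import pandas as pd

# Hypothetical mini-frame mixing numeric and text columns (values are made up).
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Pclass": [3, 1, 3],
})

# List the numeric columns that a pair plot could use.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Keep only the ones of interest for the plot.
wanted = [c for c in numeric_cols if c in {"Age", "Fare"}]
print(wanted)
```

The resulting list can then be passed as `df[wanted]` to `sns.pairplot`, in the same way as the `features` list above.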
Key Differences: Scatter Plot vs Pair Plot
Feature | Scatter Plot | Pair Plot |
---|---|---|
Purpose | Shows relationship between two variables | Compares all numerical variables at once |
Displays | Single scatter plot | Multiple scatter plots in a grid |
Outliers | Visible across two variables | Visible across several variable pairs |
Best For | Understanding one relationship at a time | Finding overall patterns in data |
Key Takeaways
✔ Scatter Plots reveal relationships between two numerical variables.
✔ Pair Plots provide a pairwise comparison of multiple variables in a single grid.
✔ The hue parameter in both plots helps visualize categorical effects.
✔ Regression lines in scatter plots highlight trends and correlations.
✔ Pair Plots with KDE provide a smooth distribution analysis.