Scatter Plots and Pair Plots

Scatter Plots and Pair Plots: A Comprehensive Guide

Introduction

Scatter plots and pair plots are essential tools in Exploratory Data Analysis (EDA). They help visualize relationships between numerical variables and detect trends, correlations, and patterns in datasets.

  • Scatter Plot: Shows the relationship between two numerical variables.
  • Pair Plot: Displays scatter plots for multiple numerical variables, comparing them in pairs.

I. Scatter Plots

What is a Scatter Plot?

A scatter plot represents data points using Cartesian coordinates, where:

  • The X-axis represents one numerical variable.
  • The Y-axis represents another numerical variable.
  • Each dot (point) represents a data observation.

Why Use Scatter Plots?

✔ Helps identify correlations (positive, negative, or none).
✔ Reveals trends, clusters, and outliers.
✔ Useful for linear and non-linear relationships.
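
The three correlation cases named above can also be checked numerically. The sketch below uses synthetic data (not the Titanic dataset) and a hypothetical helper named pearson_r to show what positive, negative, and near-zero correlation look like as numbers.

```python
import numpy as np

# Synthetic data (hypothetical, not from the Titanic dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(scale=0.5, size=500)

y_pos = 2 * x + noise          # rises with x
y_neg = -2 * x + noise         # falls as x rises
y_none = rng.normal(size=500)  # generated independently of x


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)
print(f"positive: {r_pos:.2f}, negative: {r_neg:.2f}, none: {r_none:.2f}")
```

Pearson's coefficient only measures linear association, so a clearly curved pattern in a scatter plot can still produce a value near zero. That is one reason to look at the plot as well as the number.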


Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

We’ll use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
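
Before plotting, it can help to check how many values are missing, because a row without an x or y value cannot be drawn as a point. The Titanic data is known to have gaps in Age. The sketch below uses a small made-up frame named sample with Titanic-style column names, so it runs without downloading anything; the same two calls should work on df.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with Titanic-style column names (values are made up).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Survived": [0, 1, 1, 1, 0],
})

# Count missing values in the two columns we intend to plot.
missing = sample[["Age", "Fare"]].isna().sum()
print(missing)

# Rows that have both values and can therefore appear in an Age vs Fare plot.
plottable = sample.dropna(subset=["Age", "Fare"])
print(len(plottable), "of", len(sample), "rows have both Age and Fare")
```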

Step 3: Creating a Basic Scatter Plot

Let’s visualize the relationship between Age and Fare.

plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"])
plt.title("Scatter Plot of Age vs Fare")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()

Interpretation:

  • Each point represents a passenger.
  • Helps check if older passengers tend to pay more/less for tickets.
  • If points form an upward trend, there’s a positive correlation.

Step 4: Adding Customizations to the Scatter Plot

plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"], alpha=0.6, edgecolor="black", color="blue")
plt.title("Customized Scatter Plot: Age vs Fare")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.grid(True)
plt.show()

Customizations:

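  • alpha=0.6 → Makes points semi-transparent so that overlapping points remain visible.
  • edgecolor="black" → Outlines each point so individual markers stand out.
  • color="blue" → Sets the fill color of the markers.
  • plt.grid(True) → Adds a background grid for better readability.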

Step 5: Scatter Plot with Categories (Color Encoding)

We can use hue to differentiate groups (e.g., Survival status).

plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Age"], y=df["Fare"], hue=df["Survived"], palette="coolwarm")
plt.title("Scatter Plot: Age vs Fare (Survival Status)")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()

Observations:

  • Different colors represent Survived (1) vs Not Survived (0).
  • If high-fare passengers survive more, color distribution will show it.
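
A grouped summary is a useful numeric companion to the color split. The sketch below computes the median fare per survival group on a handful of made-up rows (not real passenger records); the frame name sample and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows using the Titanic column names; not real passenger records.
sample = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.00, 26.55, 53.10, 71.28, 7.90, 30.00],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Median fare per survival group: a numeric counterpart to the color split.
median_fare = sample.groupby("Survived")["Fare"].median()
print(median_fare)
```

On the real data the equivalent call would be df.groupby("Survived")["Fare"].median(). The median is used here rather than the mean because a few very high fares can pull the mean upward.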

Step 6: Scatter Plot with Regression Line (Trend Line)

We can add a regression line to check linear relationships.

plt.figure(figsize=(8,5))
sns.regplot(x=df["Age"], y=df["Fare"], scatter_kws={"alpha":0.5}, line_kws={"color":"red"})
plt.title("Scatter Plot with Regression Line")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()

Regression Line:

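  • The red line shows the overall linear trend; the shaded band around it is a 95% confidence interval for the fit.
  • The sign of the slope gives the direction of the linear correlation (positive or negative), not its strength.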
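
To read the trend as numbers rather than from the picture, a straight line can be fitted directly. The sketch below assumes a plain least-squares line, which is what regplot draws with its default settings as far as I know, and uses synthetic data with a known slope instead of the Titanic columns. Missing x values are removed first, mirroring what has to happen with Age.

```python
import numpy as np

# Synthetic data with a known slope of 0.5 (hypothetical, not Titanic values).
rng = np.random.default_rng(1)
x = rng.uniform(0, 80, size=300)
y = 10 + 0.5 * x + rng.normal(scale=5, size=300)

# Knock out a few x values to mimic missing ages, then keep complete pairs only.
x[:20] = np.nan
mask = ~np.isnan(x) & ~np.isnan(y)

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope corresponds to an upward-tilting line in the plot and a negative slope to a downward one.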

II. Pair Plots

What is a Pair Plot?

A pair plot (also called a scatterplot matrix) shows scatter plots for every pair of numerical variables in the dataset.

Why Use Pair Plots?

✔ Compares all numerical variables at once.
✔ Helps detect relationships, clusters, and trends.
✔ Includes histograms on the diagonal for individual distributions.


Step 7: Creating a Pair Plot

sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]])
plt.show()

Interpretation:

  • Each subplot is a scatter plot of two features.
  • The diagonal shows histograms (distribution of each feature).
  • Helps identify correlations & patterns.
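
A correlation matrix gives one number for each off-diagonal panel of a pair plot, which makes a handy cross-check. The sketch below builds a synthetic frame (column names borrowed from the Titanic data, values generated) in which Fare is deliberately tied to Pclass and Age is unrelated to both.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three numeric columns (names borrowed from Titanic).
rng = np.random.default_rng(2)
pclass = rng.integers(1, 4, size=200)                # values 1, 2 or 3
fare = 100 / pclass + rng.normal(scale=5, size=200)  # lower fare for larger class number
age = rng.uniform(1, 80, size=200)                   # generated independently
sample = pd.DataFrame({"Age": age, "Fare": fare, "Pclass": pclass})

# Pairwise Pearson correlations: one value per pair of columns.
corr = sample.corr()
print(corr.round(2))
```

On the real data, df[["Age", "Fare", "Pclass", "Survived"]].corr() would produce the matching table for the pair plot above.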

Step 8: Pair Plot with Categories (Hue Parameter)

We can color-code the points based on a categorical feature (e.g., Survived).

sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived", palette="coolwarm")
plt.show()

Observations:

  • Different colors show how survival is distributed across features.
  • If survival is associated with Fare or Age, the two colors will separate along that variable's axis.

Step 9: Pair Plot with KDE for Distribution Analysis

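Without hue, the diagonal shows histograms by default. Setting diag_kind="kde" replaces them with KDE plots for smooth density estimation. When hue is set, seaborn already defaults to KDE on the diagonal, so the argument below simply makes that choice explicit.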

sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived", diag_kind="kde")
plt.show()

Why KDE?

  • KDE gives a smooth density curve, unlike a histogram.
  • Helps understand underlying distributions better.
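
To see where the smooth curve comes from, here is a hand-rolled Gaussian KDE: a small bump is placed on every observation and the bumps are averaged. This is a sketch for intuition, not seaborn's internal implementation. The function name gaussian_kde_1d, the synthetic ages, and the choice of Scott's rule for the bandwidth are all assumptions.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Synthetic "ages" (hypothetical values, not the Titanic column).
rng = np.random.default_rng(3)
ages = rng.normal(loc=30, scale=12, size=400)

# Scott's rule of thumb for the bandwidth (an assumed, commonly used choice).
bandwidth = ages.std(ddof=1) * len(ages) ** (-1 / 5)

grid = np.linspace(ages.min() - 4 * bandwidth, ages.max() + 4 * bandwidth, 500)
density = gaussian_kde_1d(ages, grid, bandwidth)

# A density curve should enclose an area of roughly 1.
area = float(density.sum() * (grid[1] - grid[0]))
print(f"bandwidth={bandwidth:.2f}, area under curve={area:.3f}")
```

A larger bandwidth gives a smoother but less detailed curve, and a smaller one does the opposite, much like the bin width of a histogram.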

Step 10: Selecting Specific Features in Pair Plot

If the dataset has too many numerical columns, we can select specific ones.

features = ["Age", "Fare", "Pclass", "SibSp", "Survived"]
sns.pairplot(df[features], hue="Survived", palette="coolwarm")
plt.show()

Benefits:

  • Avoids clutter in large datasets.
  • Focuses on the most relevant variables.
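
One way to decide what to keep is to list the numeric columns first and count how many panels they would produce. The sketch below uses a made-up frame with Titanic-style names. It assumes that a column passed as hue is used for color rather than drawn as an axis, which is my understanding of seaborn's behavior.

```python
import pandas as pd

# Made-up frame with a mix of numeric and text columns (Titanic-style names).
sample = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Survived": [0, 1, 1],
})

# Numeric columns are the candidates for a pair plot.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Assumption: with hue="Survived", that column colors the points and is not an axis.
plot_cols = [c for c in numeric_cols if c != "Survived"]
n_panels = len(plot_cols) ** 2
print(numeric_cols, "->", n_panels, "panels")
```

The panel count grows with the square of the number of columns, so trimming even a few features shrinks the grid noticeably.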

Key Differences: Scatter Plot vs Pair Plot

Feature  | Scatter Plot                              | Pair Plot
Purpose  | Shows relationship between two variables  | Compares all numerical variables at once
Displays | Single scatter plot                       | Multiple scatter plots in a grid
Outliers | Shows in two variables                    | Detects in multiple variables
Best For | Understanding one relationship at a time  | Finding overall patterns in data

Key Takeaways

  • Scatter Plots reveal relationships between two numerical variables.
  • Pair Plots provide a complete comparison of multiple variables.
  • Hue parameter in both plots helps visualize categorical effects.
  • Regression lines in scatter plots highlight trends and correlations.
  • Pair Plots with KDE provide a smooth distribution analysis.

