Bivariate and Multivariate Analysis: A Comprehensive Guide

Introduction

What is Bivariate and Multivariate Analysis?

Bivariate Analysis: Examines the relationship between two variables.
Multivariate Analysis: Examines the relationships among three or more variables simultaneously.

Both techniques help reveal dependencies, correlations, patterns, and trends in data, which in turn inform feature selection and predictive modeling.

Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

Step 2: Loading the Dataset

We’ll use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())

I. BIVARIATE ANALYSIS (Two-Variable Analysis)

Bivariate analysis is classified into three types based on the nature of the variables:

Numerical vs Numerical
Numerical vs Categorical
Categorical vs Categorical
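
As a rough illustration of how a pair of columns maps onto these three cases, the sketch below labels each column as numerical or categorical. The helper names (variable_kind, pair_type), the max_levels threshold, and the toy rows are all assumptions made for this example, not part of the original tutorial. The threshold exists because columns such as Survived and Pclass are stored as integers but behave like categories.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy rows that only mimic the Titanic column names (made-up values).
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

def variable_kind(series, max_levels=5):
    """Heuristic (an assumption): text columns and numeric columns with
    few distinct values are treated as categorical."""
    if not is_numeric_dtype(series) or series.nunique() <= max_levels:
        return "categorical"
    return "numerical"

def pair_type(df, a, b):
    """Return which of the three bivariate cases the pair (a, b) falls into."""
    kinds = sorted([variable_kind(df[a]), variable_kind(df[b])], reverse=True)
    return " vs ".join(kinds)

print(pair_type(toy, "Age", "Fare"))
print(pair_type(toy, "Age", "Survived"))
print(pair_type(toy, "Sex", "Survived"))
```

On a real dataset the threshold would need tuning, and a manual check of each column remains the safer route.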

Case 1: Numerical vs Numerical Variables

1. Scatter Plot (Visualizing Relationships)

Used when both variables are continuous.

sns.scatterplot(x=df["Age"], y=df["Fare"])
plt.title("Scatter Plot: Age vs Fare")
plt.show()

✅ Interpretation:

Positive Relationship: If Fare increases with Age.
Negative Relationship: If Fare decreases with Age.
No Relationship: If points are randomly scattered.

2. Correlation Analysis (Measuring Relationships)

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables.

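# Age has missing values but Fare does not, so drop incomplete rows jointly to keep the pairs aligned
pairs = df[["Age", "Fare"]].dropna()
corr, _ = pearsonr(pairs["Age"], pairs["Fare"])
print(f"Pearson Correlation: {corr:.3f}")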

✅ Interpretation:

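+1 → Perfect positive linear correlation
0 → No linear correlation
-1 → Perfect negative linear correlation
Values in between indicate weaker relationships; the closer to ±1, the stronger the linear association.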
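
To make the coefficient less of a black box, the sketch below computes Pearson's r directly from its definition and compares it with pearsonr. It also calls spearmanr, which is imported in Step 1, to show how a rank-based coefficient behaves on a monotonic but non-linear relationship. The data and the helper name pearson_manual are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up data: y rises monotonically with x, but not along a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

def pearson_manual(a, b):
    """Pearson's r: covariance of a and b divided by the product of their spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float((a_centered * b_centered).sum() / denominator)

r_manual = pearson_manual(x, y)
r_scipy, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print(f"Pearson (manual): {r_manual:.3f}")
print(f"Pearson (scipy):  {r_scipy:.3f}")
print(f"Spearman:         {rho:.3f}")
```

Pearson stays below 1 here because the relationship is curved, while Spearman depends only on the ordering of the values.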

3. Heatmap (Overall Correlation Matrix)

Displays correlation between multiple numerical variables.

plt.figure(figsize=(8,5))
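# numeric_only=True skips text columns such as Name and Sex, which make df.corr() fail in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)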
plt.title("Correlation Heatmap")
plt.show()

✅ Helps identify relationships across all numerical variables.
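
A heatmap is easy to scan, but a ranked list of pairs is sometimes more convenient. The following sketch, which uses made-up rows that only mimic the Titanic column names, pulls each unique pair out of the correlation matrix and sorts the pairs by absolute correlation.

```python
from itertools import combinations

import pandas as pd

# Made-up rows; "Sex" is included to show that text columns are left out.
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
    "Pclass": [3, 1, 3, 1, 1, 3, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male", "female", "female"],
})

# Restrict to numeric columns before computing the correlation matrix.
corr = toy.select_dtypes(include="number").corr()

# Each unordered pair appears once; the diagonal (self-correlation) is skipped.
pairs = sorted(
    ((a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, value in pairs:
    print(f"{a} vs {b}: {value:+.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.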

Case 2: Numerical vs Categorical Variables

1. Box Plot (Comparing Distributions)

Used to compare a numerical variable across different categories.

sns.boxplot(x=df["Survived"], y=df["Age"])
plt.title("Box Plot: Age vs Survived")
plt.show()

✅ Interpretation:

Shows median, quartiles, and outliers for each category.
Suggests whether Age differs between the survived and non-survived groups; a statistical test is still needed to confirm that a difference is significant.
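
The numbers behind a box plot can also be computed directly. The sketch below uses made-up ages and the conventional 1.5 × IQR rule for flagging outliers; the helper name box_stats is hypothetical.

```python
import pandas as pd

# Made-up ages for two groups (0 = did not survive, 1 = survived).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age": [22, 35, 54, 40, 28, 2, 27, 14, 38, 80],
})

def box_stats(values):
    """Quartiles plus the number of points outside the 1.5 * IQR fences."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return {"q1": q1, "median": median, "q3": q3, "n_outliers": n_outliers}

# One row of statistics per category.
summary = pd.DataFrame(
    {group: box_stats(ages) for group, ages in toy.groupby("Survived")["Age"]}
).T

print(summary)
```

In this toy data the age of 80 lies above the upper fence for the survived group, so it would be drawn as an individual outlier point.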

2. Violin Plot (Comparing Distributions with Density)

sns.violinplot(x=df["Pclass"], y=df["Fare"])
plt.title("Violin Plot: Pclass vs Fare")
plt.show()

✅ Shows distribution shape and data density along with box plot statistics.

3. Bar Plot (Mean of Numerical Feature per Category)

sns.barplot(x=df["Sex"], y=df["Fare"])
plt.title("Bar Plot: Average Fare Paid by Gender")
plt.show()

✅ Helps compare group-wise averages: by default each bar shows the group mean, and the error bar shows a confidence interval around it.

Case 3: Categorical vs Categorical Variables

1. Cross Tabulation (Frequency Table)

cross_tab = pd.crosstab(df["Sex"], df["Survived"])
print(cross_tab)

✅ Shows the number of survivors and non-survivors for each gender.
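
Raw counts are hard to compare when the groups differ in size. One option is to pass normalize="index" to pd.crosstab, which turns each row into proportions. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the Titanic column names.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# normalize="index" divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```

The column labelled 1 then reads as the survival rate within each gender.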

2. Grouped Bar Chart

sns.countplot(x="Sex", hue="Survived", data=df)
plt.title("Survival Count by Gender")
plt.show()

✅ Side-by-side comparison of one categorical feature's counts across the levels of another.

3. Chi-Square Test (Statistical Relationship)

Tests whether two categorical variables are independent of each other.

from scipy.stats import chi2_contingency
chi2, p, _, _ = chi2_contingency(cross_tab)
print(f"Chi-square Test p-value: {p:.3f}")

✅ If p < 0.05, reject the null hypothesis of independence: the two variables are significantly associated at the 5% level.
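
For readers who want to see what the test computes, the sketch below derives the expected counts and the chi-square statistic by hand from a made-up 2×2 table and checks them against chi2_contingency. Note that for 2×2 tables chi2_contingency applies Yates' continuity correction by default, so correction=False is passed here to match the textbook formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x2 table of counts (rows: two groups, columns: two outcomes).
observed = np.array([[20, 60],
                     [90, 30]])

# Expected count under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi2_manual = float(((observed - expected) ** 2 / expected).sum())

chi2_scipy, p, dof, expected_scipy = chi2_contingency(observed, correction=False)

print(f"Manual chi-square: {chi2_manual:.3f}")
print(f"SciPy chi-square:  {chi2_scipy:.3f} (dof={dof}, p={p:.3g})")
```

With the default correction left on, the statistic is slightly smaller, which makes the test more conservative on small tables.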

II. MULTIVARIATE ANALYSIS (More Than Two Variables)

Multivariate analysis helps in understanding how multiple factors interact.

1. Pair Plot (Visualizing Pairwise Relationships)

sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived")
plt.show()

✅ Plots a scatter plot for every pair of variables and each variable's distribution on the diagonal, with points colored by Survived.

2. Multivariate Correlation Heatmap

plt.figure(figsize=(8,5))
# numeric_only=True skips text columns, which make df.corr() fail in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)
plt.title("Multivariate Correlation Heatmap")
plt.show()

✅ Highlights relationships between multiple numerical features.

3. Multivariate Regression Analysis

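Examines how multiple independent variables impact a dependent variable. Because Survived is binary (0 or 1), the example fits a logistic regression rather than a linear one.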

import statsmodels.api as sm

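# Drop only rows missing a model feature; a blanket dropna() would discard most rows because Cabin is mostly empty
df.dropna(subset=["Pclass", "Age", "Fare"], inplace=True)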
X = df[["Pclass", "Age", "Fare"]]
y = df["Survived"]

X = sm.add_constant(X)  # Adding intercept
model = sm.Logit(y, X).fit()
print(model.summary())

✅ Used to predict survival probability based on multiple features.
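
The coefficients in a logistic regression summary are on the log-odds scale, which is not intuitive to read. The sketch below shows how such coefficients translate into odds ratios and predicted probabilities. The coefficient values are hypothetical placeholders chosen for illustration, not results fitted on the Titanic data, and predict_probability is a helper name invented for this example.

```python
import math

# Hypothetical coefficients (NOT fitted values), on the log-odds scale.
coef = {"const": 2.5, "Pclass": -1.2, "Age": -0.04, "Fare": 0.001}

def predict_probability(pclass, age, fare):
    """Apply the logistic (sigmoid) function to the linear predictor."""
    log_odds = (coef["const"]
                + coef["Pclass"] * pclass
                + coef["Age"] * age
                + coef["Fare"] * fare)
    return 1.0 / (1.0 + math.exp(-log_odds))

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios = {name: math.exp(value) for name, value in coef.items() if name != "const"}

p_first = predict_probability(pclass=1, age=30, fare=80)
p_third = predict_probability(pclass=3, age=30, fare=8)

print(odds_ratios)
print(f"1st class: {p_first:.3f}, 3rd class: {p_third:.3f}")
```

With a fitted statsmodels result, the same odds ratios could be obtained by exponentiating model.params.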

4. Principal Component Analysis (PCA)

Used to reduce dimensionality while preserving variance.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["Age", "Fare", "Pclass"]
pca_df = df.dropna(subset=features)  # PCA cannot handle missing values
X = StandardScaler().fit_transform(pca_df[features])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:,0], X_pca[:,1], c=pca_df["Survived"], cmap="coolwarm")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA - Titanic Data")
plt.show()

✅ Reduces high-dimensional data to two main components for visualization.
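
How much information two components retain depends on the data. As a sketch of what "preserving variance" means, the code below standardizes three synthetic features (random stand-ins, not the Titanic columns) and computes each component's share of the total variance from the eigenvalues of the covariance matrix.

```python
import numpy as np

# Synthetic stand-in for three numeric features; the first two are strongly related.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.3 * rng.normal(size=200),
    -base + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])

# Standardize so that every feature contributes on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along the principal axes.
cov = np.cov(X_std, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh sorts ascending, so reverse

explained_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_ratio, 3))
print(f"First two components keep {explained_ratio[:2].sum():.1%} of the variance")
```

In scikit-learn the same shares are available after fitting as pca.explained_variance_ratio_.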

Key Insights from Bivariate & Multivariate Analysis

✔ Bivariate Analysis:

Scatter plots, correlation coefficients, and correlation heatmaps for numerical vs numerical relationships.
Box plots, violin plots, and bar charts for numerical vs categorical comparisons.
Cross-tabulation, count plots, and chi-square tests for categorical vs categorical relationships.

✔ Multivariate Analysis:

Pair plots & Heatmaps highlight relationships between multiple variables.
Regression quantifies how each feature relates to the outcome, which can guide feature selection, while PCA reduces dimensionality.
