Bivariate and Multivariate Analysis: A Comprehensive Guide
Introduction
What is Bivariate and Multivariate Analysis?
- Bivariate Analysis: Examines the relationship between two variables.
- Multivariate Analysis: Analyzes the relationships among more than two variables simultaneously.
Both techniques help reveal dependencies, correlations, patterns, and trends in data, which in turn informs feature selection and predictive modeling.
Step 1: Importing Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr
Step 2: Loading the Dataset
We’ll use the Titanic dataset for demonstration.
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
I. BIVARIATE ANALYSIS (Two-Variable Analysis)
Bivariate analysis is classified into three types based on the nature of the variables:
- Numerical vs Numerical
- Numerical vs Categorical
- Categorical vs Categorical
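Which of the three cases applies depends on the column types. The sketch below is one possible way to decide this automatically; the helper names `is_categorical` and `bivariate_case`, the `max_levels` threshold, and the toy rows are all assumptions made for illustration, not part of the Titanic workflow itself.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_categorical(series, max_levels):
    # Non-numeric columns, and numeric columns with few distinct values
    # (such as a 0/1 flag), are treated as categorical.
    return (not is_numeric_dtype(series)) or series.nunique() <= max_levels


def bivariate_case(df, col_a, col_b, max_levels=10):
    """Name the bivariate case for a pair of columns."""
    n_numerical = sum(
        not is_categorical(df[c], max_levels) for c in (col_a, col_b)
    )
    return {
        2: "numerical vs numerical",
        1: "numerical vs categorical",
        0: "categorical vs categorical",
    }[n_numerical]


# Hypothetical toy rows that mimic a few Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# max_levels=3 only because the toy frame is tiny.
for a, b in [("Age", "Fare"), ("Age", "Survived"), ("Sex", "Survived")]:
    print(a, b, "->", bivariate_case(toy, a, b, max_levels=3))
```

The distinct-value threshold is a heuristic: a numeric code such as Pclass is better treated as categorical even though it is stored as an integer.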
Case 1: Numerical vs Numerical Variables
1. Scatter Plot (Visualizing Relationships)
Used when both variables are continuous.
sns.scatterplot(x=df["Age"], y=df["Fare"])
plt.title("Scatter Plot: Age vs Fare")
plt.show()
✅ Interpretation:
- Positive Relationship: If Fare increases with Age.
- Negative Relationship: If Fare decreases with Age.
- No Relationship: If points are randomly scattered.
2. Correlation Analysis (Measuring Relationships)
The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables.
pair = df[["Age", "Fare"]].dropna()  # keep only rows where both values are present
corr, _ = pearsonr(pair["Age"], pair["Fare"])
print(f"Pearson Correlation: {corr:.3f}")
✅ Interpretation:
- Near +1 → Strong positive linear correlation
- Near 0 → Little or no linear correlation
- Near -1 → Strong negative linear correlation
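Pearson only captures linear relationships. When a relationship is monotonic but curved, the Spearman rank correlation (`spearmanr`, imported in Step 1) can be more informative. The sketch below uses made-up values, not Titanic data, and drops incomplete rows jointly so both inputs keep the same length.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows monotonically with x, but not linearly (y = x**3).
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, np.nan, 8],
    "y": [1, 8, 27, 64, 125, 216, 343, np.nan],
})

# Drop rows where either value is missing so the two columns stay aligned.
pair = toy[["x", "y"]].dropna()

pearson_r, _ = pearsonr(pair["x"], pair["y"])
spearman_rho, _ = spearmanr(pair["x"], pair["y"])

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Here Spearman reports a perfect monotonic relationship, while Pearson stays below 1 because the points do not lie on a straight line.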
3. Heatmap (Overall Correlation Matrix)
Displays correlation between multiple numerical variables.
plt.figure(figsize=(8,5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)  # numeric columns only
plt.title("Correlation Heatmap")
plt.show()
✅ Helps identify relationships across all numerical variables.
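With many columns, a heatmap can be hard to scan by eye. One possible complement is to list the variable pairs ordered by absolute correlation. The sketch below runs on synthetic columns (`a`, `b`, `c`, and `label` are hypothetical names) rather than on the Titanic data.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is almost a multiple of a, c is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "label": ["x", "y"] * 100,  # non-numeric column, skipped by numeric_only
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```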
Case 2: Numerical vs Categorical Variables
1. Box Plot (Comparing Distributions)
Used to compare a numerical variable across different categories.
sns.boxplot(x=df["Survived"], y=df["Age"])
plt.title("Box Plot: Age vs Survived")
plt.show()
✅ Interpretation:
- Shows median, quartiles, and outliers for each category.
- Helps understand whether Age differs between the survived and non-survived groups.
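The numbers behind a box plot can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers and uses a small hypothetical frame; `box_stats` is an illustrative helper, not a pandas or seaborn function.

```python
import pandas as pd

# Hypothetical ages for two groups (not the real Titanic values).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age": [20, 22, 24, 26, 80, 28, 30, 32, 34, 36],
})


def box_stats(s):
    """Quartiles plus the count of points beyond the 1.5 * IQR fences."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = ((s < lower) | (s > upper)).sum()
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# One row of statistics per category.
summary = pd.DataFrame(
    {group: box_stats(ages) for group, ages in toy.groupby("Survived")["Age"]}
).T
print(summary)
```

In this toy data the age of 80 falls outside the upper fence of its group, so it would be drawn as an individual outlier point.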
2. Violin Plot (Comparing Distributions with Density)
sns.violinplot(x=df["Pclass"], y=df["Fare"])
plt.title("Violin Plot: Pclass vs Fare")
plt.show()
✅ Shows distribution shape and data density along with box plot statistics.
3. Bar Plot (Mean of Numerical Feature per Category)
sns.barplot(x=df["Sex"], y=df["Fare"])
plt.title("Bar Plot: Average Fare Paid by Gender")
plt.show()
✅ Helps compare group-wise averages.
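A group mean can be pulled upward by a few large values, so it may help to look at the median and group size next to it. A minimal sketch with made-up fares:

```python
import pandas as pd

# Hypothetical fares; one very large value skews the female mean.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Fare": [7.0, 8.0, 9.0, 10.0, 20.0, 300.0],
})

# Mean, median, and group size side by side for each category.
summary = toy.groupby("Sex")["Fare"].agg(["mean", "median", "count"])
print(summary)
```

When the mean and median of a group differ widely, as for the female group here, the bar height alone can be misleading.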
Case 3: Categorical vs Categorical Variables
1. Cross Tabulation (Frequency Table)
cross_tab = pd.crosstab(df["Sex"], df["Survived"])
print(cross_tab)
✅ Shows how survival varies by gender.
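Raw counts are hard to compare when the groups differ in size. Passing `normalize="index"` to `pd.crosstab` turns each row into proportions. The rows below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical passengers: 4 women (3 survived), 6 men (2 survived).
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 6,
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
})

# Each row sums to 1, so column 1 is the survival rate per group.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```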
2. Grouped Bar Chart
sns.countplot(x="Sex", hue="Survived", data=df)
plt.title("Survival Count by Gender")
plt.show()
✅ Side-by-side comparison of one categorical feature's counts across the levels of another.
3. Chi-Square Test (Statistical Relationship)
Tests whether two categorical variables are independent.
from scipy.stats import chi2_contingency
chi2, p, _, _ = chi2_contingency(cross_tab)
print(f"Chi-square Test p-value: {p:.3f}")
✅ If p < 0.05, reject the hypothesis that the two variables are independent; the association is statistically significant at the 5% level.
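A small p-value says an association exists but not how strong it is. One common effect-size measure is Cramér's V, sketched below on a hypothetical 2×2 table. The counts are invented, and `correction=False` is an assumption made here to switch off the Yates continuity correction that `chi2_contingency` applies to 2×2 tables by default.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (rows: Sex, columns: Survived).
table = pd.DataFrame(
    [[30, 10], [10, 30]],
    index=["female", "male"],
    columns=[0, 1],
)

chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
n = table.to_numpy().sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.5f}")
print(f"Cramér's V = {cramers_v:.2f}")
```

The `expected` array holds the counts that would be seen under independence, which is useful for checking that no cell is too sparse for the test.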
II. MULTIVARIATE ANALYSIS (More Than Two Variables)
Multivariate analysis helps in understanding how multiple factors interact.
1. Pair Plot (Visualizing Pairwise Relationships)
sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived")
plt.show()
✅ Draws a scatter plot for each pair of numerical variables and shows each variable's own distribution on the diagonal, with points colored by the hue variable (Survived).
2. Multivariate Correlation Heatmap
plt.figure(figsize=(8,5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)  # numeric columns only
plt.title("Multivariate Correlation Heatmap")
plt.show()
✅ Highlights relationships between multiple numerical features.
3. Multivariate Regression Analysis
Examines how multiple independent variables impact a dependent variable. Because Survived is binary, the example below fits a logistic regression.
import statsmodels.api as sm
df = df.dropna(subset=["Pclass", "Age", "Fare", "Survived"])  # drop only rows missing a model variable
X = df[["Pclass", "Age", "Fare"]]
y = df["Survived"]
X = sm.add_constant(X) # Adding intercept
model = sm.Logit(y, X).fit()
print(model.summary())
✅ Used to predict survival probability based on multiple features.
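To read a fitted logistic model, it can help to see how coefficients become a probability. The sketch below uses made-up coefficients (not values fitted on the Titanic data) and a hypothetical helper `predicted_probability`. With statsmodels, the fitted coefficients are available as `model.params`.

```python
import numpy as np


def predicted_probability(coefs, intercept, x):
    """Logistic model: p = 1 / (1 + exp(-(intercept + coefs . x)))."""
    log_odds = intercept + np.dot(coefs, x)
    return 1.0 / (1.0 + np.exp(-log_odds))


# Hypothetical coefficients for Pclass, Age, Fare (illustration only).
coefs = np.array([-1.0, -0.03, 0.002])
intercept = 3.0

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in that feature.
odds_ratios = np.exp(coefs)

p_first = predicted_probability(coefs, intercept, [1, 30, 80.0])
p_third = predicted_probability(coefs, intercept, [3, 30, 8.0])

print("Odds ratios:", np.round(odds_ratios, 3))
print(f"1st class, age 30: {p_first:.3f}")
print(f"3rd class, age 30: {p_third:.3f}")
```

A negative coefficient gives an odds ratio below 1, meaning the odds of the outcome shrink as that feature increases.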
4. Principal Component Analysis (PCA)
Used to reduce dimensionality while preserving variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
features = ["Age", "Fare", "Pclass"]
X = StandardScaler().fit_transform(df[features])
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.scatter(X_pca[:,0], X_pca[:,1], c=df["Survived"], cmap="coolwarm")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA - Titanic Data")
plt.show()
✅ Reduces high-dimensional data to two main components for visualization.
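How much information the two components retain is given by the explained variance ratio, which scikit-learn exposes as `pca.explained_variance_ratio_`. The NumPy-only sketch below shows roughly what that computation involves, using synthetic data in place of the Titanic features.

```python
import numpy as np

# Synthetic features: the first two are strongly correlated, the third is noise.
rng = np.random.default_rng(0)
base = rng.normal(size=300)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.3, size=300),
    rng.normal(size=300),
])

# Standardize each column (zero mean, unit variance), as StandardScaler does.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition: rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Squared singular values are proportional to the variance per component.
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal axes.
scores = Z @ Vt[:2].T

print(np.round(explained_variance_ratio, 3))
print(scores.shape)
```

If the first two ratios sum to a value close to 1, the 2D scatter plot loses little of the original variance.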
Key Insights from Bivariate & Multivariate Analysis
✔ Bivariate Analysis:
- Scatter plots, correlation coefficients, and heatmaps for numerical vs numerical relationships.
- Box plots, violin plots, and bar charts for numerical vs categorical comparisons.
- Cross-tabulation, count plots, and chi-square tests for categorical vs categorical relationships.
✔ Multivariate Analysis:
- Pair plots & Heatmaps highlight relationships between multiple variables.
- Regression & PCA help in feature selection & dimensionality reduction.
