Bivariate and Multivariate Analysis

Bivariate and Multivariate Analysis: A Comprehensive Guide

Introduction

What is Bivariate and Multivariate Analysis?

  • Bivariate Analysis: Examines the relationship between two variables.
  • Multivariate Analysis: Examines the relationships among three or more variables simultaneously.

Both techniques reveal dependencies, correlations, patterns, and trends in data, which makes them essential for feature selection and predictive modeling.


Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

Step 2: Loading the Dataset

We’ll use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())

I. BIVARIATE ANALYSIS (Two-Variable Analysis)

Bivariate analysis is classified into three types based on the nature of the variables:

  1. Numerical vs Numerical
  2. Numerical vs Categorical
  3. Categorical vs Categorical

Case 1: Numerical vs Numerical Variables

1. Scatter Plot (Visualizing Relationships)

Used when both variables are continuous.

sns.scatterplot(x=df["Age"], y=df["Fare"])
plt.title("Scatter Plot: Age vs Fare")
plt.show()

Interpretation:

  • Positive Relationship: If Fare increases with Age.
  • Negative Relationship: If Fare decreases with Age.
  • No Relationship: If points are randomly scattered.

2. Correlation Analysis (Measuring Relationships)

The Pearson correlation coefficient measures the strength and direction of a linear relationship between two numerical variables.

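# Drop rows where either value is missing so both columns stay aligned and equal in length
pair = df[["Age", "Fare"]].dropna()
corr, _ = pearsonr(pair["Age"], pair["Fare"])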
print(f"Pearson Correlation: {corr:.3f}")

Interpretation:

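  • +1 → Perfect positive linear correlation
  • 0 → No linear correlation
  • -1 → Perfect negative linear correlation
  • Values in between indicate weaker relationships; the closer to ±1, the stronger the linear association.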
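
Pearson only captures linear relationships. The `spearmanr` function imported in Step 1 measures monotonic relationships instead, using ranks. The sketch below compares the two on made-up toy data (the `toy` DataFrame and its values are assumptions for illustration, not Titanic data); the same calls should work on `df[["Age", "Fare"]].dropna()`.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Toy data (hypothetical): y grows monotonically with x, but not linearly
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6], "y": [1, 4, 9, 16, 25, 36]})

pearson_r, _ = pearsonr(toy["x"], toy["y"])
spearman_rho, _ = spearmanr(toy["x"], toy["y"])

print(f"Pearson:  {pearson_r:.3f}")   # below 1 because the relationship is curved
print(f"Spearman: {spearman_rho:.3f}")  # rank-based, so a monotonic curve scores 1
```

If the two coefficients differ noticeably on real data, that can hint at a non-linear relationship or at outliers pulling the Pearson value.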

3. Heatmap (Overall Correlation Matrix)

Displays the pairwise correlations among all numerical variables at once.

plt.figure(figsize=(8,5))
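# numeric_only=True skips text columns such as Name and Sex, which would otherwise raise an error in recent pandas
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)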
plt.title("Correlation Heatmap")
plt.show()

Helps identify relationships across all numerical variables.


Case 2: Numerical vs Categorical Variables

1. Box Plot (Comparing Distributions)

Used to compare a numerical variable across different categories.

sns.boxplot(x=df["Survived"], y=df["Age"])
plt.title("Box Plot: Age vs Survived")
plt.show()

Interpretation:

  • Shows median, quartiles, and outliers for each category.
  • Suggests whether Age differs between the survived and non-survived groups; a statistical test is needed to confirm that a difference is significant.
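
A box plot only hints at a difference between groups. One way to check it formally is a two-sample test; the sketch below uses the Mann-Whitney U test from SciPy, which does not assume normally distributed ages. The `toy` values are invented for illustration, and the choice of this particular test is an assumption rather than something the original walkthrough prescribes.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy data (hypothetical ages, not taken from the Titanic file)
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Age":      [22, 35, 54, 40, 28, 38, 26, 4, 27, 14],
})

# Split the numerical variable by category
age_died = toy.loc[toy["Survived"] == 0, "Age"]
age_survived = toy.loc[toy["Survived"] == 1, "Age"]

# Two-sided test: do the two groups tend to have different ages?
stat, p_value = mannwhitneyu(age_died, age_survived, alternative="two-sided")
print(f"U statistic: {stat:.1f}, p-value: {p_value:.3f}")
```

On the real dataset, drop missing ages first, for example with `df.dropna(subset=["Age"])`.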

2. Violin Plot (Comparing Distributions with Density)

sns.violinplot(x=df["Pclass"], y=df["Fare"])
plt.title("Violin Plot: Pclass vs Fare")
plt.show()

Shows distribution shape and data density along with box plot statistics.


3. Bar Plot (Mean of Numerical Feature per Category)

sns.barplot(x=df["Sex"], y=df["Fare"])
plt.title("Bar Plot: Average Fare Paid by Gender")
plt.show()

Helps compare group-wise averages.
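
The bar heights are group means, so the same numbers can be read off directly with `groupby`. This is a small sketch on made-up fares (the `toy` DataFrame is an assumption for illustration); on the Titanic data the equivalent call would be `df.groupby("Sex")["Fare"].agg(...)`.

```python
import pandas as pd

# Toy data (hypothetical fares)
toy = pd.DataFrame({
    "Sex":  ["male", "female", "female", "male", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})

# Mean is what the bar plot draws; median and count add context
summary = toy.groupby("Sex")["Fare"].agg(["mean", "median", "count"])
print(summary)
```

Comparing mean and median is useful here because fares are often skewed, and a few expensive tickets can inflate the mean.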


Case 3: Categorical vs Categorical Variables

1. Cross Tabulation (Frequency Table)

cross_tab = pd.crosstab(df["Sex"], df["Survived"])
print(cross_tab)

Shows how survival varies by gender.
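
Raw counts are hard to compare when the groups differ in size. Passing `normalize="index"` to `pd.crosstab` converts each row to proportions, which gives a survival rate per group. The sketch below uses invented toy rows, not the actual Titanic counts.

```python
import pandas as pd

# Toy data (hypothetical): 4 male and 4 female passengers
toy = pd.DataFrame({
    "Sex":      ["male"] * 4 + ["female"] * 4,
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Each row sums to 1, so column 1 is the survival rate within that group
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(rates)
```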


2. Grouped Bar Chart

sns.countplot(x="Sex", hue="Survived", data=df)
plt.title("Survival Count by Gender")
plt.show()

Side-by-side comparison of the counts of one categorical feature, split by another.


3. Chi-Square Test (Statistical Relationship)

Tests whether two categorical variables are independent of each other.

from scipy.stats import chi2_contingency
chi2, p, _, _ = chi2_contingency(cross_tab)
print(f"Chi-square Test p-value: {p:.3f}")

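If p < 0.05, we reject the null hypothesis of independence: the association between the two variables is statistically significant at the 5% level.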
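
A small p-value says an association exists, but not how strong it is. A common companion measure is Cramér's V, which rescales the chi-square statistic to a 0 to 1 range. The sketch below computes it by hand on a made-up 2x2 table (the counts are hypothetical); `correction=False` turns off the continuity correction that SciPy applies to 2x2 tables by default, which is an assumption made here to keep the formula in its textbook form.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = Sex, columns = Survived
table = pd.DataFrame([[30, 70], [80, 20]],
                     index=["female", "male"], columns=[0, 1])

chi2, p, dof, expected = chi2_contingency(table, correction=False)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
n = table.to_numpy().sum()
min_dim = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))

print("Expected counts under independence:")
print(expected)
print(f"p-value: {p:.4f}, Cramér's V: {cramers_v:.3f}")
```

The `expected` array is also worth inspecting: the chi-square approximation is usually considered unreliable when expected counts are very small.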


II. MULTIVARIATE ANALYSIS (More Than Two Variables)

Multivariate analysis helps in understanding how multiple factors interact.

1. Pair Plot (Visualizing Pairwise Relationships)

sns.pairplot(df[["Age", "Fare", "Pclass", "Survived"]], hue="Survived")
plt.show()

Draws a scatter plot for every pair of variables and, on the diagonal, the distribution of each variable, colored by Survived.


2. Multivariate Correlation Heatmap

plt.figure(figsize=(8,5))
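# numeric_only=True skips text columns such as Name and Sex, which would otherwise raise an error in recent pandas
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5)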
plt.title("Multivariate Correlation Heatmap")
plt.show()

Highlights relationships between multiple numerical features.


3. Multivariate Regression Analysis

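Examines how multiple independent variables impact a dependent variable. Because Survived is binary (0 or 1), the example fits a logistic regression rather than a linear one.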

import statsmodels.api as sm

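# Drop only rows missing a model variable; a bare dropna() would also discard every row without a Cabin value
df = df.dropna(subset=["Pclass", "Age", "Fare", "Survived"])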
X = df[["Pclass", "Age", "Fare"]]
y = df["Survived"]

X = sm.add_constant(X)  # Adding intercept
model = sm.Logit(y, X).fit()
print(model.summary())

Used to predict survival probability based on multiple features.


4. Principal Component Analysis (PCA)

Used to reduce dimensionality while preserving variance.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

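features = ["Age", "Fare", "Pclass"]
pca_df = df.dropna(subset=features)  # PCA cannot handle missing values
X = StandardScaler().fit_transform(pca_df[features])  # Standardize so no feature dominates by scale

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Color each passenger by survival outcome
plt.scatter(X_pca[:,0], X_pca[:,1], c=pca_df["Survived"], cmap="coolwarm")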
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA - Titanic Data")
plt.show()

Projects the three standardized features onto two principal components for visualization.
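
How much information survives the projection can be checked with the fitted model's `explained_variance_ratio_` attribute. The sketch below uses synthetic data in which two of three columns are nearly identical (an assumption chosen so that two components capture almost everything); real data will usually give less clear-cut numbers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): columns 0 and 1 are strongly correlated, column 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_toy = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA(n_components=2).fit(X_scaled)

# Share of total variance captured by each component, and by both together
print("Per component:", pca_toy.explained_variance_ratio_)
print("Cumulative:   ", pca_toy.explained_variance_ratio_.sum())

# Rows = components, columns = original features; large absolute values show which features drive a component
print(pca_toy.components_)
```

If the cumulative share is low, a 2D scatter of the components hides much of the structure and should be read with caution.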


Key Insights from Bivariate & Multivariate Analysis

Bivariate Analysis:

  • Scatter plots, correlation coefficients, and heatmaps for numerical vs numerical relationships.
  • Box plots, violin plots, and bar charts for numerical vs categorical comparisons.
  • Cross-tabulation, count plots, and chi-square tests for categorical vs categorical relationships.

Multivariate Analysis:

  • Pair plots & Heatmaps highlight relationships between multiple variables.
  • Regression helps with feature selection, and PCA with dimensionality reduction.
