Univariate Analysis: A Comprehensive Guide
Introduction
Univariate Analysis is the simplest form of data analysis, where we analyze one variable at a time. The goal is to understand the distribution, central tendency, dispersion, and outliers of a single feature in a dataset.
Why is Univariate Analysis Important?
✔ Helps understand data distribution (normal, skewed, etc.).
✔ Identifies missing values and outliers.
✔ Helps in feature selection and data transformation.
✔ Provides insights into data variability (spread of values).
Step 1: Importing Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Loading the Dataset
We use the Titanic dataset for demonstration.
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
Step 3: Types of Univariate Analysis
Univariate analysis is divided into two types based on the nature of the variable:
- For Numerical Variables (e.g., Age, Fare, Salary)
  - Measures of Central Tendency (Mean, Median, Mode)
  - Measures of Dispersion (Variance, Standard Deviation, Range, IQR)
  - Data Distribution (Histograms, KDE Plots, Boxplots)
- For Categorical Variables (e.g., Gender, Embarked, Class)
  - Frequency Distribution
  - Count Plots, Pie Charts
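Before choosing between the two branches above, it helps to know which columns fall into which group. The snippet below is a minimal sketch of one way to do that with `select_dtypes`; the `toy` frame and its values are hypothetical stand-ins, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
})

# Numerical columns: any integer or float dtype.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Categorical columns: everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split is only a first pass. Integer-coded categories (for example a passenger class stored as 1, 2, 3) will land in the numerical list and need to be moved by hand.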
Step 4: Univariate Analysis for Numerical Variables
1. Checking Summary Statistics
print(df["Age"].describe())
Output:
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
✅ Observations:
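- Count: 714 of 891 rows, so 177 Age values are missing
- Mean Age: 29.70 years
- Median Age: 28 years
- Age Range: 0.42 – 80 years
- Age is mildly right-skewed: the mean is above the median, and the maximum (80) lies much farther from the median than the minimum does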
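`describe()` does not report the mode or a skewness figure, so the observations above can be backed up with a few extra calls. The helper below is a sketch; `summarize_center` and the `sample` values are hypothetical names and data chosen for illustration.

```python
import pandas as pd

def summarize_center(values: pd.Series) -> dict:
    """Return central-tendency and skewness figures for one numeric column."""
    return {
        "missing": int(values.isna().sum()),  # NaN entries; ignored by the stats below
        "mean": values.mean(),
        "median": values.median(),
        "mode": values.mode().tolist(),       # can hold several values when there are ties
        "skew": values.skew(),                # > 0 suggests a longer right tail
    }

# Hypothetical sample, not the Titanic data.
sample = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0, None])
summary = summarize_center(sample)
print(summary)
```

On the real dataset the same helper could be called as `summarize_center(df["Age"])`.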
2. Visualizing Data Distribution
a) Histogram
Shows the frequency distribution of numerical data.
sns.histplot(df["Age"], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()
✅ Insights from Histogram:
- Shows how values are distributed.
- Helps detect skewness and outliers.
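To see the numbers behind the bars, the same binning can be computed without plotting. This is a small sketch using `np.histogram`; the `ages` array is made up for illustration, and 3 bins are used only to keep the output short.

```python
import numpy as np

# Hypothetical ages, including one missing value.
ages = np.array([2.0, 22.0, 24.0, 26.0, 35.0, 38.0, 54.0, np.nan])

# Missing values must be dropped before binning.
valid = ages[~np.isnan(ages)]

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(valid, bins=3)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

The bin count changes the picture: too few bins hide structure, too many make the plot noisy, so it is worth trying a couple of values.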
b) Kernel Density Estimation (KDE) Plot
A smoothed version of a histogram.
sns.kdeplot(df["Age"], fill=True)  # fill replaces the deprecated shade argument
plt.title("Age KDE Plot")
plt.show()
✅ Helps understand:
- Shape of distribution (normal, skewed, etc.).
- Density of values (where most values are concentrated).
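What "smoothed" means here can be shown with a hand-rolled estimate: place a small Gaussian bump on every observation and average them. The code below is an illustrative sketch, not seaborn's implementation; `gaussian_kde_1d`, the `ages` values, and the hand-picked `bandwidth` are all assumptions (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average a Gaussian bump centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    samples = samples[~np.isnan(samples)]  # drop missing values
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical ages, including one missing value.
ages = np.array([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0])
grid = np.linspace(-40, 100, 1401)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)

# The area under a density curve should be close to 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that can hide real bumps; a smaller one follows individual points more closely.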
c) Box Plot
Identifies outliers and spread in data.
sns.boxplot(y=df["Age"])
plt.title("Age Box Plot")
plt.show()
✅ Box Plot Interpretation:
- Middle line = Median
- Box edges = 25th and 75th percentiles (Interquartile Range – IQR)
- Whiskers = Most extreme values within 1.5 × IQR of the box edges (seaborn's default)
- Dots outside whiskers = Outliers
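The elements listed above can also be computed directly, which is handy when exact values are needed rather than a picture. This is a sketch assuming the default 1.5 × IQR whisker rule; `box_plot_stats` and the `sample` values are hypothetical.

```python
import pandas as pd

def box_plot_stats(values: pd.Series, whis: float = 1.5) -> dict:
    """Compute the quantities a box plot draws for one numeric column."""
    values = values.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": values[(values < low_fence) | (values > high_fence)].tolist(),
    }

# Hypothetical sample with one extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
stats = box_plot_stats(sample)
print(stats)
```

Note that the whiskers end at actual data points, not at the fences themselves, so they are usually shorter than 1.5 × IQR.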
3. Measuring Spread and Dispersion
a) Standard Deviation & Variance
print("Standard Deviation:", df["Age"].std())
print("Variance:", df["Age"].var())
✅ High standard deviation = More spread out data.
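One detail worth knowing when comparing results across libraries: pandas computes the sample standard deviation (dividing by n - 1) by default, while NumPy computes the population version (dividing by n). The sketch below shows the difference on made-up `values`.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                    # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0 (divide by n)

# Variance is the square of the standard deviation (same ddof on both sides).
sample_var = values.var()

print("Sample std:", sample_std)
print("Population std:", population_std)
print("Sample variance:", sample_var)
```

Passing `ddof=0` to `values.std()` or `ddof=1` to `np.std` makes the two agree.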
b) Interquartile Range (IQR)
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)
✅ IQR is useful for detecting outliers.
Step 5: Univariate Analysis for Categorical Variables
1. Frequency Distribution of Categorical Data
print(df["Sex"].value_counts())
Output:
male 577
female 314
✅ Most passengers are male (577 of 891, about 64.8%).
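Raw counts are often easier to compare as proportions, and by default `value_counts()` silently drops missing values. The sketch below shows both options on a hypothetical `embarked` column.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
embarked = pd.Series(["S", "C", "S", None, "Q", "S"])

# dropna=False keeps missing values as their own category.
counts = embarked.value_counts(dropna=False)

# normalize=True returns shares of the non-missing values instead of counts.
shares = embarked.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```

On the real dataset the same calls could be applied to `df["Sex"]` or `df["Embarked"]`.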
2. Visualizing Categorical Data
a) Count Plot
sns.countplot(x=df["Sex"])
plt.title("Count Plot of Sex")
plt.show()
✅ Used to show the frequency of categorical values.
b) Pie Chart
df["Sex"].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightblue', 'pink'])
plt.title("Gender Distribution")
plt.ylabel("")
plt.show()
✅ Shows proportion of categories.
Step 6: Identifying Outliers using IQR Method
Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["Fare"] < lower_bound) | (df["Fare"] > upper_bound)]
print("Number of Outliers:", outliers.shape[0])
✅ Helps detect extreme values.
Step 7: Handling Outliers & Skewed Data
1. Applying Log Transformation (for Right-Skewed Data)
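df["Fare_log"] = np.log1p(df["Fare"])  # log1p computes log(1 + x), so zero fares stay defined; the original Fare column is kept
✅ Compresses large values, which reduces right skew and brings the distribution closer to symmetric.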
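Whether the transformation actually helped can be checked by comparing skewness before and after. The snippet below is a sketch on made-up `fares`, not the Titanic column; a skewness nearer to 0 indicates a more symmetric distribution.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed fares, including a zero.
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3])

# log1p(x) = log(1 + x), so a fare of 0 maps to 0 instead of -inf.
log_fares = np.log1p(fares)

skew_before = fares.skew()
skew_after = log_fares.skew()

print("Skewness before:", round(skew_before, 2))
print("Skewness after:", round(skew_after, 2))
```

If the skewness barely moves, the log transform is probably not the right tool for that column.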
Step 8: Key Insights from Univariate Analysis
✔ Numerical Variables: Histograms, KDE plots, Box plots reveal distribution & outliers.
✔ Categorical Variables: Count plots & Pie charts show frequency & proportion.
✔ Dispersion Analysis: Variance, IQR, and Standard Deviation quantify spread.
✔ Outlier Detection: Box plots & IQR method identify extreme values.
✔ Skewness Handling: Log transformation can be applied to highly skewed data.