Univariate Analysis: A Comprehensive Guide

Introduction

Univariate Analysis is the simplest form of data analysis, where we analyze one variable at a time. The goal is to understand the distribution, central tendency, dispersion, and outliers of a single feature in a dataset.

Why is Univariate Analysis Important?

✔ Helps understand data distribution (normal, skewed, etc.).
✔ Identifies missing values and outliers.
✔ Helps in feature selection and data transformation.
✔ Provides insights into data variability (spread of values).

Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

We use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
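
Before looking at any single column, it usually helps to check how many values each column is missing, since missing entries affect every statistic computed later. The sketch below is a minimal illustration that uses a small hand-made frame named `sample` (hypothetical data, not the real Titanic values) so that it runs without a network connection. The same calls should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic frame; values are illustrative only.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Share of missing values in each column, as a percentage.
missing_pct = sample.isna().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

A column with a large share of missing values may need imputation or exclusion before its distribution is interpreted.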

Step 3: Types of Univariate Analysis

Univariate analysis is divided into two types based on the nature of the variable:

For Numerical Variables (e.g., Age, Fare, Salary)
- Measures of Central Tendency (Mean, Median, Mode)
- Measures of Dispersion (Variance, Standard Deviation, Range, IQR)
- Data Distribution (Histograms, KDE Plots, Boxplots)

For Categorical Variables (e.g., Gender, Embarked, Class)
- Frequency Distribution
- Count Plots, Pie Charts
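
One way to apply this split in practice is to let pandas separate columns by dtype. The sketch below assumes a small hypothetical frame named `sample`; the variable names `numerical_cols` and `categorical_cols` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in frame; values are illustrative only.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns with a numeric dtype (int or float).
numerical_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```

A dtype-based split is only a first pass: a category stored as integers (a passenger class coded 1, 2, 3, for example) would land in the numerical list and has to be moved by hand.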

Step 4: Univariate Analysis for Numerical Variables

1. Checking Summary Statistics

print(df["Age"].describe())

Output:

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000

✅ Observations:

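Count: 714 non-missing ages out of 891 rows, so 177 values are missing
Mean Age: 29.70 years
Median Age: 28 years
Age Range: 0.42 to 80 years
Age is mildly right-skewed: the mean sits above the median, and the maximum (80) lies much farther from the median than the minimum (0.42) does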
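
`describe()` does not report the mode or a skewness value, so these can be computed separately. The sketch below uses a short hypothetical series named `ages` (illustrative values, not the Titanic data) to show the calls; on the real data the same methods would be applied to `df["Age"]`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one large value on the right.
ages = pd.Series([22.0, 22.0, 24.0, 26.0, 28.0, 30.0, 80.0, np.nan], name="Age")

mean_age = ages.mean()          # NaN is skipped by default
median_age = ages.median()
mode_age = ages.mode().iloc[0]  # mode() returns a Series; take the first value
skewness = ages.skew()          # positive = longer right tail
n_missing = ages.isna().sum()

print("Mean:", round(mean_age, 2))
print("Median:", median_age)
print("Mode:", mode_age)
print("Skewness:", round(skewness, 2))
print("Missing:", n_missing)
```

A mean above the median together with positive skewness points to a right tail; a skewness near zero suggests a roughly symmetric distribution.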

2. Visualizing Data Distribution

a) Histogram

Shows the frequency distribution of numerical data.

sns.histplot(df["Age"], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()

✅ Insights from Histogram:

Shows how values are distributed.
Helps detect skewness and outliers.

b) Kernel Density Estimation (KDE) Plot

A smoothed version of a histogram.

sns.kdeplot(df["Age"], fill=True)  # fill replaces the deprecated shade argument
plt.title("Age KDE Plot")
plt.show()

✅ Helps understand:

Shape of distribution (normal, skewed, etc.).
Density of values (where most values are concentrated).

c) Box Plot

Identifies outliers and spread in data.

sns.boxplot(y=df["Age"])
plt.title("Age Box Plot")
plt.show()

✅ Box Plot Interpretation:

Middle line = Median
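Box edges = 25th and 75th percentiles (Q1 and Q3); the box height is the Interquartile Range (IQR)
Whiskers = Most extreme values that still lie within 1.5 × IQR of the box edges (the default setting)
Dots outside whiskers = Potential outliers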
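
The quantities drawn in a box plot can also be computed by hand, which makes the interpretation concrete. The sketch below assumes the common 1.5 × IQR rule and a small hypothetical array named `values`; the names `lower_fence` and `upper_fence` are introduced here for illustration.

```python
import numpy as np

# Hypothetical data with one value far above the rest.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences mark the limit beyond which a point is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
lower_whisker = values[values >= lower_fence].min()
upper_whisker = values[values <= upper_fence].max()

# Points outside the fences are plotted individually.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Box:", q1, "to", q3, "median", median)
print("Whiskers:", lower_whisker, "to", upper_whisker)
print("Outliers:", outliers)
```

Here the box spans 3 to 7, the whiskers reach 1 and 8, and 30 is the only point drawn as an outlier.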

3. Measuring Spread and Dispersion

a) Standard Deviation & Variance

print("Standard Deviation:", df["Age"].std())
print("Variance:", df["Age"].var())

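✅ A higher standard deviation means values lie farther from the mean on average. Variance is the square of the standard deviation, so it is expressed in squared units.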
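
Two details are easy to miss here: pandas computes the sample standard deviation (`ddof=1`) by default while NumPy computes the population version (`ddof=0`), and the range listed among the dispersion measures is simply max minus min. The sketch below illustrates both on a hypothetical series named `ages`.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population standard deviation is exactly 2.
ages = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = ages.std()               # pandas default: ddof=1 (divide by n - 1)
population_std = ages.std(ddof=0)     # divide by n
numpy_std = np.std(ages.to_numpy())   # NumPy default: ddof=0

value_range = ages.max() - ages.min()

print("Sample std:", round(sample_std, 4))
print("Population std:", population_std)
print("NumPy std:", numpy_std)
print("Range:", value_range)
```

On large datasets the two standard deviations are nearly identical, but the difference can matter when comparing results from pandas and NumPy on small samples.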

b) Interquartile Range (IQR)

Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)

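✅ IQR measures the spread of the middle 50% of values. Because it ignores the tails, it is robust to extreme values, which makes it useful for detecting outliers.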

Step 5: Univariate Analysis for Categorical Variables

1. Frequency Distribution of Categorical Data

print(df["Sex"].value_counts())

Output:

male      577
female    314

✅ Most passengers are male (577 of 891, about 65%).
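
Raw counts are often easier to read next to proportions. The sketch below rebuilds a series with the same counts as the output above (577 male, 314 female) so it runs on its own; the names `sex` and `summary` are introduced here for illustration. On the real data, `df["Sex"]` would take the place of `sex`.

```python
import pandas as pd

# Reconstructed from the counts shown above, not loaded from the dataset.
sex = pd.Series(["male"] * 577 + ["female"] * 314, name="Sex")

counts = sex.value_counts()
shares = sex.value_counts(normalize=True)  # proportions that sum to 1

# Combine counts and percentages in one table.
summary = pd.DataFrame({
    "count": counts,
    "percent": (shares * 100).round(1),
})

print(summary)
```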

2. Visualizing Categorical Data

a) Count Plot

sns.countplot(x=df["Sex"])
plt.title("Count Plot of Sex")
plt.show()

✅ Used to show the frequency of categorical values.

b) Pie Chart

df["Sex"].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightblue', 'pink'])
plt.title("Gender Distribution")
plt.ylabel("")
plt.show()

✅ Shows proportion of categories.

Step 6: Identifying Outliers using IQR Method

Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["Fare"] < lower_bound) | (df["Fare"] > upper_bound)]
print("Number of Outliers:", outliers.shape[0])

✅ Flags values more than 1.5 × IQR below Q1 or above Q3 as potential outliers.
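
To apply the same rule to several columns, the steps can be wrapped in a small function. The sketch below is one possible version: `iqr_outlier_mask` and its parameter `k` are hypothetical names, and `fares` holds illustrative values rather than the real fares.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]. Missing values are never flagged."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 26.0, 30.0, 512.33], name="Fare")

mask = iqr_outlier_mask(fares)
print("Number of outliers:", int(mask.sum()))
print(fares[mask])
```

Returning a mask rather than the filtered rows leaves the choice open: the same mask can be used to inspect, cap, or drop the flagged values.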

Step 7: Handling Outliers & Skewed Data

1. Applying Log Transformation (for Right-Skewed Data)

df["Fare_log"] = np.log1p(df["Fare"])  # log1p computes log(1 + x), so zero fares map to 0; a new column keeps the original Fare intact

✅ Reduces right skew and usually brings the distribution closer to symmetric, although it does not guarantee a normal distribution.
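
Whether the transformation helped can be checked by comparing skewness before and after. The sketch below uses a short hypothetical series named `fares` (illustrative values only) and also shows that `np.expm1` reverses `np.log1p` when the original scale is needed again.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed fares, including a zero.
fares = pd.Series([0.0, 7.25, 7.92, 8.05, 13.0, 26.0, 71.28, 512.33], name="Fare")

fare_log = np.log1p(fares)  # log(1 + x), defined at zero

skew_before = fares.skew()
skew_after = fare_log.skew()

# expm1 is the inverse of log1p, so the original values can be recovered.
fare_back = np.expm1(fare_log)

print("Skewness before:", round(skew_before, 2))
print("Skewness after:", round(skew_after, 2))
```

If the skewness stays far from zero after the transformation, a different approach (capping extreme values, for instance) may be worth considering.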

Step 8: Key Insights from Univariate Analysis

✔ Numerical Variables: Histograms, KDE plots, Box plots reveal distribution & outliers.
✔ Categorical Variables: Count plots & Pie charts show frequency & proportion.
✔ Dispersion Analysis: Variance, IQR, and Standard Deviation quantify spread.
✔ Outlier Detection: Box plots & IQR method identify extreme values.
✔ Skewness Handling: Log transformation can be applied to highly skewed data.