Univariate Analysis: A Comprehensive Guide

Introduction

Univariate Analysis is the simplest form of data analysis, where we analyze one variable at a time. The goal is to understand the distribution, central tendency, dispersion, and outliers of a single feature in a dataset.

Why is Univariate Analysis Important?

✔ Helps understand data distribution (normal, skewed, etc.).
✔ Identifies missing values and outliers.
✔ Helps in feature selection and data transformation.
✔ Provides insights into data variability (spread of values).


Step 1: Importing Required Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Step 2: Loading the Dataset

We use the Titanic dataset for demonstration.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
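
Before analyzing individual columns, it can help to confirm what was actually loaded. The following is a minimal sketch that uses a small hand-made DataFrame named sample (hypothetical values, not the Titanic file) so that it runs without network access; the same calls should work on df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data (values invented for illustration).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Sex": ["male", "female", "female", "female", "male"],
})

print(sample.shape)             # (number of rows, number of columns)
print(sample.dtypes)            # data type of each column
missing = sample.isna().sum()   # count of missing values per column
print(missing)
```

A column with many missing values, as Age turns out to have in the Titanic data, deserves attention before any statistic computed on it is interpreted.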

Step 3: Types of Univariate Analysis

Univariate analysis is divided into two types based on the nature of the variable:

  1. For Numerical Variables (e.g., Age, Fare, Salary)
    • Measures of Central Tendency (Mean, Median, Mode)
    • Measures of Dispersion (Variance, Standard Deviation, Range, IQR)
    • Data Distribution (Histograms, KDE Plots, Boxplots)
  2. For Categorical Variables (e.g., Gender, Embarked, Class)
    • Frequency Distribution
    • Count Plots, Pie Charts
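
One possible way to split columns into these two groups is by dtype. This is only a sketch on a hypothetical sample DataFrame (invented values), and the dtype is a heuristic: integer-coded categories, such as Pclass in the Titanic data, would be reported as numerical and need to be moved by hand.

```python
import pandas as pd

# Hypothetical sample with two numerical and two categorical columns.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns with a numeric dtype are treated as numerical variables.
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```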

Step 4: Univariate Analysis for Numerical Variables

1. Checking Summary Statistics

print(df["Age"].describe())

Output:

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000

Observations:

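  • Mean Age: 29.70 years
  • Median Age: 28 years
  • Age Range: 0.42 – 80 years
  • Only 714 of the 891 passengers have a recorded age, so 177 values are missing (describe() ignores them).
  • Age is slightly right-skewed: the mean (29.70) is above the median (28), and the maximum (80) lies much farther from the median than the minimum (0.42) does.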
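
The measures of central tendency and the direction of skew can also be computed directly. A minimal sketch, assuming a small hypothetical ages Series (invented values, not the Titanic column); on the real data the same methods can be called on df["Age"].

```python
import pandas as pd

# Hypothetical ages with one very old and one very young passenger.
ages = pd.Series([2, 18, 22, 22, 25, 28, 30, 35, 40, 80], dtype=float)

mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]   # mode() returns a Series; take the first mode
skewness = ages.skew()           # > 0 suggests right skew, < 0 left skew

print("Mean:", mean_age)
print("Median:", median_age)
print("Mode:", mode_age)
print("Skewness:", skewness)
```

A mean above the median together with a positive skewness value points to a right-skewed distribution.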

2. Visualizing Data Distribution

a) Histogram

Shows the frequency distribution of numerical data.

sns.histplot(df["Age"], bins=30, kde=True)
plt.title("Age Distribution")
plt.show()

Insights from Histogram:

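  • Shows how many observations fall into each age bin, and therefore where values cluster.
  • Helps detect skewness (a longer tail on one side) and possible outliers (isolated bars far from the bulk of the data).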
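
The counts behind a histogram can be inspected without plotting. The sketch below uses np.histogram on a hypothetical ages array (invented values) with four bins chosen only for readability.

```python
import numpy as np

# Hypothetical ages (not the Titanic column).
ages = np.array([2, 18, 22, 22, 25, 28, 30, 35, 40, 80], dtype=float)

# Four equal-width bins over 0-80; the last bin includes its right edge.
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {count}")
```

Most of the hypothetical values fall in the 20 to 40 bin, which is the kind of concentration the plotted bars make visible.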

b) Kernel Density Estimation (KDE) Plot

A smoothed version of a histogram.

sns.kdeplot(df["Age"], fill=True)  # fill replaces the deprecated shade argument
plt.title("Age KDE Plot")
plt.show()

Helps understand:

  • Shape of distribution (normal, skewed, etc.).
  • Density of values (where most values are concentrated).

c) Box Plot

Identifies outliers and spread in data.

sns.boxplot(y=df["Age"])
plt.title("Age Box Plot")
plt.show()

Box Plot Interpretation:

  • Middle line = Median
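  • Box edges = 25th and 75th percentiles (Q1 and Q3); the box height is the Interquartile Range (IQR)
  • Whiskers = Most extreme values that lie within 1.5 × IQR of the box edges (the seaborn default)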
  • Dots outside whiskers = Outliers
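
The elements of a box plot can be reproduced numerically. This is a sketch under two assumptions: the default whisker rule of 1.5 × IQR, and a hypothetical ages Series with invented values.

```python
import pandas as pd

# Hypothetical ages (not the Titanic column).
ages = pd.Series([2, 18, 22, 22, 25, 28, 30, 35, 40, 80], dtype=float)

q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: points beyond these are drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences are the outlier dots.
fliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("Box:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", fliers.tolist())
```

Note that the whiskers stop at actual data points, not at the fences themselves.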

3. Measuring Spread and Dispersion

a) Standard Deviation & Variance

print("Standard Deviation:", df["Age"].std())
print("Variance:", df["Age"].var())

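A higher standard deviation means the values are more spread out around the mean. Variance is the square of the standard deviation, so it is expressed in squared units (years² for Age).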
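
One detail worth knowing: pandas computes the sample standard deviation (dividing by n - 1) by default, while NumPy defaults to the population version (dividing by n). The sketch below shows the difference on a small hypothetical values Series.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                 # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)       # divide by n
numpy_std = np.std(values.to_numpy())     # NumPy default: ddof=0
sample_var = values.var()                 # pandas default: ddof=1

print("Sample std:", sample_std)
print("Population std:", population_std)
print("NumPy std:", numpy_std)
print("Sample variance:", sample_var)
```

For a column with hundreds of values the two versions differ only slightly, but the gap is noticeable on small samples.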

b) Interquartile Range (IQR)

Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
print("Interquartile Range (IQR):", IQR)

The IQR covers the middle 50% of the values and is not affected by extreme values, which makes it useful for detecting outliers.


Step 5: Univariate Analysis for Categorical Variables

1. Frequency Distribution of Categorical Data

print(df["Sex"].value_counts())

Output:

male      577
female    314

Most passengers are male: 577 of 891, or about 65%.
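
Relative frequencies are often easier to compare than raw counts. As a sketch, the Series below is rebuilt from the counts shown above (577 male, 314 female) rather than read from the file; on the real data, df["Sex"].value_counts(normalize=True) should give the same proportions.

```python
import pandas as pd

# Rebuild the Sex column from the counts reported above.
sex = pd.Series(["male"] * 577 + ["female"] * 314)

# normalize=True returns proportions instead of counts.
shares = sex.value_counts(normalize=True)

print((shares * 100).round(1))
```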


2. Visualizing Categorical Data

a) Count Plot

sns.countplot(x=df["Sex"])
plt.title("Count Plot of Sex")
plt.show()

Used to show the frequency of categorical values.


b) Pie Chart

df["Sex"].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightblue', 'pink'])
plt.title("Gender Distribution")
plt.ylabel("")
plt.show()

Shows the proportion of each category.


Step 6: Identifying Outliers using IQR Method

Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["Fare"] < lower_bound) | (df["Fare"] > upper_bound)]
print("Number of Outliers:", outliers.shape[0])

This flags fares that lie more than 1.5 × IQR below Q1 or above Q3 as extreme values.
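
The same rule can be wrapped in a small helper so it can be reused for other columns. The function name iqr_outlier_mask, its parameter k, and the fares values are hypothetical and introduced only for this sketch.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Missing values compare as False, so they are never flagged.
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 512.33])
mask = iqr_outlier_mask(fares)

print("Number of Outliers:", int(mask.sum()))
print(fares[mask])
```

Returning a mask instead of the filtered rows leaves it to the caller to decide whether to inspect, cap, or drop the flagged values.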


Step 7: Handling Outliers & Skewed Data

1. Applying Log Transformation (for Right-Skewed Data)

df["Fare_log"] = np.log1p(df["Fare"])  # log1p handles zero values; a new column keeps the raw fares

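Compresses large values, which reduces right skew and makes the distribution closer to symmetric. It does not guarantee a normal distribution.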
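
Whether the transformation helped can be checked by comparing skewness before and after. A minimal sketch on a hypothetical fares Series (invented values, including a zero fare and one very large fare):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed fares.
fares = pd.Series([0.0, 7.25, 7.90, 8.05, 13.00, 14.45, 26.00, 31.00, 512.33])

log_fares = np.log1p(fares)        # log(1 + x), defined for x = 0

skew_before = fares.skew()
skew_after = log_fares.skew()

# expm1 is the inverse of log1p, so the original scale can be recovered.
recovered = np.expm1(log_fares)

print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
```

If the skewness after the transformation is still far from zero, the log transform alone may not be enough for that variable.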


Step 8: Key Insights from Univariate Analysis

  • Numerical Variables: Histograms, KDE plots, and box plots reveal distribution and outliers.
  • Categorical Variables: Count plots and pie charts show frequency and proportion.
  • Dispersion Analysis: Variance, IQR, and standard deviation quantify spread.
  • Outlier Detection: Box plots and the IQR method identify extreme values.
  • Skewness Handling: A log transformation can be applied to highly skewed data.

