Data Profiling: A Comprehensive Guide
Introduction
Data Profiling is the process of examining, analyzing, and summarizing data to understand its structure, quality, and characteristics. It helps data scientists and analysts detect inconsistencies, missing values, outliers, and patterns before applying machine learning or data analysis.
Why is Data Profiling Important?
- Helps in data quality assessment (missing values, duplicates, incorrect formats).
- Identifies outliers and anomalies in datasets.
- Aids in data cleaning and preprocessing.
- Supports feature engineering for machine learning.
- Ensures data integrity and reliability.
Step 1: Importing Necessary Libraries
We use Pandas and NumPy for data handling, Seaborn and Matplotlib for plots, and ydata-profiling (formerly Pandas Profiling) for the automated report.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport # Install via: pip install ydata-profiling
Step 2: Loading the Dataset
We’ll use the Titanic dataset for this guide.
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())
Output:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
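Beyond head(), a quick look at the shape and the non-null counts gives a first impression of the dataset's size and completeness. The sketch below uses a small hypothetical frame named toy (its values are illustrative, not taken from the real file) so that it runs without a download; the same calls apply to df.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (illustrative values only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
non_null = toy.count()       # non-null entries per column

print(f"{n_rows} rows x {n_cols} columns")
print(non_null)
toy.info()                   # dtypes, non-null counts, and memory usage in one view
```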
Step 3: Understanding Data Types & Basic Statistics
1. Checking Data Types
print(df.dtypes)
Output:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
Fare float64
Embarked object
dtype: object
✅ Observations:
- Numerical columns: Age, Fare, Pclass (Pclass is stored as an integer but encodes an ordinal ticket class)
- Categorical columns: Sex, Embarked
- String columns: Name, Ticket, Cabin
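The grouping above can also be derived programmatically rather than by eye. The following is a minimal sketch, assuming a hypothetical toy frame with a few Titanic-like columns; it splits columns by dtype with select_dtypes and converts a low-cardinality text column to the category dtype.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Name": ["A", "B", "C"],
})

# Split columns into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Distinct values per column: a low count hints at a categorical variable.
cardinality = toy.nunique()

# Low-cardinality text columns are candidates for the category dtype.
toy["Sex"] = toy["Sex"].astype("category")

print(numeric_cols, other_cols)
print(cardinality)
```

Note that a dtype check alone would still list Pclass as numeric; whether to treat it as a category is a modeling decision.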
2. Descriptive Statistics
print(df.describe())
Output:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
✅ Observations:
- Age has missing values (714 entries instead of 891).
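- Fare has a standard deviation (49.69) larger than its mean (32.20) and a maximum of 512.33, which points to a right-skewed distribution with likely outliers.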
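A large standard deviation is only a hint. Comparing the mean with the median and computing the skewness makes the asymmetry explicit. The sketch below assumes a short hypothetical series of fares (illustrative values, not the real column).

```python
import pandas as pd

# Hypothetical fares with one extreme value (illustrative only).
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33], name="Fare")

summary = fares.describe(percentiles=[0.25, 0.5, 0.75])
skewness = fares.skew()                       # > 0 means a long right tail
mean_above_median = fares.mean() > fares.median()

print(summary)
print("skewness:", skewness)
print("mean above median:", mean_above_median)
```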
Step 4: Identifying Missing Values
print(df.isnull().sum())
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
Cabin 687
Embarked 2
dtype: int64
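Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a hypothetical toy frame, that combines both into one table sorted by the percentage missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing = pd.DataFrame({
    "missing": toy.isnull().sum(),          # absolute count per column
    "percent": toy.isnull().mean() * 100,   # share of rows, in percent
}).sort_values("percent", ascending=False)

print(missing)
```

A column that is mostly empty (like Cabin in the real data, with 687 of 891 values missing) is usually handled differently from one with only a handful of gaps.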
✅ Handling Missing Values:
- Age → Fill with median or predict using other features
- Cabin → Drop or categorize as ‘Unknown’
- Embarked → Fill with mode (‘S’)
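# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna('Unknown')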
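One simple way to use other features when filling Age is to take the median within a related group instead of the global median. The sketch below is one possible approach, not the only one; it assumes a hypothetical toy frame and groups by Pclass.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Median age per class, broadcast back to the original row order.
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each gap with the median of that passenger's own class.
toy["Age"] = toy["Age"].fillna(group_median)

print(toy)
```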
Step 5: Checking for Duplicates
print("Number of duplicate rows:", df.duplicated().sum())
✅ Remove duplicates if found:
df.drop_duplicates(inplace=True)
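Before dropping anything, it can help to look at the duplicated rows and to check for duplicates on a subset of key columns. A minimal sketch, assuming a hypothetical toy frame; the choice of Name and Ticket as key columns is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy frame where rows 0 and 2 are identical (illustrative only).
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Ticket": ["1", "2", "1", "3"],
    "Fare": [7.25, 8.05, 7.25, 9.0],
})

# keep=False marks every member of a duplicate group, so all copies can be inspected.
dupes = toy[toy.duplicated(keep=False)]

# Duplicates judged only on selected key columns (assumed keys).
key_dupes = toy.duplicated(subset=["Name", "Ticket"]).sum()

# Drop exact duplicates, keeping the first occurrence.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dupes)
print("key duplicates:", key_dupes)
print(deduped)
```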
Step 6: Analyzing Outliers
sns.boxplot(x=df["Fare"])
plt.show()
If the boxplot shows extreme values, a log transformation can compress the long right tail (log1p also handles the fares equal to zero):
df['Fare'] = np.log1p(df['Fare']) # Log transformation
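The boxplot's whiskers follow the 1.5 × IQR rule by default, and the same rule can be applied numerically to list the flagged values. The sketch below assumes a short hypothetical series of fares and stores the log-transformed values under a separate name, so the original values stay available.

```python
import numpy as np
import pandas as pd

# Hypothetical fares with one extreme value (illustrative only).
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")

q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whisker range are flagged as potential outliers.
outliers = fares[(fares < lower) | (fares > upper)]

# Keep the transformed values separate instead of overwriting the original.
fares_log = np.log1p(fares)

print(outliers)
print(fares_log)
```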
Step 7: Analyzing Categorical Variables
1. Count of Each Category
print(df["Sex"].value_counts())
print(df["Embarked"].value_counts())
2. Visualizing Categorical Data
sns.countplot(x=df["Sex"])
plt.show()
sns.countplot(x=df["Embarked"])
plt.show()
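By default value_counts() skips missing values and returns absolute counts. Passing dropna=False and normalize=True shows the missing category and the relative shares as well. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing entry (illustrative only).
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", "S", "C"], name="Embarked")

counts = embarked.value_counts(dropna=False)                   # includes NaN as its own row
shares = embarked.value_counts(normalize=True, dropna=False)   # fractions that sum to 1

print(counts)
print(shares)
```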
Step 8: Generating a Data Profiling Report
Use ydata-profiling to generate a full summary.
profile = ProfileReport(df, title="Titanic Data Profiling Report", explorative=True)
profile.to_file("titanic_report.html")
✅ The report includes:
- Overview of data types.
- Summary statistics.
- Correlations between variables.
- Missing values analysis.
- Outlier detection.
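Individual parts of such a report can also be reproduced by hand with plain pandas, which is useful for spot checks. The sketch below computes a correlation matrix on a hypothetical toy frame; it is an illustration of the idea, not how ydata-profiling computes its own correlations.

```python
import pandas as pd

# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 70.0, 30.0, 8.0, 7.0],
    "Sex": ["female", "male", "female", "male", "male"],
})

# Pearson correlation between numeric columns; text columns are skipped.
corr = toy.corr(numeric_only=True)

print(corr)
```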