Data Profiling

Data Profiling: A Comprehensive Guide

Introduction

Data Profiling is the process of examining, analyzing, and summarizing data to understand its structure, quality, and characteristics. It helps data scientists and analysts detect inconsistencies, missing values, outliers, and patterns before applying machine learning or data analysis.

Why is Data Profiling Important?

  • Helps in data quality assessment (missing values, duplicates, incorrect formats).
  • Identifies outliers and anomalies in datasets.
  • Aids in data cleaning and preprocessing.
  • Supports feature engineering for machine learning.
  • Ensures data integrity and reliability.

Step 1: Importing Necessary Libraries

We use pandas and NumPy for data handling, seaborn and Matplotlib for plotting, and ydata-profiling (formerly pandas-profiling) for the automated report.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport  # Install via: pip install ydata-profiling

Step 2: Loading the Dataset

We’ll use the Titanic dataset for this guide.

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
print(df.head())

Output (first two rows shown):

   PassengerId  Survived  Pclass                                                 Name     Sex   Age  SibSp  Parch     Ticket     Fare  Cabin  Embarked
0            1         0       3                              Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0      1      0   PC 17599  71.2833    C85         C
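Before profiling in earnest, it can help to confirm that the file loaded with the shape and columns you expect. The following is a minimal sketch on a tiny inline CSV rather than the real download; the `expected` column set and the sample rows are assumptions made up for illustration.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for the real file (illustrative rows only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Columns we expect to find; adjust this set for your own dataset.
expected = {"PassengerId", "Survived", "Pclass", "Sex", "Age"}
missing_cols = expected - set(toy.columns)

n_rows, n_cols = toy.shape
ids_unique = toy["PassengerId"].is_unique  # an ID column should not repeat

print(f"{n_rows} rows, {n_cols} columns, missing columns: {missing_cols}")
print("PassengerId unique:", ids_unique)
```

The same checks apply unchanged to the full `df` loaded above.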

Step 3: Understanding Data Types & Basic Statistics

1. Checking Data Types

print(df.dtypes)

Output:

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Observation:

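  • Numerical columns: Age, Fare, SibSp, Parch
  • Categorical columns: Sex, Embarked, plus Survived and Pclass, which are stored as integers but encode categories
  • String columns: Name, Ticket, Cabin
  • Identifier: PassengerId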
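Data types alone do not tell you which columns behave as categories, since a column like Pclass is stored as an integer. One rough heuristic is to combine the dtype with the number of distinct values. The sketch below assumes a hypothetical helper `classify_columns` and a made-up toy frame; the `max_categories` threshold is arbitrary and should be tuned per dataset.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative only).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "Name": ["A", "B", "C", "D", "E", "F"],
})

def classify_columns(frame, max_categories=3):
    """Split columns into categorical-like, numeric, and free-text groups."""
    categorical, numeric, text = [], [], []
    for col in frame.columns:
        n_unique = frame[col].nunique(dropna=True)
        if n_unique <= max_categories:
            # Few distinct values: treat as a category regardless of dtype.
            categorical.append(col)
        elif pd.api.types.is_numeric_dtype(frame[col]):
            numeric.append(col)
        else:
            text.append(col)
    return categorical, numeric, text

categorical, numeric, text = classify_columns(toy)
print("categorical:", categorical)
print("numeric:", numeric)
print("text:", text)
```

On a real dataset a larger threshold (for example 10 or 20) is usually more sensible than the value used for this six-row toy frame.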

2. Descriptive Statistics

print(df.describe())

Output (quartile rows omitted):

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Observations:

  • Age has missing values (714 entries instead of 891).
  • Fare is right-skewed with likely outliers: its standard deviation (49.69) exceeds its mean (32.20), and the maximum (512.33) is far above both.
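A large standard deviation is only a hint. Comparing the mean with the median and checking the skewness gives a more direct read on whether a few extreme values are stretching the distribution. This is a small sketch on made-up fare values, not the real column.

```python
import pandas as pd

# Made-up fares with one extreme value, loosely imitating the real column.
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 26.55, 71.28, 512.33], name="Fare")

summary = {
    "mean": fares.mean(),
    "median": fares.median(),
    "std": fares.std(),
    "skew": fares.skew(),  # positive values indicate a long right tail
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean well above the median together with a clearly positive skew suggests the column is worth a closer look for outliers.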

Step 4: Identifying Missing Values

print(df.isnull().sum())

Output:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Handling Missing Values:

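  • Age: Fill with the median, or predict from other features
  • Cabin: Drop the column, or categorize missing entries as ‘Unknown’
  • Embarked: Fill with the mode (‘S’)

# Assign back instead of calling fillna(inplace=True) on a selected column,
# which warns in pandas 2.x and does not modify df under copy-on-write
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Cabin'] = df['Cabin'].fillna('Unknown')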
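Raw counts are easier to act on when shown next to the share of rows they represent, because a column that is mostly empty calls for a different treatment than one with a couple of gaps. The following sketch builds such a table and then applies the fills described above; the toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with gaps in the same three columns (illustrative values).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Missing count and percentage per column, worst first.
missing = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("percent", ascending=False)
print(missing)

# Fill each column and return a new frame, leaving the original untouched.
filled = toy.assign(
    Age=toy["Age"].fillna(toy["Age"].median()),
    Embarked=toy["Embarked"].fillna(toy["Embarked"].mode()[0]),
    Cabin=toy["Cabin"].fillna("Unknown"),
)
print(filled)
```

Using `assign` keeps the unfilled data available for comparison, which is handy while you are still deciding on an imputation strategy.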

Step 5: Checking for Duplicates

print("Number of duplicate rows:", df.duplicated().sum())

Remove duplicates if found:

df.drop_duplicates(inplace=True)
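Exact row duplicates are rare when every row has a unique ID, so it can be worth checking for duplicates on a subset of columns as well. The sketch below uses a made-up frame, and the choice of `Name` and `Ticket` as key columns is an assumption; pick whatever combination should identify a record in your data.

```python
import pandas as pd

# Toy frame: rows 0 and 2 describe the same person under different IDs.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["Ann", "Bob", "Ann", "Cy"],
    "Ticket": ["T1", "T2", "T1", "T3"],
})

# Full-row comparison finds nothing because PassengerId differs.
exact_dupes = toy.duplicated().sum()

# Compare only the columns that should identify a record.
key_cols = ["Name", "Ticket"]
suspects = toy[toy.duplicated(subset=key_cols, keep=False)]  # show every copy
deduped = toy.drop_duplicates(subset=key_cols, keep="first")

print("Exact duplicates:", exact_dupes)
print(suspects)
print("Rows after de-duplication:", len(deduped))
```

Inspecting `suspects` before dropping anything helps avoid removing rows that only look alike.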

Step 6: Analyzing Outliers

sns.boxplot(x=df["Fare"])
plt.show()

If the box plot shows a long right tail, a log transformation compresses the extreme values:

df['Fare'] = np.log1p(df['Fare'])  # log(1 + x); safe for the zero fares in this dataset
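A box plot marks points beyond 1.5 times the interquartile range (IQR) from the quartiles, and the same rule can be applied numerically to count or list the flagged values. This is a minimal sketch on made-up fares; the 1.5 multiplier is the conventional default, not a requirement.

```python
import pandas as pd

# Made-up fares with one extreme value (illustrative only).
fares = pd.Series([7.25, 7.92, 8.05, 8.46, 13.0, 26.55, 71.28, 512.33], name="Fare")

# Quartiles and interquartile range.
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences used by the standard box-plot rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = fares[(fares < lower) | (fares > upper)]
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print("Flagged values:", outliers.tolist())
```

Flagged values are candidates for review, not automatic deletions; a very high fare may be a legitimate first-class ticket.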

Step 7: Analyzing Categorical Variables

1. Count of Each Category

print(df["Sex"].value_counts())
print(df["Embarked"].value_counts())

2. Visualizing Categorical Data

sns.countplot(x=df["Sex"])
plt.show()
sns.countplot(x=df["Embarked"])
plt.show()
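Counts are easier to compare as proportions, and including missing values in the tally shows how much of the column is actually unlabeled. The sketch below uses a made-up series rather than the real Embarked column.

```python
import pandas as pd

# Made-up port codes with one missing entry (illustrative only).
embarked = pd.Series(["S", "C", "S", "Q", None, "S", "C", "S"], name="Embarked")

# normalize=True returns shares instead of counts;
# dropna=False keeps the missing entries as their own category.
shares = embarked.value_counts(normalize=True, dropna=False)
print(shares)

# The most frequent category and its share.
top_category = shares.index[0]
top_share = shares.iloc[0]
print(f"Most common: {top_category} ({top_share:.0%})")
```

A single category that dominates the column, or many categories with tiny shares, are both worth noting before encoding the variable for a model.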

Step 8: Generating a Data Profiling Report

Use ydata-profiling to generate a full summary.

profile = ProfileReport(df, title="Titanic Data Profiling Report", explorative=True)
profile.to_file("titanic_report.html")

The report includes:

  • Overview of data types.
  • Summary statistics.
  • Correlations between variables.
  • Missing values analysis.
  • Outlier detection.
