Exploratory Data Analysis (EDA) is a crucial step in the data analysis process where you analyze the dataset to summarize its main characteristics, often visualizing them to identify patterns, trends, relationships, and anomalies. EDA helps you understand the structure of the data, detect errors, and prepare it for further analysis or modeling.
In this guide, we’ll go through the common techniques and tools used for EDA in Python, leveraging libraries like Pandas, Matplotlib, Seaborn, and NumPy.
1. Steps in Exploratory Data Analysis (EDA)
1.1. Understand the Data Structure
The first step in EDA is to load and inspect the dataset to understand its structure. This involves looking at the number of rows and columns, the types of variables (categorical or numerical), and any missing or duplicate values.
import pandas as pd
# Load the dataset
df = pd.read_csv('data.csv')
# View the first few rows of the dataset
print(df.head())
# View summary statistics of the dataset
print(df.describe())
# Check the data types of each column
df.info()
head(): Displays the first rows of the DataFrame (the first 5 by default).
describe(): Provides summary statistics for numerical columns (count, mean, std, min, quartiles, max).
info(): Displays the number of rows and columns, non-null counts, and the data type of each column. Note that info() prints its report directly, so it is called without print().
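A few other attributes are handy at this stage, such as shape, dtypes, and nunique(). The following minimal sketch uses a small hypothetical dataset with columns (Age, Salary, City) mirroring those used later in this guide:

```python
import pandas as pd

# Small hypothetical dataset mirroring the columns used later in this guide
df = pd.DataFrame({
    'Age': [25, 32, 47, 51],
    'Salary': [40000, 52000, 71000, 88000],
    'City': ['Paris', 'London', 'Paris', 'Berlin'],
})

# Shape as (rows, columns)
print(df.shape)

# Data type of each column
print(df.dtypes)

# Number of distinct values per column
print(df.nunique())
```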
1.2. Handle Missing Data
Data often contains missing values, which need to be identified and handled appropriately. You can handle missing data by either dropping or filling the missing values.
# Check for missing values
print(df.isnull().sum())
# Option 1: drop rows with missing values
df_dropped = df.dropna()
# Option 2: fill missing values with a specific value (e.g., 0)
df_filled = df.fillna(0)
Note that dropping and filling are alternatives: if you drop all rows with missing values first, there is nothing left for fillna() to fill.
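A common middle ground between dropping rows and filling with a constant is to fill a numerical column with a statistic such as its median. A minimal sketch, using a hypothetical 'Age' column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 47, 51]})

# Fill missing values with the column median rather than a constant
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df['Age'].tolist())
```

The median is often preferred over the mean here because it is less sensitive to outliers.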
1.3. Identify Duplicates
Duplicates in the dataset can skew the analysis. You can check for and remove duplicate records.
# Check for duplicate rows
print(df.duplicated().sum())
# Remove duplicate rows
df.drop_duplicates(inplace=True)
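On a small hypothetical example, the two calls behave as follows (duplicated() flags a row only when it matches an earlier row in every column):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'London'],
    'Salary': [40000, 40000, 50000],
})

# The second row is an exact duplicate of the first
print(df.duplicated().sum())

df = df.drop_duplicates()
print(len(df))
```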
2. Univariate Analysis
Univariate analysis involves examining the distribution and statistics of individual variables. For numerical variables, this can involve looking at measures like the mean, median, and distribution, while for categorical variables, it can involve analyzing the frequency of each category.
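Before plotting, simple summary statistics already give a quick univariate view. A sketch with hypothetical 'Age' and 'City' columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 32, 47, 51, 32],
    'City': ['Paris', 'London', 'Paris', 'Berlin', 'Paris'],
})

# Numerical column: centre and spread
print(df['Age'].mean(), df['Age'].median(), df['Age'].std())

# Categorical column: frequency of each category
print(df['City'].value_counts())
```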
2.1. Visualizing Numerical Data
For numerical variables, histograms, boxplots, and density plots are helpful for understanding the distribution and potential outliers.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram for a numerical column
sns.histplot(df['Age'], kde=True)
plt.show()
# Boxplot for identifying outliers
sns.boxplot(x=df['Age'])
plt.show()
2.2. Visualizing Categorical Data
For categorical variables, bar charts are commonly used to understand the frequency of each category.
# Bar plot for a categorical column
sns.countplot(x='City', data=df)
plt.show()
3. Bivariate Analysis
Bivariate analysis involves exploring the relationship between two variables, typically focusing on how one variable changes in relation to another.
3.1. Correlation Between Numerical Variables
You can analyze the correlation between numerical variables using a correlation matrix. This matrix shows the relationships between variables on a scale from -1 to 1.
# Correlation matrix for numerical columns
# numeric_only=True avoids errors when the DataFrame also contains text columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
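The values on the -1 to 1 scale are Pearson correlation coefficients. On a small hypothetical dataset where 'Salary' is an exact linear function of 'Age', the coefficient comes out as 1.0:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [20, 30, 40, 50],
    'Salary': [40000, 50000, 60000, 70000],
})

# Pearson correlation; 1.0 means a perfectly linear positive relationship
corr = df.corr(numeric_only=True)
print(corr.loc['Age', 'Salary'])
```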
3.2. Scatter Plots
For two numerical variables, a scatter plot is useful for visualizing the relationship between them.
# Scatter plot between Age and Salary
sns.scatterplot(x='Age', y='Salary', data=df)
plt.show()
3.3. Pair Plots
A pair plot allows you to visualize relationships between multiple numerical variables at once. It generates scatter plots for each pair of variables, along with histograms on the diagonal.
# Pairplot for multiple numerical columns
sns.pairplot(df[['Age', 'Salary', 'Experience']])
plt.show()
3.4. Categorical vs. Numerical Data
For analyzing the relationship between categorical and numerical variables, you can use boxplots, violin plots, or bar plots.
# Boxplot for numerical vs. categorical data
sns.boxplot(x='City', y='Salary', data=df)
plt.show()
# Violin plot for numerical vs. categorical data
sns.violinplot(x='City', y='Salary', data=df)
plt.show()
4. Multivariate Analysis
Multivariate analysis involves examining the relationships between more than two variables at once. Visualizations such as pair plots, heatmaps, and 3D scatter plots are used to explore interactions between multiple variables.
4.1. Heatmap for Multivariate Correlations
A heatmap can be used to visualize correlations between multiple variables, which is useful in identifying collinearity and patterns.
# Correlation heatmap (numeric_only=True excludes non-numeric columns)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
4.2. Grouped Data Analysis
You can analyze relationships between multiple variables by grouping data and applying aggregation functions such as mean(), sum(), etc.
# Group data by City and calculate mean Salary
grouped_data = df.groupby('City')['Salary'].mean()
print(grouped_data)
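agg() extends this to several statistics in a single pass. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'London', 'London'],
    'Salary': [40000, 60000, 50000, 70000],
})

# Several aggregates per group at once
summary = df.groupby('City')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```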
5. Feature Engineering and Data Transformation
During EDA, you may need to create new features or transform existing features to make them more useful for analysis or modeling.
5.1. Creating New Features
For example, you can create a new feature based on existing ones, such as categorizing age groups.
# Creating a new feature 'Age Group' based on 'Age'
df['Age Group'] = pd.cut(df['Age'],
                         bins=[0, 18, 30, 40, 50, 100],
                         labels=['0-18', '18-30', '30-40', '40-50', '50+'])
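Note that pd.cut uses right-closed intervals by default, so an age of exactly 30 lands in '18-30'. A small hypothetical check, with one age per bin:

```python
import pandas as pd

df = pd.DataFrame({'Age': [10, 22, 35, 45, 60]})
df['Age Group'] = pd.cut(df['Age'],
                         bins=[0, 18, 30, 40, 50, 100],
                         labels=['0-18', '18-30', '30-40', '40-50', '50+'])

# One row falls into each bin
print(df[['Age', 'Age Group']])
```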
5.2. Normalization and Scaling
Numerical features may need to be normalized or scaled, especially for machine learning models. You can use MinMaxScaler or StandardScaler for this purpose.
from sklearn.preprocessing import MinMaxScaler
# Min-Max scaling for numerical columns
scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
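StandardScaler works the same way but standardizes each column to zero mean and unit variance instead of squeezing it into [0, 1]. A minimal sketch with hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [20, 30, 40], 'Salary': [40000, 50000, 60000]})

# Standardize each column to mean 0 and unit variance
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df.round(3))
```

Min-max scaling preserves the shape of the original distribution, while standardization is the usual choice for models that assume roughly centred inputs.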
6. Outlier Detection
Outliers can significantly impact the analysis and results. Identifying outliers during EDA helps in making decisions about how to handle them (either remove or treat them).
6.1. Outlier Detection Using IQR
The interquartile range (IQR) is commonly used to detect outliers: data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers, where Q1 and Q3 are the first and third quartiles.
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Detect outliers
outliers = df[(df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR))]
print(outliers)
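Once the fences are computed, one common treatment is simply to filter the outliers out. A self-contained sketch with hypothetical salaries, where one extreme value falls outside the fences:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [40000, 42000, 45000, 47000, 500000]})

Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Keep only rows inside the IQR fences
df_clean = df[df['Salary'].between(lower, upper)]
print(len(df_clean))
```

Whether to remove, cap, or keep outliers depends on the domain: a data-entry error and a genuinely exceptional observation call for different treatments.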
