Pandas for Data Manipulation: A Comprehensive Guide
1. Introduction to Pandas
Pandas (Python Data Analysis Library) is a powerful and flexible tool for data manipulation, analysis, and cleaning. It is widely used in data science, machine learning, and business analytics to handle structured data efficiently.
✅ Why Use Pandas for Data Manipulation?
✔ Handles structured data (CSV, Excel, SQL, JSON, etc.)
✔ Provides DataFrames and Series (tabular data like Excel)
✔ Supports fast filtering, sorting, and grouping
✔ Has built-in functions for statistics, aggregation, and transformation
✔ Works well with NumPy, Matplotlib, Seaborn, and Scikit-Learn
📌 Use Cases of Pandas:
- Data Cleaning – Handling missing values, duplicates, outliers
- Data Wrangling – Filtering, merging, reshaping datasets
- Exploratory Data Analysis (EDA) – Understanding dataset structure
- Data Transformation – Aggregation, applying custom functions
- Financial Analysis – Time-series data processing
- Machine Learning Preprocessing – Preparing structured data for ML models
2. Installing and Importing Pandas
A. Install Pandas
pip install pandas
B. Import Pandas
import pandas as pd
print(pd.__version__) # Check the installed version
3. Pandas Data Structures: Series & DataFrames
A. Series (1D Data Structure)
A Series is a one-dimensional labeled array, similar to a column in Excel.
import pandas as pd
# Creating a Pandas Series
data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)
📌 Key Features:
✔ Like a NumPy array but with index labels
✔ Supports mathematical operations
✔ Used for handling single columns of data
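The features above can be seen in a short sketch: label-based access, positional access, and vectorized arithmetic all work on the same Series (values and labels here are illustrative).

```python
import pandas as pd

# A Series behaves like a labeled NumPy array
s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

print(s['B'])            # access by label
print(s.iloc[0])         # access by position
print((s * 2).tolist())  # vectorized arithmetic preserves the labels
```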
B. DataFrame (2D Data Structure)
A DataFrame is a tabular, two-dimensional structure (rows & columns).
# Creating a Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)
📌 Key Features:
✔ Rows and columns with labels (index & column names)
✔ Can store different data types (int, float, string, etc.)
✔ Supports data filtering, aggregation, and transformation
4. Loading and Exporting Data
A. Loading Data from Files
📌 Read CSV (Comma-Separated Values)
df = pd.read_csv("data.csv")
print(df.head()) # Show first 5 rows
📌 Read Excel File
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
print(df.head())
📌 Read JSON File
df = pd.read_json("data.json")
📌 Read SQL Database
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM employees", conn)
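The snippet above assumes a `database.db` file already exists. A self-contained sketch using an in-memory SQLite database (the `employees` table and its columns are made up for illustration) shows the same pattern end to end:

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the example runs anywhere
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 'Alice'), (2, 'Bob')])

# read_sql_query returns the result set as a DataFrame
df = pd.read_sql_query("SELECT * FROM employees", conn)
print(df)
conn.close()
```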
B. Exporting Data to Files
📌 Save as CSV
df.to_csv("output.csv", index=False)
📌 Save as Excel
df.to_excel("output.xlsx", sheet_name="Sheet1", index=False)
📌 Save as JSON
df.to_json("output.json")
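A quick way to sanity-check the export functions is a round trip: write a DataFrame to CSV and read it back. This sketch uses a temporary directory so it doesn't clutter the working folder.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to a temporary CSV, then read it back to confirm a lossless round trip
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path)
print(df2.equals(df))  # True when values and dtypes survive the round trip
```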
5. Exploring and Understanding Data
# Display first & last rows
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
# Get DataFrame info
print(df.info())
# Summary statistics
print(df.describe())
# Check shape (rows, columns)
print(df.shape)
# Check column names
print(df.columns)
# Check data types
print(df.dtypes)
📌 Key Insights:
✔ describe() → Shows a statistical summary (mean, std, min, max, etc.)
✔ info() → Displays memory usage & data types
✔ columns → Lists all column names
6. Selecting & Filtering Data
A. Selecting Columns
# Select a single column
print(df["Name"])
# Select multiple columns
print(df[["Name", "Salary"]])
B. Selecting Rows Using iloc & loc
# Select rows by index position
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Rows from index 1 to 2
# Select rows by index label
print(df.loc[0]) # Row with index label 0 (the first row here, since the default index starts at 0)
print(df.loc[df["Age"] > 30]) # Filter rows where Age > 30
📌 Key Differences:
✔ iloc[] → Selects by position (numeric index)
✔ loc[] → Selects by label (row/column names)
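The position-vs-label distinction only becomes visible once the index labels differ from the positions. A minimal sketch with a non-default index (the labels 10/20/30 are arbitrary):

```python
import pandas as pd

# With a non-default index, iloc and loc clearly diverge
df = pd.DataFrame({'Age': [25, 30, 35]}, index=[10, 20, 30])

print(df.iloc[0])              # position 0 -> the row labeled 10
print(df.loc[10])              # label 10   -> the same row
print(df.loc[df['Age'] > 28])  # loc also accepts a boolean mask
```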
7. Handling Missing Data
A. Detecting Missing Values
print(df.isnull().sum()) # Count missing values in each column
B. Filling Missing Values
df.fillna(0, inplace=True) # Replace NaN with 0
# Alternatively, replace NaN with each numeric column's mean
# (numeric_only=True avoids errors on text columns in recent pandas versions)
df.fillna(df.mean(numeric_only=True), inplace=True)
C. Dropping Missing Values
df.dropna(inplace=True) # Remove rows with NaN values
📌 Best Practices:
✔ Use fillna() when you need to replace missing values
✔ Use dropna() if you want to remove rows with missing values
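Both strategies can be compared side by side on a small DataFrame with deliberate gaps (column names are illustrative). Note that `fillna()` also accepts a per-column dictionary, which is handy when numeric and text columns need different replacements:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, np.nan, 35], 'City': ['NY', 'LA', None]})

# Count missing values per column
print(df.isnull().sum())

# fillna: per-column replacements (column mean for Age, a sentinel for City)
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})

# dropna: remove any row that has a missing value
dropped = df.dropna()
print(len(filled), len(dropped))
```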
8. Sorting & Grouping Data
A. Sorting Data
df.sort_values(by="Salary", ascending=False, inplace=True)
B. Grouping Data
grouped = df.groupby("Department")["Salary"].mean()
print(grouped)
📌 Use Cases:
✔ sort_values() → Sorts data by column values
✔ groupby() → Aggregates data (sum, mean, count, etc.)
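Sorting and grouping combine naturally: group, aggregate, then sort the result. A minimal sketch (department names and salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [70000, 50000, 80000, 55000],
})

# Mean salary per department, highest first
grouped = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print(grouped)
```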
9. Applying Functions to Data
A. Using apply() for Column-Wise Operations
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1) # Increase salary by 10%
B. Using map() for Single-Column Transformations
df["Name"] = df["Name"].map(str.upper) # Convert names to uppercase
📌 Use Cases:
✔ apply() → Used for complex row/column transformations
✔ map() → Used for simple element-wise transformations
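One thing `apply()` can do that `map()` cannot is operate row-wise across several columns via `axis=1`. A short sketch contrasting the two (the `Bonus` column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['alice', 'bob'], 'Salary': [50000, 60000]})

# map: simple element-wise transform on a single column
df['Name'] = df['Name'].map(str.upper)

# apply with axis=1: a row-wise rule that can read multiple columns
df['Bonus'] = df.apply(lambda row: row['Salary'] * 0.1, axis=1)
print(df)
```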
10. Merging, Joining & Concatenating Data
A. Merging Two DataFrames (SQL-Style Joins)
merged_df = pd.merge(df1, df2, on="EmployeeID", how="inner")
📌 Join Types:
✔ inner → Common records only
✔ left → Keep all records from df1
✔ right → Keep all records from df2
✔ outer → Keep all records from both
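The join types are easiest to see with two tiny DataFrames whose keys only partially overlap (the `EmployeeID` values here are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'EmployeeID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# inner keeps only the overlapping keys; outer keeps everything from both
inner = pd.merge(df1, df2, on='EmployeeID', how='inner')  # IDs 2, 3
outer = pd.merge(df1, df2, on='EmployeeID', how='outer')  # IDs 1-4
print(len(inner), len(outer))
```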
B. Concatenating Multiple DataFrames
concat_df = pd.concat([df1, df2], axis=0) # Stack rows
📌 Use Cases:
✔ Merging → Combine data from different sources
✔ Concatenation → Append new data
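When stacking rows, passing `ignore_index=True` avoids duplicate index labels in the result, which is usually what you want when appending new data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [25]})
df2 = pd.DataFrame({'Name': ['Bob'], 'Age': [30]})

# Stack rows; ignore_index rebuilds a clean 0..n-1 index
concat_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concat_df)
```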
11. Summary
✔ Pandas is essential for data manipulation & analysis
✔ DataFrames provide an efficient way to handle structured data
✔ Supports filtering, grouping, sorting, and transformation
✔ Used for machine learning preprocessing and business analytics
📌 Next Steps:
✅ Learn Data Visualization with Pandas & Matplotlib
✅ Use Pandas for Machine Learning Preprocessing
✅ Work on real-world datasets (Kaggle, UCI, etc.)
Need hands-on Pandas projects? Let me know!