Pandas for Data Manipulation: A Comprehensive Guide
1. Introduction to Pandas
Pandas (Python Data Analysis Library) is a powerful and flexible tool for data manipulation, analysis, and cleaning. It is widely used in data science, machine learning, and business analytics to handle structured data efficiently.
✅ Why Use Pandas for Data Manipulation?
✔ Handles structured data (CSV, Excel, SQL, JSON, etc.)
✔ Provides DataFrames and Series (tabular data like Excel)
✔ Supports fast filtering, sorting, and grouping
✔ Has built-in functions for statistics, aggregation, and transformation
✔ Works well with NumPy, Matplotlib, Seaborn, and Scikit-Learn
📌 Use Cases of Pandas:
- Data Cleaning – Handling missing values, duplicates, outliers
- Data Wrangling – Filtering, merging, reshaping datasets
- Exploratory Data Analysis (EDA) – Understanding dataset structure
- Data Transformation – Aggregation, applying custom functions
- Financial Analysis – Time-series data processing
- Machine Learning Preprocessing – Preparing structured data for ML models
2. Installing and Importing Pandas
A. Install Pandas
pip install pandas
B. Import Pandas
import pandas as pd
print(pd.__version__) # Check the installed version
3. Pandas Data Structures: Series & DataFrames
A. Series (1D Data Structure)
A Series is a one-dimensional labeled array, similar to a column in Excel.
import pandas as pd
# Creating a Pandas Series
data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)
📌 Key Features:
✔ Like a NumPy array but with index labels
✔ Supports mathematical operations
✔ Used for handling single columns of data
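The features above can be seen in a short sketch: label-based access, positional access, and vectorized arithmetic all work on the same Series (values and labels here are illustrative).

```python
import pandas as pd

# A Series behaves like a labeled NumPy array
s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

print(s['B'])            # access by label
print(s.iloc[0])         # access by position
print((s * 2).tolist())  # vectorized arithmetic preserves the labels
```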
B. DataFrame (2D Data Structure)
A DataFrame is a tabular, two-dimensional structure (rows & columns).
# Creating a Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)
📌 Key Features:
✔ Rows and columns with labels (index & column names)
✔ Can store different data types (int, float, string, etc.)
✔ Supports data filtering, aggregation, and transformation
4. Loading and Exporting Data
A. Loading Data from Files
📌 Read CSV (Comma-Separated Values)
df = pd.read_csv("data.csv")
print(df.head()) # Show first 5 rows
📌 Read Excel File
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
print(df.head())
📌 Read JSON File
df = pd.read_json("data.json")
📌 Read SQL Database
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM employees", conn)
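The snippet above assumes a `database.db` file already exists. A self-contained sketch using an in-memory SQLite database (the `employees` table and its columns are made up for illustration) shows the same pattern end to end:

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database so the example runs anywhere
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, 'Alice'), (2, 'Bob')])

# read_sql_query returns the result set as a DataFrame
df = pd.read_sql_query("SELECT * FROM employees", conn)
print(df)
conn.close()
```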
B. Exporting Data to Files
📌 Save as CSV
df.to_csv("output.csv", index=False)
📌 Save as Excel
df.to_excel("output.xlsx", sheet_name="Sheet1", index=False)
📌 Save as JSON
df.to_json("output.json")
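A quick way to sanity-check the export functions is a round trip: write a DataFrame to CSV and read it back. This sketch uses a temporary directory so it doesn't clutter the working folder.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to a temporary CSV, then read it back to confirm a lossless round trip
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path)
print(df2.equals(df))  # True when values and dtypes survive the round trip
```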
5. Exploring and Understanding Data
# Display first & last rows
print(df.head()) # First 5 rows
print(df.tail()) # Last 5 rows
# Get DataFrame info
print(df.info())
# Summary statistics
print(df.describe())
# Check shape (rows, columns)
print(df.shape)
# Check column names
print(df.columns)
# Check data types
print(df.dtypes)
📌 Key Insights:
✔ describe() → Shows a statistical summary (mean, std, min, max, etc.)
✔ info() → Displays memory usage & data types
✔ columns → Lists all column names
6. Selecting & Filtering Data
A. Selecting Columns
# Select a single column
print(df["Name"])
# Select multiple columns
print(df[["Name", "Salary"]])
B. Selecting Rows Using iloc & loc
# Select rows by index position
print(df.iloc[0]) # First row
print(df.iloc[1:3]) # Rows from index 1 to 2
# Select rows by index label
print(df.loc[0]) # Row with index label 0 (the first row here, since the default index starts at 0)
print(df.loc[df["Age"] > 30]) # Filter rows where Age > 30
📌 Key Differences:
✔ iloc[] → Selects by position (numeric index)
✔ loc[] → Selects by label (row/column names)
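The position-vs-label distinction only becomes visible once the index labels differ from the positions. A minimal sketch with a non-default index (the labels 10/20/30 are arbitrary):

```python
import pandas as pd

# With a non-default index, iloc and loc clearly diverge
df = pd.DataFrame({'Age': [25, 30, 35]}, index=[10, 20, 30])

print(df.iloc[0])              # position 0 -> the row labeled 10
print(df.loc[10])              # label 10   -> the same row
print(df.loc[df['Age'] > 28])  # loc also accepts a boolean mask
```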
7. Handling Missing Data
A. Detecting Missing Values
print(df.isnull().sum()) # Count missing values in each column
B. Filling Missing Values
df.fillna(0, inplace=True) # Replace NaN with 0
# Alternatively, replace NaN with each numeric column's mean
# (numeric_only=True avoids errors on text columns in recent pandas versions)
df.fillna(df.mean(numeric_only=True), inplace=True)
C. Dropping Missing Values
df.dropna(inplace=True) # Remove rows with NaN values
📌 Best Practices:
✔ Use fillna() when you need to replace missing values
✔ Use dropna() if you want to remove rows with missing values
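Both strategies can be compared side by side on a small DataFrame with deliberate gaps (column names are illustrative). Note that `fillna()` also accepts a per-column dictionary, which is handy when numeric and text columns need different replacements:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, np.nan, 35], 'City': ['NY', 'LA', None]})

# Count missing values per column
print(df.isnull().sum())

# fillna: per-column replacements (column mean for Age, a sentinel for City)
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})

# dropna: remove any row that has a missing value
dropped = df.dropna()
print(len(filled), len(dropped))
```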
8. Sorting & Grouping Data
A. Sorting Data
df.sort_values(by="Salary", ascending=False, inplace=True)
B. Grouping Data
grouped = df.groupby("Department")["Salary"].mean()
print(grouped)
📌 Use Cases:
✔ sort_values() → Sorts data by column values
✔ groupby() → Aggregates data (sum, mean, count, etc.)
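Sorting and grouping combine naturally: group, aggregate, then sort the result. A minimal sketch (department names and salaries are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [70000, 50000, 80000, 55000],
})

# Mean salary per department, highest first
grouped = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print(grouped)
```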
9. Applying Functions to Data
A. Using apply() for Column-Wise Operations
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1) # Increase salary by 10%
B. Using map() for Single-Column Transformations
df["Name"] = df["Name"].map(str.upper) # Convert names to uppercase
📌 Use Cases:
✔ apply() → Used for complex row/column transformations
✔ map() → Used for simple element-wise transformations
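One thing `apply()` can do that `map()` cannot is operate row-wise across several columns via `axis=1`. A short sketch contrasting the two (the `Bonus` column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['alice', 'bob'], 'Salary': [50000, 60000]})

# map: simple element-wise transform on a single column
df['Name'] = df['Name'].map(str.upper)

# apply with axis=1: a row-wise rule that can read multiple columns
df['Bonus'] = df.apply(lambda row: row['Salary'] * 0.1, axis=1)
print(df)
```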
10. Merging, Joining & Concatenating Data
A. Merging Two DataFrames (SQL-Style Joins)
merged_df = pd.merge(df1, df2, on="EmployeeID", how="inner")
📌 Join Types:
✔ inner → Common records only
✔ left → Keep all records from df1
✔ right → Keep all records from df2
✔ outer → Keep all records from both
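The join types are easiest to see with two tiny DataFrames whose keys only partially overlap (the `EmployeeID` values here are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'EmployeeID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# inner keeps only the overlapping keys; outer keeps everything from both
inner = pd.merge(df1, df2, on='EmployeeID', how='inner')  # IDs 2, 3
outer = pd.merge(df1, df2, on='EmployeeID', how='outer')  # IDs 1-4
print(len(inner), len(outer))
```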
B. Concatenating Multiple DataFrames
concat_df = pd.concat([df1, df2], axis=0) # Stack rows
📌 Use Cases:
✔ Merging → Combine data from different sources
✔ Concatenation → Append new data
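When stacking rows, passing `ignore_index=True` avoids duplicate index labels in the result, which is usually what you want when appending new data:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice'], 'Age': [25]})
df2 = pd.DataFrame({'Name': ['Bob'], 'Age': [30]})

# Stack rows; ignore_index rebuilds a clean 0..n-1 index
concat_df = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concat_df)
```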
11. Summary
✔ Pandas is essential for data manipulation & analysis
✔ DataFrames provide an efficient way to handle structured data
✔ Supports filtering, grouping, sorting, and transformation
✔ Used for machine learning preprocessing and business analytics
📌 Next Steps:
✅ Learn Data Visualization with Pandas & Matplotlib
✅ Use Pandas for Machine Learning Preprocessing
✅ Work on real-world datasets (Kaggle, UCI, etc.)
Need hands-on Pandas projects? Let me know!