Pandas for Data Manipulation: A Comprehensive Guide

1. Introduction to Pandas

Pandas (Python Data Analysis Library) is a powerful and flexible tool for data manipulation, analysis, and cleaning. It is widely used in data science, machine learning, and business analytics to handle structured data efficiently.

Why Use Pandas for Data Manipulation?
✔ Handles structured data (CSV, Excel, SQL, JSON, etc.)
✔ Provides DataFrames and Series (tabular data like Excel)
✔ Supports fast filtering, sorting, and grouping
✔ Has built-in functions for statistics, aggregation, and transformation
✔ Works well with NumPy, Matplotlib, Seaborn, and Scikit-Learn

📌 Use Cases of Pandas:

  • Data Cleaning – Handling missing values, duplicates, outliers
  • Data Wrangling – Filtering, merging, reshaping datasets
  • Exploratory Data Analysis (EDA) – Understanding dataset structure
  • Data Transformation – Aggregation, applying custom functions
  • Financial Analysis – Time-series data processing
  • Machine Learning Preprocessing – Preparing structured data for ML models

2. Installing and Importing Pandas

A. Install Pandas

pip install pandas

B. Import Pandas

import pandas as pd
print(pd.__version__)  # Check the installed version

3. Pandas Data Structures: Series & DataFrames

A. Series (1D Data Structure)

A Series is a one-dimensional labeled array, similar to a column in Excel.

import pandas as pd

# Creating a Pandas Series
data = [10, 20, 30, 40]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)

📌 Key Features:
✔ Like a NumPy array but with index labels
✔ Supports mathematical operations
✔ Used for handling single columns of data
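A minimal sketch of those features in action, using the same values as above (variable names are illustrative):

```python
import pandas as pd

# A Series: 1D labeled array
s = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Vectorized arithmetic applies element-wise, keeping the labels
doubled = s * 2

# Label-based access works like a dict; aggregations are built in
total = s.sum()
print(doubled['B'], total)  # 40 100
```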


B. DataFrame (2D Data Structure)

A DataFrame is a tabular, two-dimensional structure (rows & columns).

# Creating a Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 70000]}
df = pd.DataFrame(data)
print(df)

📌 Key Features:
✔ Rows and columns with labels (index & column names)
✔ Can store different data types (int, float, string, etc.)
✔ Supports data filtering, aggregation, and transformation
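A quick sketch showing that each column of a DataFrame keeps its own dtype and supports column-wise aggregation (the data below is illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],   # strings (object dtype)
    'Age': [25, 30, 35],                   # integers
    'Salary': [50000.0, 60000.0, 70000.0]  # floats
})

# Each column has its own dtype
print(df_demo.dtypes)

# Simple column-wise aggregation
print(df_demo['Salary'].mean())  # 60000.0
```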


4. Loading and Exporting Data

A. Loading Data from Files

📌 Read CSV (Comma-Separated Values)

df = pd.read_csv("data.csv")
print(df.head())  # Show first 5 rows

📌 Read Excel File

df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
print(df.head())

📌 Read JSON File

df = pd.read_json("data.json")

📌 Read SQL Database

import sqlite3

conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM employees", conn)
conn.close()  # Close the connection when you are done

B. Exporting Data to Files

📌 Save as CSV

df.to_csv("output.csv", index=False)

📌 Save as Excel

df.to_excel("output.xlsx", sheet_name="Sheet1", index=False)

📌 Save as JSON

df.to_json("output.json")
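To see loading and exporting work together without touching the filesystem, here is a hedged round-trip sketch that writes CSV to an in-memory buffer (`io.StringIO`) and reads it back:

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Rewind and read it back; the data survives the round trip
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back.equals(df_out))  # True
```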

5. Exploring and Understanding Data

# Display first & last rows
print(df.head())  # First 5 rows
print(df.tail())  # Last 5 rows

# Get DataFrame info
print(df.info())

# Summary statistics
print(df.describe())

# Check shape (rows, columns)
print(df.shape)

# Check column names
print(df.columns)

# Check data types
print(df.dtypes)

📌 Key Insights:
describe() → Shows statistical summary (mean, std, min, max, etc.)
info() → Displays memory usage & data types
columns → Lists all column names
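A self-contained sketch of these inspection methods on a small illustrative DataFrame:

```python
import pandas as pd

df_sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

print(df_sample.shape)           # (4, 3) -> 4 rows, 3 columns
print(list(df_sample.columns))   # ['Name', 'Age', 'Salary']

# describe() summarizes numeric columns only by default
stats = df_sample.describe()
print(stats.loc['mean', 'Age'])  # 29.5
```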


6. Selecting & Filtering Data

A. Selecting Columns

# Select a single column
print(df["Name"])

# Select multiple columns
print(df[["Name", "Salary"]])

B. Selecting Rows Using iloc & loc

# Select rows by integer position (slicing is end-exclusive)
print(df.iloc[0])    # First row
print(df.iloc[1:3])  # Rows at positions 1 and 2

# Select rows by index label (with the default index, labels are 0, 1, 2, ...)
print(df.loc[0])  # Row with index label 0
print(df.loc[df["Age"] > 30])  # Filter rows where Age > 30

📌 Key Differences:
iloc[] → Selects by position (numeric index)
loc[] → Selects by label (row/column names)
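The difference only becomes visible when the index labels are not 0, 1, 2, ... — a small sketch with string labels (data is illustrative):

```python
import pandas as pd

df_ix = pd.DataFrame({'Age': [25, 30, 35]}, index=['x', 'y', 'z'])

# iloc: purely positional
first_by_pos = df_ix.iloc[0]   # row at position 0, i.e. label 'x'

# loc: label-based
row_y = df_ix.loc['y']         # row with label 'y'

# Boolean filtering goes through loc (or plain bracket indexing)
older = df_ix.loc[df_ix['Age'] > 28]

print(first_by_pos['Age'], row_y['Age'], len(older))  # 25 30 2
```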


7. Handling Missing Data

A. Detecting Missing Values

print(df.isnull().sum())  # Count missing values in each column

B. Filling Missing Values

# Option 1: replace NaN with a constant
df.fillna(0, inplace=True)

# Option 2: replace NaN with each column's mean (numeric columns only)
df.fillna(df.mean(numeric_only=True), inplace=True)

C. Dropping Missing Values

df.dropna(inplace=True)  # Remove rows with NaN values

📌 Best Practices:
✔ Use fillna() when you need to replace missing values
✔ Use dropna() if you want to remove missing rows
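Both strategies on a small illustrative frame with one missing number and one missing string:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'Age': [25, np.nan, 35], 'City': ['NY', 'LA', None]})

# Count missing values per column (1 in Age, 1 in City)
print(df_na.isnull().sum())

# Fill the numeric gap with the column mean (30.0 here)
filled = df_na.fillna({'Age': df_na['Age'].mean()})

# Or drop every row that contains any NaN
dropped = df_na.dropna()
print(filled['Age'].tolist(), len(dropped))  # [25.0, 30.0, 35.0] 1
```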


8. Sorting & Grouping Data

A. Sorting Data

df.sort_values(by="Salary", ascending=False, inplace=True)

B. Grouping Data

grouped = df.groupby("Department")["Salary"].mean()
print(grouped)

📌 Use Cases:
sort_values() → Sorts data by column values
groupby() → Aggregates data (sum, mean, count, etc.)
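A runnable version of the grouping above, with a hypothetical Department column since the earlier example frame does not have one:

```python
import pandas as pd

df_emp = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Salary': [70000, 50000, 80000, 52000]
})

# Mean salary per department
mean_by_dept = df_emp.groupby('Department')['Salary'].mean()
print(mean_by_dept['IT'])  # 75000.0
print(mean_by_dept['HR'])  # 51000.0
```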


9. Applying Functions to Data

A. Using apply() for Column-Wise Operations

df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)  # Increase salary by 10%

B. Using map() for Single Column Transformation

df["Name"] = df["Name"].map(str.upper)  # Convert names to uppercase

📌 Use Cases:
apply() → Used for complex row/column transformations
map() → Used for simple transformations
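Both transformations together in a self-contained sketch (data is illustrative):

```python
import pandas as pd

df_fn = pd.DataFrame({'Name': ['alice', 'bob'], 'Salary': [50000, 60000]})

# apply(): run an arbitrary function on each element
# (or on each row, with axis=1 on the DataFrame)
df_fn['Salary'] = df_fn['Salary'].apply(lambda x: x * 1.1)

# map(): simple element-wise transformation on a Series
df_fn['Name'] = df_fn['Name'].map(str.upper)

print(df_fn['Name'].tolist())  # ['ALICE', 'BOB']
print(df_fn['Salary'].tolist())
```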


10. Merging, Joining & Concatenating Data

A. Merging Two DataFrames (SQL-Style Joins)

merged_df = pd.merge(df1, df2, on="EmployeeID", how="inner")

📌 Join Types:
inner → Common records only
left → Keep all records from df1
right → Keep all records from df2
outer → Keep all records from both


B. Concatenating Multiple DataFrames

concat_df = pd.concat([df1, df2], axis=0)  # Stack rows

📌 Use Cases:
Merging → Combine data from different sources
Concatenation → Append new data
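The join types and concatenation can be compared on two small illustrative frames (df1/df2 above are not defined in this guide, so the names here are placeholders):

```python
import pandas as pd

df_a = pd.DataFrame({'EmployeeID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Carol']})
df_b = pd.DataFrame({'EmployeeID': [2, 3, 4], 'Salary': [60000, 70000, 80000]})

# Inner join: only EmployeeIDs present in both frames (2 and 3)
inner = pd.merge(df_a, df_b, on='EmployeeID', how='inner')
print(len(inner))  # 2

# Outer join: all EmployeeIDs (1-4), with NaN where data is missing
outer = pd.merge(df_a, df_b, on='EmployeeID', how='outer')
print(len(outer))  # 4

# Concatenation simply stacks rows
stacked = pd.concat([df_a, df_a], axis=0, ignore_index=True)
print(len(stacked))  # 6
```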


11. Summary

Pandas is essential for data manipulation & analysis
DataFrames provide an efficient way to handle structured data
Supports filtering, grouping, sorting, and transformation
Used for machine learning preprocessing and business analytics

📌 Next Steps:
✅ Learn Data Visualization with Pandas & Matplotlib
✅ Use Pandas for Machine Learning Preprocessing
✅ Work on real-world datasets (Kaggle, UCI, etc.)

Need hands-on Pandas projects? Let me know!
