APIs (Application Programming Interfaces) allow applications to communicate and exchange data efficiently. For data extraction, APIs provide structured access to real-time and large datasets from various sources like social media, finance, weather, and e-commerce.
Why Use APIs for Data Extraction?
✅ Faster & More Reliable than Web Scraping
✅ Provides Structured Data (JSON/XML)
✅ Fewer Legal & Ethical Concerns than Scraping
✅ Access to Real-time & Historical Data
2. Understanding API Types
Before extracting data, it’s essential to understand the different API types:
| API Type | Description | Example |
|---|---|---|
| REST API | Uses HTTP methods (GET, POST, etc.) and returns JSON/XML | Twitter API, OpenWeatherMap |
| SOAP API | Uses XML-based messaging | Payment gateways, banking APIs |
| GraphQL API | Client requests only the specific fields it needs | GitHub API, Shopify API |
| WebSocket API | Provides real-time data streaming | Binance API (crypto), stock APIs |
Most modern APIs use REST or GraphQL.
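For instance, a GraphQL request is usually a single POST whose body names exactly the fields to return. A minimal sketch against GitHub's GraphQL API (the endpoint and `viewer` query are GitHub's; the token is a placeholder):
import requests
# One POST request; the query string itself lists the fields we want back.
query = "{ viewer { login name } }"
headers = {"Authorization": "Bearer your_token"}
response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},  # GraphQL sends the query in the JSON body
    headers=headers,
)
print(response.json())  # Only the requested fields come back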
3. Setting Up API Requests in Python
3.1. Using the `requests` Library
Install `requests` if it is not already installed:
pip install requests
3.2. Sending a Simple API Request
Example: Fetching weather data from OpenWeatherMap API
import requests
API_KEY = "your_api_key"
url = f"https://api.openweathermap.org/data/2.5/weather?q=London&appid={API_KEY}"
response = requests.get(url)
if response.status_code == 200:
    data = response.json()
    print(data)  # Display JSON data
else:
    print("Error:", response.status_code)
🔹 Key Points:
- Uses a GET request to retrieve data
- API key authentication is required
- `response.json()` converts the API response into a Python dictionary
4. Handling Authentication
Most APIs require authentication to prevent misuse.
4.1. API Key Authentication
The most common method: the key is sent either in a request header or as a query parameter.
headers = {"Authorization": "Bearer your_api_key"}
response = requests.get("https://api.example.com/data", headers=headers)
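The query-parameter variant mentioned above looks like this; the parameter name `api_key` is provider-specific and used here only as an illustration:
import requests
# API key passed as a query parameter instead of a header.
params = {"api_key": "your_api_key"}
response = requests.get("https://api.example.com/data", params=params)
print(response.url)  # The key is appended to the URL as ?api_key=...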
4.2. OAuth 2.0 Authentication
Used for APIs like Twitter, GitHub, and Google APIs.
Example: Fetching GitHub user details using OAuth Token
headers = {"Authorization": "token your_oauth_token"}
response = requests.get("https://api.github.com/user", headers=headers)
print(response.json())
5. Extracting and Processing API Data
5.1. Handling JSON Responses
Most APIs return data in JSON format. Extracting specific fields:
response = requests.get("https://api.github.com/users/octocat")
data = response.json()
print("Username:", data["login"])
print("Public Repos:", data["public_repos"])
5.2. Handling Errors Gracefully
Common API Errors & Fixes
| Status Code | Meaning | Solution |
|---|---|---|
| 200 | Success | Everything is fine |
| 400 | Bad Request | Check request parameters |
| 401 | Unauthorized | Check API key/authentication |
| 403 | Forbidden | Insufficient permissions |
| 429 | Too Many Requests | Implement rate limiting |
| 500+ | Server Error | Try again later |
Example: Handling errors in API requests
import requests
response = requests.get("https://api.example.com/data")
if response.status_code == 200:
    data = response.json()
elif response.status_code == 401:
    print("Unauthorized! Check your API key.")
elif response.status_code == 429:
    print("Rate limit exceeded! Try again later.")
else:
    print("API Error:", response.status_code)
6. API Pagination for Large Datasets
Most APIs limit the amount of data returned per request (e.g., 100 results per page).
Solution: Use pagination to fetch more data.
6.1. Handling Pagination with the `page` Parameter
Example: Fetching multiple pages of data
import requests
API_URL = "https://api.example.com/data"
all_data = []
page = 1
while True:
    response = requests.get(API_URL, params={"page": page})
    data = response.json()
    if not data:  # Stop when no more data is returned
        break
    all_data.extend(data)
    page += 1  # Go to the next page
print("Total items:", len(all_data))
6.2. Handling `next` Links in Pagination
Some APIs return a “next” URL in responses:
import requests
API_URL = "https://api.example.com/data"
data_list = []
while API_URL:
    response = requests.get(API_URL)
    data = response.json()
    data_list.extend(data["results"])
    API_URL = data.get("next")  # Next page URL, or None when finished
print("Total records:", len(data_list))
Efficient for large datasets!
7. Rate Limiting & Throttling
APIs limit requests to prevent overload.
7.1. Handling API Rate Limits
- Check the X-RateLimit headers in responses (sketched after the example below).
- Add delays to prevent exceeding limits.
import time
import requests
for i in range(10):  # Example loop making repeated requests
    response = requests.get("https://api.example.com/data")
    if response.status_code == 429:
        print("Rate limit hit! Waiting...")
        time.sleep(60)  # Wait before retrying
    else:
        print(response.json())
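The header check mentioned above can look like the sketch below. Header names vary by provider; the X-RateLimit-Remaining / X-RateLimit-Reset names follow GitHub's convention and are an assumption for other APIs:
import time
import requests
response = requests.get("https://api.example.com/data")
# Header names follow GitHub's convention; other providers may use different ones.
remaining = response.headers.get("X-RateLimit-Remaining")
reset = response.headers.get("X-RateLimit-Reset")  # Unix timestamp when the quota resets
if remaining is not None and reset is not None and int(remaining) == 0:
    wait = max(0, int(reset) - int(time.time()))
    print(f"Quota exhausted, sleeping {wait}s until reset...")
    time.sleep(wait)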
8. Storing API Data in Databases
Extracted data is often stored in CSV, JSON, or databases.
8.1. Save API Data as a CSV
import csv
data = [{"id": 1, "name": "John"}, {"id": 2, "name": "Alice"}]
with open("data.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(data)
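Since JSON is also mentioned above as a storage option, here is a minimal sketch for saving the same records as a JSON file:
import json
data = [{"id": 1, "name": "John"}, {"id": 2, "name": "Alice"}]
# Write the records to a JSON file; indent=2 is only for readability.
with open("data.json", "w") as file:
    json.dump(data, file, indent=2)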
8.2. Save API Data to a Database (SQLite)
import sqlite3
conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
data = [(1, "John"), (2, "Alice")]
cursor.executemany("INSERT INTO users VALUES (?, ?)", data)
conn.commit()
conn.close()
Efficient for handling large API data!
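To confirm the insert worked, a quick read-back from the same database (assuming the data.db file created above):
import sqlite3
# Re-open the database created above and print the stored rows.
conn = sqlite3.connect("data.db")
cursor = conn.cursor()
cursor.execute("SELECT id, name FROM users")
for row in cursor.fetchall():
    print(row)
conn.close()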
9. Working with Real APIs (Examples)
✅ Twitter API (Tweepy) → Fetch tweets
✅ Google Maps API → Get locations
✅ Alpha Vantage API → Stock market data
✅ NASA API → Space images
✅ OpenWeatherMap API → Weather updates
Example: Fetching NASA's Astronomy Picture of the Day
import requests
API_KEY = "DEMO_KEY"
url = f"https://api.nasa.gov/planetary/apod?api_key={API_KEY}"
response = requests.get(url)
data = response.json()
print("Title:", data["title"])
print("URL:", data["url"])