Storing Scraped Data in Databases

After scraping data from websites, the next step is storing it efficiently in a database for future use. Whether you’re handling small datasets or big data, choosing the right database and structure is crucial for scalability and performance.

In this guide, we’ll cover:

  • Choosing the right database (SQL vs. NoSQL)
  • Setting up a database
  • Storing scraped data using Python
  • Best practices for database storage


1. Choosing the Right Database

There are two main types of databases for storing scraped data:

A. SQL Databases (Structured Data)

Best for structured and relational data.
Uses tables, rows, and columns (like Excel).
Popular choices:

  • SQLite (lightweight, local storage)
  • MySQL (scalable, widely used)
  • PostgreSQL (advanced features, high performance)

Use SQL when: Data has clear relationships (e.g., product listings with categories).
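For example, product listings that belong to categories map naturally onto two related tables. A minimal sketch of such a schema (the table and column names are illustrative, not part of any particular project):

-- One row per category; products point back to it via a foreign key
CREATE TABLE categories (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    price TEXT,
    category_id INTEGER REFERENCES categories(id)
);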


B. NoSQL Databases (Unstructured Data)

Best for semi-structured or unstructured data (e.g., JSON, key-value pairs).
Faster for storing large volumes of data without strict schemas.
Popular choices:

  • MongoDB (stores data as JSON-like documents)
  • Firebase (cloud-based NoSQL database)
  • Redis (key-value storage, caching)

Use NoSQL when: Scraping data from sources with flexible or inconsistent structure (e.g., news articles, product descriptions).
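For example, two scraped articles can live in the same MongoDB collection even though their fields differ. A hypothetical sketch:

# Documents in one collection do not need to share a schema
articles = [
    {"title": "Product launch", "author": "Jane", "tags": ["tech"]},
    {"title": "Market update", "published": "2024-01-15"}
]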


2. Storing Data in a SQL Database

Let’s store scraped data in SQLite, a lightweight database that works locally with minimal setup.

A. Setting Up SQLite

SQLite ships with Python’s standard library as the sqlite3 module, so there is nothing to install; simply import sqlite3 in your script.

B. Creating a Database and Table

import sqlite3

# Connect to (or create) database
conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

# Create a table
# Create a table
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    price TEXT,
    url TEXT
)
""")

conn.commit()
conn.close()

C. Storing Scraped Data in SQLite

Assume we have scraped product details:

scraped_data = [
    {"name": "Laptop", "price": "$999", "url": "https://example.com/laptop"},
    {"name": "Phone", "price": "$699", "url": "https://example.com/phone"}
]

Now, we insert this data into the database:

import sqlite3

def store_data(data):
    conn = sqlite3.connect("scraped_data.db")
    cursor = conn.cursor()

    for item in data:
        cursor.execute(
            "INSERT INTO products (name, price, url) VALUES (?, ?, ?)",
            (item["name"], item["price"], item["url"])
        )

    conn.commit()
    conn.close()

store_data(scraped_data)

Data is now stored in SQLite!


3. Storing Data in a NoSQL Database (MongoDB)

MongoDB stores data in JSON-like documents (ideal for unstructured data).

A. Installing MongoDB and pymongo

pymongo is the official Python driver for MongoDB. Install it with pip, and make sure a MongoDB server is running locally or in the cloud (for example, MongoDB Atlas):

pip install pymongo

B. Connecting to MongoDB and Creating a Collection

from pymongo import MongoClient

# Connect to MongoDB (local or cloud)
client = MongoClient("mongodb://localhost:27017/")

# Create database and collection
db = client["scraping_db"]
collection = db["products"]
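Note that MongoDB creates the database and collection lazily: neither exists on the server until the first document is inserted.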

C. Storing Scraped Data in MongoDB

scraped_data = [
    {"name": "Laptop", "price": "$999", "url": "https://example.com/laptop"},
    {"name": "Phone", "price": "$699", "url": "https://example.com/phone"}
]

# Insert data into MongoDB
collection.insert_many(scraped_data)
print("Data inserted successfully!")

Data is now stored in MongoDB!


4. Best Practices for Storing Scraped Data

Optimize database performance and avoid common mistakes:

Use Indexing: Speeds up search queries (CREATE INDEX in SQL, create_index() in pymongo; MongoDB indexes _id automatically).
Avoid Duplicates: Check whether a record already exists before inserting (INSERT OR IGNORE with a UNIQUE constraint in SQLite, update_one() with upsert=True in MongoDB; see the sketch after this list).
Store Timestamps: Track when each record was collected (a DATETIME column in SQL, a datetime field in MongoDB).
Handle Large Data Efficiently:

  • Batch inserts instead of inserting one record at a time.
  • Use LIMIT and pagination when retrieving data.

Clean Data Before Storage: Remove empty fields, trim whitespace, and normalize formats before writing to the database.
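A minimal SQLite sketch combining the duplicate check, the timestamp, and a batch insert (the products_v2 table name, the UNIQUE constraint on url, and the scraped_at column are illustrative choices, not part of the schema created earlier):

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

# UNIQUE(url) lets INSERT OR IGNORE skip rows we have already stored
cursor.execute("""
CREATE TABLE IF NOT EXISTS products_v2 (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    price TEXT,
    url TEXT UNIQUE,
    scraped_at TEXT
)
""")

# Record when each row was collected
now = datetime.now(timezone.utc).isoformat()
rows = [(item["name"], item["price"], item["url"], now) for item in scraped_data]

# executemany() sends the whole batch in one call; duplicate urls are ignored
cursor.executemany(
    "INSERT OR IGNORE INTO products_v2 (name, price, url, scraped_at) VALUES (?, ?, ?, ?)",
    rows
)

conn.commit()
conn.close()

For MongoDB, the duplicate check becomes an upsert, e.g. collection.update_one({"url": item["url"]}, {"$set": item}, upsert=True).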

5. Retrieving Data from Databases

A. Fetch Data from SQLite

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

cursor.execute("SELECT * FROM products")
rows = cursor.fetchall()

for row in rows:
    print(row)

conn.close()
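For large tables, it is better to page through results than to fetch everything at once, as recommended above. A small sketch using LIMIT and OFFSET (the page size of 100 is arbitrary):

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

page_size = 100
offset = 0
while True:
    cursor.execute("SELECT * FROM products LIMIT ? OFFSET ?", (page_size, offset))
    rows = cursor.fetchall()
    if not rows:
        break  # no more pages
    for row in rows:
        print(row)
    offset += page_size

conn.close()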

B. Fetch Data from MongoDB

data = collection.find()
for item in data:
    print(item)
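find() also accepts a filter document, and the returned cursor supports limit() and skip() for pagination. For example (the filter value is illustrative):

# Fetch only laptops, capped at 10 documents
for item in collection.find({"name": "Laptop"}).limit(10):
    print(item)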
