Storing Scraped Data in Databases

After scraping data from websites, the next step is storing it efficiently in a database for future use. Whether you’re handling small datasets or big data, choosing the right database and structure is crucial for scalability and performance.

In this guide, we’ll cover:

  • Choosing the right database (SQL vs. NoSQL)
  • Setting up a database
  • Storing scraped data using Python
  • Best practices for database storage


1. Choosing the Right Database

There are two main types of databases for storing scraped data:

A. SQL Databases (Structured Data)

Best for structured and relational data.
Uses tables, rows, and columns (like Excel).
Popular choices:

  • SQLite (lightweight, local storage)
  • MySQL (scalable, widely used)
  • PostgreSQL (advanced features, high performance)

Use SQL when: Data has clear relationships (e.g., product listings with categories).
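For example, product listings that belong to categories map naturally onto two related tables. A minimal sketch of such a schema (the table and column names are illustrative, not part of any particular project):

-- One row per category; products point back to it via a foreign key
CREATE TABLE categories (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE
);

CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    price TEXT,
    category_id INTEGER REFERENCES categories(id)
);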


B. NoSQL Databases (Unstructured Data)

Best for semi-structured or unstructured data (e.g., JSON, key-value pairs).
Faster for storing large volumes of data without strict schemas.
Popular choices:

  • MongoDB (stores data as JSON-like documents)
  • Firebase (cloud-based NoSQL database)
  • Redis (key-value storage, caching)

Use NoSQL when: Scraping data from sources with flexible or inconsistent structure (e.g., news articles, product descriptions).
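For example, two scraped articles can live in the same MongoDB collection even though their fields differ. A hypothetical sketch:

# Documents in one collection do not need to share a schema
articles = [
    {"title": "Product launch", "author": "Jane", "tags": ["tech"]},
    {"title": "Market update", "published": "2024-01-15"}
]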


2. Storing Data in a SQL Database

Let’s store scraped data in SQLite, a lightweight database that works locally with minimal setup.

A. Setting Up SQLite

SQLite ships with Python’s standard library as the sqlite3 module, so there is nothing to install; simply import sqlite3 in your script.

B. Creating a Database and Table

import sqlite3

# Connect to (or create) database
conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

# Create a table
# Create a table
cursor.execute("""
CREATE TABLE IF NOT EXISTS products (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    price TEXT,
    url TEXT
)
""")

conn.commit()
conn.close()

C. Storing Scraped Data in SQLite

Assume we have scraped product details:

scraped_data = [
    {"name": "Laptop", "price": "$999", "url": "https://example.com/laptop"},
    {"name": "Phone", "price": "$699", "url": "https://example.com/phone"}
]

Now, we insert this data into the database:

import sqlite3

def store_data(data):
    conn = sqlite3.connect("scraped_data.db")
    cursor = conn.cursor()

    for item in data:
        cursor.execute(
            "INSERT INTO products (name, price, url) VALUES (?, ?, ?)",
            (item["name"], item["price"], item["url"])
        )

    conn.commit()
    conn.close()

store_data(scraped_data)

Data is now stored in SQLite!


3. Storing Data in a NoSQL Database (MongoDB)

MongoDB stores data in JSON-like documents (ideal for unstructured data).

A. Installing MongoDB and pymongo

pymongo is the official Python driver for MongoDB. Install it with pip, and make sure a MongoDB server is running locally or in the cloud (for example, MongoDB Atlas):

pip install pymongo

B. Connecting to MongoDB and Creating a Collection

from pymongo import MongoClient

# Connect to MongoDB (local or cloud)
client = MongoClient("mongodb://localhost:27017/")

# Create database and collection
db = client["scraping_db"]
collection = db["products"]
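Note that MongoDB creates the database and collection lazily: neither exists on the server until the first document is inserted.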

C. Storing Scraped Data in MongoDB

scraped_data = [
    {"name": "Laptop", "price": "$999", "url": "https://example.com/laptop"},
    {"name": "Phone", "price": "$699", "url": "https://example.com/phone"}
]

# Insert data into MongoDB
collection.insert_many(scraped_data)
print("Data inserted successfully!")

Data is now stored in MongoDB!


4. Best Practices for Storing Scraped Data

Optimize database performance and avoid common mistakes:

Use Indexing: Speeds up search queries (CREATE INDEX in SQL, create_index() in pymongo; MongoDB indexes _id automatically).
Avoid Duplicates: Check whether a record already exists before inserting (INSERT OR IGNORE with a UNIQUE constraint in SQLite, update_one() with upsert=True in MongoDB; see the sketch after this list).
Store Timestamps: Track when each record was collected (a DATETIME column in SQL, a datetime field in MongoDB).
Handle Large Data Efficiently:

  • Batch inserts instead of inserting one record at a time.
  • Use LIMIT and pagination when retrieving data.

Clean Data Before Storage: Remove empty fields, trim whitespace, and normalize formats before writing to the database.
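A minimal SQLite sketch combining the duplicate check, the timestamp, and a batch insert (the products_v2 table name, the UNIQUE constraint on url, and the scraped_at column are illustrative choices, not part of the schema created earlier):

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

# UNIQUE(url) lets INSERT OR IGNORE skip rows we have already stored
cursor.execute("""
CREATE TABLE IF NOT EXISTS products_v2 (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    price TEXT,
    url TEXT UNIQUE,
    scraped_at TEXT
)
""")

# Record when each row was collected
now = datetime.now(timezone.utc).isoformat()
rows = [(item["name"], item["price"], item["url"], now) for item in scraped_data]

# executemany() sends the whole batch in one call; duplicate urls are ignored
cursor.executemany(
    "INSERT OR IGNORE INTO products_v2 (name, price, url, scraped_at) VALUES (?, ?, ?, ?)",
    rows
)

conn.commit()
conn.close()

For MongoDB, the duplicate check becomes an upsert, e.g. collection.update_one({"url": item["url"]}, {"$set": item}, upsert=True).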

5. Retrieving Data from Databases

A. Fetch Data from SQLite

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

cursor.execute("SELECT * FROM products")
rows = cursor.fetchall()

for row in rows:
    print(row)

conn.close()
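For large tables, it is better to page through results than to fetch everything at once, as recommended above. A small sketch using LIMIT and OFFSET (the page size of 100 is arbitrary):

import sqlite3

conn = sqlite3.connect("scraped_data.db")
cursor = conn.cursor()

page_size = 100
offset = 0
while True:
    cursor.execute("SELECT * FROM products LIMIT ? OFFSET ?", (page_size, offset))
    rows = cursor.fetchall()
    if not rows:
        break  # no more pages
    for row in rows:
        print(row)
    offset += page_size

conn.close()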

B. Fetch Data from MongoDB

data = collection.find()
for item in data:
    print(item)
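find() also accepts a filter document, and the returned cursor supports limit() and skip() for pagination. For example (the filter value is illustrative):

# Fetch only laptops, capped at 10 documents
for item in collection.find({"name": "Laptop"}).limit(10):
    print(item)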
