Using Scrapy for Web Scraping

Scrapy is a powerful Python framework for web scraping that lets developers extract data from websites efficiently. Unlike BeautifulSoup (a parsing library) or Selenium (a browser-automation tool), Scrapy is a complete framework designed for large-scale scraping and crawling, handling requests asynchronously for better performance.

1. Why Use Scrapy?

✔ Fast and asynchronous requests
✔ Built-in handling of pagination and crawling
✔ Supports data export (JSON, CSV, XML)
✔ Follows website rules (robots.txt)


2. Installing Scrapy

Before using Scrapy, install it using pip:

pip install scrapy

Check the installation by running:

scrapy version

3. Creating a Scrapy Project

To start a Scrapy project, run:

scrapy startproject my_scraper

This creates a project structure:

my_scraper/
├── my_scraper/
│   ├── spiders/         # Place to write crawlers
│   ├── items.py         # Define data structures
│   ├── middlewares.py   # Request handling
│   ├── pipelines.py     # Post-processing
│   └── settings.py      # Project-wide settings
└── scrapy.cfg           # Scrapy configuration

4. Creating a Scrapy Spider

From the project root, generate a new spider:

cd my_scraper
scrapy genspider example example.com

This generates example.py inside spiders/ with a skeleton along these lines:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        self.log(f"Visited: {response.url}")

5. Extracting Data with Scrapy

Modify parse() to extract data using CSS selectors or XPath.

5.1 Extracting Page Title

def parse(self, response):
    title = response.css("title::text").get()
    yield {"title": title}

5.2 Extracting Links

def parse(self, response):
    for link in response.css("a::attr(href)").getall():
        yield {"link": link}

5.3 Extracting Headlines

def parse(self, response):
    for headline in response.css("h1::text, h2::text").getall():
        yield {"headline": headline}

6. Running the Spider

To run the spider and save the output in JSON format:

scrapy crawl example -o output.json

For CSV output:

scrapy crawl example -o output.csv
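
Note that in recent Scrapy releases -o appends to an existing file, while the uppercase -O option overwrites it.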

7. Handling Pagination

To scrape multiple pages, have the spider follow the pagination links it finds:

def parse(self, response):
    for article in response.css(".article"):
        yield {
            "title": article.css("h2::text").get(),
            "link": article.css("a::attr(href)").get(),
        }

    next_page = response.css("a.next::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
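
Note that response.follow() accepts relative URLs, so the raw href value can be passed straight in without joining it against the page URL first.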

8. Exporting Data

Scrapy supports multiple formats:

Format   Command
JSON     scrapy crawl example -o data.json
CSV      scrapy crawl example -o data.csv
XML      scrapy crawl example -o data.xml
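
As an alternative to the -o flag, newer Scrapy versions also support a FEEDS setting, so exports can be configured once in settings.py. A minimal sketch:

FEEDS = {
    "data.json": {"format": "json"},
}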

9. Using Scrapy Shell for Testing

Scrapy provides an interactive shell to test selectors:

scrapy shell "https://example.com"

Example usage inside Scrapy shell:

response.css("title::text").get()
response.xpath("//h1/text()").get()
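
The shell also provides helpers such as fetch() to load another page into response and view() to open the downloaded HTML in a browser:

fetch("https://example.com")  # download a page and rebind `response`
view(response)                # inspect what Scrapy actually received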

10. Storing Data in a Database

Modify pipelines.py to save data to a SQLite database:

import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # Open the database and create the table once per crawl
        self.conn = sqlite3.connect("scraped_data.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("CREATE TABLE IF NOT EXISTS data (title TEXT, link TEXT)")

    def process_item(self, item, spider):
        # Called once for every item the spider yields
        self.cursor.execute("INSERT INTO data VALUES (?, ?)", (item["title"], item["link"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    "my_scraper.pipelines.SQLitePipeline": 300,
}
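
The number (300 here) sets the pipeline's position in the processing chain: values range from 0 to 1000, and lower-numbered pipelines run first.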

11. Handling JavaScript with Scrapy-Selenium

For JavaScript-loaded content, install scrapy-selenium:

pip install scrapy-selenium

Then use SeleniumRequest in a spider (e.g. in spiders/js_spider.py):

import scrapy
from scrapy_selenium import SeleniumRequest

class JSPageSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        # SeleniumRequest renders the page in a real browser before parsing
        yield SeleniumRequest(url="https://example.com", callback=self.parse)

    def parse(self, response):
        title = response.css("title::text").get()
        yield {"title": title}

Enable Selenium middleware in settings.py:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
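
scrapy-selenium also needs to know which browser driver to launch. A sketch of the accompanying settings.py entries, assuming a local Chrome driver (the executable path is a placeholder):

SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_EXECUTABLE_PATH = "/path/to/chromedriver"  # placeholder path
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]  # run the browser without a window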

12. Best Practices for Scrapy

Respect robots.txt – Set ROBOTSTXT_OBEY = True in settings.py.
Set a User-Agent – Identify your client and reduce the chance of getting blocked:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

Enable Logging – Helps debug issues. Scrapy configures Python logging itself, so set the verbosity in settings.py:

LOG_LEVEL = "INFO"

Limit Requests – Reduce server load:

DOWNLOAD_DELAY = 2  # seconds between consecutive requests to the same site
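
Alternatively, Scrapy's built-in AutoThrottle extension (AUTOTHROTTLE_ENABLED = True in settings.py) adjusts the delay automatically based on server response times.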

Rotate Proxies – Prevent IP bans using scrapy-rotating-proxies:

pip install scrapy-rotating-proxies

Add in settings.py:

ROTATING_PROXY_LIST = ["proxy1:port", "proxy2:port"]
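
Per that library's documentation, its downloader middlewares must also be enabled for the proxy list to take effect:

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}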
