1. Introduction to Scrapy
Scrapy is a powerful Python framework for web scraping that allows developers to extract data from websites efficiently. Unlike BeautifulSoup (an HTML parser) and Selenium (a browser automation tool), Scrapy is a complete framework designed for large-scale scraping and crawling, handling requests asynchronously for better performance.
Why Use Scrapy?
✔ Fast and asynchronous requests
✔ Built-in handling of pagination and crawling
✔ Supports data export (JSON, CSV, XML)
✔ Follows website rules (robots.txt)
2. Installing Scrapy
Before using Scrapy, install it using pip:
pip install scrapy
Check the installation by running:
scrapy version
3. Creating a Scrapy Project
To start a Scrapy project, run:
scrapy startproject my_scraper
This creates a project structure:
my_scraper/
│── my_scraper/
│   │── spiders/        # Place to write crawlers
│   │── items.py        # Define data structures
│   │── middlewares.py  # Request handling
│   │── pipelines.py    # Post-processing
│   │── settings.py     # Project settings
│── scrapy.cfg          # Scrapy configuration
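For example, items.py can hold a structured item for the fields scraped later in this guide (the ArticleItem name is illustrative, not something Scrapy generates):
import scrapy

class ArticleItem(scrapy.Item):
    # Fields used by the extraction examples below
    title = scrapy.Field()
    link = scrapy.Field()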
4. Creating a Scrapy Spider
From the project's root folder, generate a new spider (Scrapy places it in spiders/ for you):
cd my_scraper
scrapy genspider example example.com
This generates example.py inside spiders/ with the following content:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        self.log(f"Visited: {response.url}")
5. Extracting Data with Scrapy
Modify parse() to extract data using CSS selectors or XPath.
5.1 Extracting Page Title
def parse(self, response):
    title = response.css("title::text").get()
    yield {"title": title}
5.2 Extracting Links
def parse(self, response):
    for link in response.css("a::attr(href)").getall():
        yield {"link": link}
5.3 Extracting Headlines
def parse(self, response):
    for headline in response.css("h1::text, h2::text").getall():
        yield {"headline": headline}
6. Running the Spider
To run the spider and save the output as JSON:
scrapy crawl example -o output.json
For CSV output:
scrapy crawl example -o output.csv
Note that -o appends to an existing file; since Scrapy 2.0, -O overwrites it instead.
7. Handling Pagination
To scrape multiple pages, have parse() yield a follow-up request for the pagination link:
def parse(self, response):
    for article in response.css(".article"):
        yield {
            "title": article.css("h2::text").get(),
            "link": article.css("a::attr(href)").get(),
        }

    # response.follow() resolves relative URLs against the current page
    next_page = response.css("a.next::attr(href)").get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)
8. Exporting Data
Scrapy supports multiple formats:
| Format | Command |
| --- | --- |
| JSON | scrapy crawl example -o data.json |
| CSV | scrapy crawl example -o data.csv |
| XML | scrapy crawl example -o data.xml |
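Exports can also be configured once in settings.py through the FEEDS setting (available since Scrapy 2.1), so the -o flag isn't needed on every run; a minimal sketch:
FEEDS = {
    "data.json": {"format": "json"},
    "data.csv": {"format": "csv"},
}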
9. Using Scrapy Shell for Testing
Scrapy provides an interactive shell to test selectors:
scrapy shell "https://example.com"
Example usage inside Scrapy shell:
response.css("title::text").get()
response.xpath("//h1/text()").get()
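The shell also provides helpers such as fetch() to load another page into the session and view(response) to open the current response in your browser (the URL here is just a placeholder):
fetch("https://example.com/some-page")
view(response)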
10. Storing Data in a Database
Modify pipelines.py to save data to a SQLite database:
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts
        self.conn = sqlite3.connect("scraped_data.db")
        self.cursor = self.conn.cursor()
        self.cursor.execute("CREATE TABLE IF NOT EXISTS data (title TEXT, link TEXT)")

    def process_item(self, item, spider):
        # Runs for every item the spider yields; .get() tolerates items
        # that carry only one of the two fields
        self.cursor.execute("INSERT INTO data VALUES (?, ?)", (item.get("title"), item.get("link")))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
Enable the pipeline in settings.py:
ITEM_PIPELINES = {
    "my_scraper.pipelines.SQLitePipeline": 300,
}
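After a crawl, you can verify the stored rows with a few lines of Python (reading the scraped_data.db file created by the pipeline above):
import sqlite3

conn = sqlite3.connect("scraped_data.db")
for title, link in conn.execute("SELECT title, link FROM data"):
    print(title, link)
conn.close()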
11. Handling JavaScript with Scrapy-Selenium
For JavaScript-loaded content, install scrapy-selenium:
pip install scrapy-selenium
Then create a spider that issues SeleniumRequest objects (this goes in spiders/, not middlewares.py):
import scrapy
from scrapy_selenium import SeleniumRequest

class JSPageSpider(scrapy.Spider):
    name = "js_spider"

    def start_requests(self):
        # Render the page in a real browser before handing it to parse()
        yield SeleniumRequest(url="https://example.com", callback=self.parse)

    def parse(self, response):
        title = response.css("title::text").get()
        yield {"title": title}
Enable the Selenium middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
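scrapy-selenium also needs to know which browser driver to launch; per its README, settings along these lines go in settings.py (browser choice and driver path depend on your machine):
from shutil import which

SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_EXECUTABLE_PATH = which("geckodriver")
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]  # run the browser without a window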
12. Best Practices for Scrapy
✔ Respect robots.txt – Set ROBOTSTXT_OBEY = True in settings.py.
✔ Set a User-Agent – Reduces the chance of being blocked:
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
✔ Enable Logging – Helps debug issues; set the verbosity in settings.py:
LOG_LEVEL = "INFO"
✔ Limit Request Rate – Reduce server load with a delay (in seconds) between requests:
DOWNLOAD_DELAY = 2
✔ Rotate Proxies – Prevent IP bans using scrapy-rotating-proxies (see the middleware sketch below):
pip install scrapy-rotating-proxies
Add to settings.py:
ROTATING_PROXY_LIST = ["proxy1:port", "proxy2:port"]
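The proxy list alone does nothing until the package's middlewares are registered; per its README, they are added to the same DOWNLOADER_MIDDLEWARES dict shown earlier:
DOWNLOADER_MIDDLEWARES = {
    # Rotates requests through ROTATING_PROXY_LIST
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead when ban responses are detected
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}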