Web scraping is the process of extracting data from websites. BeautifulSoup is a Python library that makes it easy to parse HTML and XML documents, allowing us to extract information quickly.
This guide will cover:
- Installing BeautifulSoup
- Fetching web pages using `requests`
- Parsing and extracting data
- Handling dynamic content
2. Installing Dependencies
Before starting, install the required libraries:
```bash
pip install requests beautifulsoup4 lxml
```
- `requests` → Sends HTTP requests to fetch web pages.
- `beautifulsoup4` → Parses HTML and XML.
- `lxml` → Speeds up parsing.
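To confirm the installs worked, you can print the library versions; this is just a quick sanity check, not part of the scraper itself:

```python
import requests
import bs4

# Print the installed versions to verify the packages imported correctly
print(requests.__version__)
print(bs4.__version__)
```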
3. Fetching a Web Page
We use the `requests` module to fetch a webpage:
```python
import requests

url = "https://example.com"
response = requests.get(url)

# Print the HTML content
print(response.text)
```
- `.get(url)` → Sends a GET request to the URL.
- `.text` → Returns the HTML content as a string.
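In practice, it is worth checking that the request actually succeeded before working with the body. A minimal sketch, where the 10-second `timeout` is an arbitrary choice:

```python
import requests

url = "https://example.com"
response = requests.get(url, timeout=10)

# A 200 status code means the request succeeded
if response.status_code == 200:
    print(response.text[:200])  # first 200 characters of the HTML
else:
    print(f"Request failed with status code {response.status_code}")
```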
4. Parsing HTML with BeautifulSoup
Once we have the HTML, we use BeautifulSoup to parse it.
```python
from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Print formatted HTML
print(soup.prettify())
```
"html.parser"
→ Default parser for BeautifulSoup.
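Since we installed `lxml` earlier, you can pass it in place of `"html.parser"` when speed matters. This sketch reuses the `response` from section 3:

```python
from bs4 import BeautifulSoup

# lxml is the faster third-party parser installed in section 2
soup = BeautifulSoup(response.text, "lxml")
```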
5. Extracting Data from HTML
BeautifulSoup provides various methods to extract data.
5.1 Extracting the Page Title
```python
print(soup.title.text)
```
5.2 Extracting All Links (`<a>` Tags)
```python
for link in soup.find_all("a"):
    print(link.get("href"))
```
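The `href` values are often relative paths. A sketch that normalizes them with the standard library's `urljoin`, reusing the `url` variable from section 3:

```python
from urllib.parse import urljoin

for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # skip <a> tags that have no href attribute
        print(urljoin(url, href))  # resolve relative paths against the page URL
```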
5.3 Extracting All Headings
```python
for heading in soup.find_all(["h1", "h2", "h3"]):
    print(heading.text.strip())
```
5.4 Extracting a Specific Element by Class
```python
element = soup.find("div", class_="example-class")
if element is not None:  # find() returns None when nothing matches
    print(element.text.strip())
```
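If you prefer CSS selectors, BeautifulSoup's `select()` method covers the same case and returns every match rather than just the first:

```python
# select() accepts a CSS selector and returns a list of all matching elements
for element in soup.select("div.example-class"):
    print(element.text.strip())
```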
5.5 Extracting a Table
```python
table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    columns = row.find_all("td")
    print([col.text.strip() for col in columns])
```
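To turn the rows into structured records, you can pair each data row with the header cells. A sketch, assuming the table's first row holds `<th>` header cells:

```python
# Collect the header labels from the <th> cells
headers = [th.text.strip() for th in table.find_all("th")]

records = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.text.strip() for td in row.find_all("td")]
    if cells:
        records.append(dict(zip(headers, cells)))

print(records)
```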
6. Handling Dynamic Content
BeautifulSoup only sees the HTML the server returns, so it works well for static pages. For JavaScript-rendered content, use a browser automation tool such as Selenium.
```python
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# Parse the browser-rendered HTML instead of the raw server response
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())

driver.quit()
```
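JavaScript-heavy pages may still be rendering when `page_source` is read, so an explicit wait is often needed first. A sketch using Selenium's `WebDriverWait`, where the `#content` selector is a placeholder for whatever element you are waiting on:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for the target element to appear before parsing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
```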
7. Storing Scraped Data
7.1 Save to CSV
```python
import csv

data = [["Title", "Link"], ["Example", "https://example.com"]]

with open("data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)
```
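Tying this back to section 5, a sketch that writes the scraped link text and URLs to a CSV file (`links.csv` is an arbitrary filename):

```python
import csv

# Build one row per link scraped earlier, plus a header row
rows = [["Text", "Link"]]
for link in soup.find_all("a"):
    rows.append([link.text.strip(), link.get("href")])

with open("links.csv", "w", newline="", encoding="utf-8") as file:
    csv.writer(file).writerows(rows)
```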
7.2 Save to JSON
```python
import json

data = {"title": "Example", "link": "https://example.com"}

with open("data.json", "w") as file:
    json.dump(data, file, indent=4)
```
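Scraped text often contains non-ASCII characters, and `ensure_ascii=False` keeps them readable in the output file. A small variant that stores a list of records:

```python
import json

records = [{"title": "Example", "link": "https://example.com"}]

# ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes
with open("data.json", "w", encoding="utf-8") as file:
    json.dump(records, file, indent=4, ensure_ascii=False)
```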
8. Best Practices
✔ Respect `robots.txt` – Check a site's rules before scraping it.
✔ Use Headers – Mimic a real browser to avoid being blocked.
✔ Limit Requests – Add delays so you don't overload websites.
✔ Handle Exceptions – Use `try`/`except` to handle errors (see the combined sketch at the end of this section).
Example:
```python
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
```
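Putting the last three practices together, a sketch of a polite fetch loop; the URLs and the one-second delay are placeholders:

```python
import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical URLs
headers = {"User-Agent": "Mozilla/5.0"}

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raises an exception for 4xx/5xx responses
        print(url, len(response.text))
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
    time.sleep(1)  # pause between requests so we don't overload the site
```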