Web Scraping with BeautifulSoup

1. Introduction

Web scraping is the process of extracting data from websites. BeautifulSoup is a Python library that makes it easy to parse HTML and XML documents, allowing us to extract information quickly.

This guide will cover:

  • Installing BeautifulSoup
  • Fetching web pages using requests
  • Parsing and extracting data
  • Handling dynamic content
  • Storing scraped data
  • Best practices

2. Installing Dependencies

Before starting, install the required libraries:

pip install requests beautifulsoup4 lxml
  • requests → Sends HTTP requests to fetch web pages.
  • beautifulsoup4 → Parses HTML and XML.
  • lxml → Speeds up parsing.

3. Fetching a Web Page

We use the requests module to fetch a webpage:

import requests

url = "https://example.com"
response = requests.get(url)

# Print the HTML content
print(response.text)
  • .get(url) → Sends an HTTP GET request to the URL.
  • .text → The response body decoded as a string (the page's HTML).
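
A request can fail, so it is worth checking the response before parsing it. A minimal sketch (the 10-second timeout is an arbitrary choice):

import requests

url = "https://example.com"
response = requests.get(url, timeout=10)

# raise_for_status() raises an HTTPError for 4xx/5xx responses
response.raise_for_status()
print(response.status_code)  # 200 means the page was fetched successfully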

4. Parsing HTML with BeautifulSoup

Once we have the HTML, we use BeautifulSoup to parse it.

from bs4 import BeautifulSoup

# Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")

# Print formatted HTML
print(soup.prettify())
  • "html.parser" → Default parser for BeautifulSoup.

5. Extracting Data from HTML

BeautifulSoup provides various methods to extract data.

5.1 Extracting the Page Title

print(soup.title.text)
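
soup.title is None when a page has no <title> tag, so a defensive variant avoids an AttributeError:

title = soup.title
print(title.text if title else "No title found")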

5.2 Extracting All Links (<a> Tags)

for link in soup.find_all("a"):
    print(link.get("href"))
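
Many href values are relative paths (e.g. /about). The standard library's urljoin can resolve them against the page URL:

from urllib.parse import urljoin

for link in soup.find_all("a"):
    href = link.get("href")
    if href:  # skip <a> tags without an href attribute
        print(urljoin(url, href))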

5.3 Extracting All Headings

for heading in soup.find_all(["h1", "h2", "h3"]):
    print(heading.text.strip())

5.4 Extracting a Specific Element by Class

element = soup.find("div", class_="example-class")
if element is not None:  # find() returns None when nothing matches
    print(element.text.strip())
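
BeautifulSoup also supports CSS selectors via select(); this sketch assumes the same hypothetical example-class:

# Equivalent lookup with a CSS selector (select() returns a list)
for el in soup.select("div.example-class"):
    print(el.text.strip())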

5.5 Extracting a Table

table = soup.find("table")
rows = table.find_all("tr")

for row in rows:
    columns = row.find_all("td")
    print([col.text.strip() for col in columns])
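
If the table has a header row, its cells are usually <th> rather than <td>; searching for both keeps the header in the output:

for row in table.find_all("tr"):
    cells = row.find_all(["th", "td"])
    print([cell.text.strip() for cell in cells])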

6. Handling Dynamic Content

BeautifulSoup only sees the HTML the server returns, so it works well for static pages. For content rendered by JavaScript, use a browser-automation tool such as Selenium.

from selenium import webdriver
from bs4 import BeautifulSoup

# Launch Chrome (Selenium 4.6+ downloads a matching driver automatically)
driver = webdriver.Chrome()
driver.get("https://example.com")

# Hand the rendered page source to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())

driver.quit()
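
JavaScript-heavy pages may still be rendering when page_source is read. Selenium's explicit waits block until a target element appears; this sketch waits for an <h1>, which is an assumption about the page's structure:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for an <h1> to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.prettify())
driver.quit()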

7. Storing Scraped Data

7.1 Save to CSV

import csv

data = [["Title", "Link"], ["Example", "https://example.com"]]

with open("data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)
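
The same pattern extends to rows collected while scraping; for example, reusing the soup object from earlier to save every link's text and URL:

import csv

rows = [["Text", "Link"]]
for link in soup.find_all("a"):
    rows.append([link.text.strip(), link.get("href")])

with open("links.csv", "w", newline="", encoding="utf-8") as file:
    csv.writer(file).writerows(rows)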

7.2 Save to JSON

import json

data = {"title": "Example", "link": "https://example.com"}

with open("data.json", "w") as file:
    json.dump(data, file, indent=4)

8. Best Practices

  • Respect robots.txt – Check a site's scraping rules before you start.
  • Use headers – Mimic a real browser to avoid being blocked.
  • Limit requests – Add delays between requests so you don't overload the site.
  • Handle exceptions – Wrap requests in try/except to recover from network errors.

Example:

headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
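
Putting these practices together, a more defensive loop adds a timeout, error handling, and a pause between requests. A sketch (the URLs and the one-second delay are arbitrary placeholders):

import time
import requests

headers = {"User-Agent": "Mozilla/5.0"}
urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

for url in urls:
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        print(f"Fetched {url} ({len(response.text)} characters)")
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch {url}: {e}")
    time.sleep(1)  # pause between requests to avoid overloading the server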
