Scraping Dynamic Websites with Selenium

Web scraping becomes challenging when websites load content dynamically using JavaScript. Selenium is a powerful tool that automates web browsers, allowing us to interact with web pages just like a human user. It enables us to scrape dynamic content that BeautifulSoup alone cannot handle.

What You’ll Learn:

✔ Setting up Selenium
✔ Navigating web pages
✔ Handling dynamic content
✔ Extracting data
✔ Best practices


2. Installing Selenium

To use Selenium, install it along with a web driver (e.g., ChromeDriver for Google Chrome).

2.1 Install Selenium in Python

pip install selenium

2.2 Download ChromeDriver

  1. Go to the ChromeDriver download page.
  2. Download the version that matches your installed Chrome browser.
  3. Place chromedriver.exe (or the chromedriver binary on macOS/Linux) in a known directory.

3. Setting Up Selenium

3.1 Open a Web Page

from selenium import webdriver

# Set up Chrome WebDriver
driver = webdriver.Chrome()

# Open a website
driver.get("https://example.com")

# Print page title
print(driver.title)

# Close the browser
driver.quit()

webdriver.Chrome() → Opens a Chrome browser.
.get(url) → Navigates to the given website.
.quit() → Closes the browser after scraping.


4. Extracting Data from Dynamic Websites

Websites often load data dynamically. Selenium allows us to wait for elements before extracting data.

4.1 Locating Elements

Selenium uses different methods to find elements:

Method → What it finds
driver.find_element(By.ID, "element_id") → Element with a matching id attribute
driver.find_element(By.CLASS_NAME, "class_name") → Element with a matching class
driver.find_element(By.TAG_NAME, "h1") → Element with the given tag name
driver.find_element(By.XPATH, "//h1") → Element matching an XPath expression

Note: the older find_element_by_id / find_element_by_class_name helpers were removed in Selenium 4. Use find_element(By...., "...") together with from selenium.webdriver.common.by import By.

Extract Page Title

from selenium.webdriver.common.by import By

title = driver.find_element(By.TAG_NAME, "h1").text
print(title)

Extract All Links on a Page

links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute("href"))

5. Handling Dynamic Content (Wait for Elements)

Some elements take time to load. Selenium provides explicit waits.

5.1 Using Explicit Waits

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for an element to load
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic_element_id"))
)
print(element.text)

WebDriverWait(driver, 10).until(...) → Polls for the element for up to 10 seconds, raising TimeoutException if it never appears.


6. Interacting with Web Pages

Selenium can interact with elements like buttons, forms, and dropdowns.

6.1 Clicking a Button

button = driver.find_element(By.ID, "submit_button")
button.click()

6.2 Filling Out a Form

input_box = driver.find_element(By.NAME, "username")
input_box.send_keys("my_username")

6.3 Handling Dropdowns

from selenium.webdriver.support.ui import Select

dropdown = Select(driver.find_element(By.ID, "dropdown_id"))
dropdown.select_by_visible_text("Option 1")

7. Handling Infinite Scrolling

Some websites load content continuously as you scroll.

7.1 Scroll to Bottom

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

7.2 Scroll and Extract Data

import time

last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give new content time to load

    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # page height unchanged: no new content, stop scrolling
    last_height = new_height

8. Saving Scraped Data

8.1 Save to CSV

import csv

data = [["Title", "URL"], ["Example", "https://example.com"]]

with open("scraped_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)
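If each scraped record is a dict (for example, one per extracted link), csv.DictWriter keeps the columns aligned by key. A small self-contained sketch with made-up example rows standing in for real scraped data:

```python
import csv

# Hypothetical scraped records -- in practice these would come from Selenium
rows = [
    {"title": "Example Domain", "url": "https://example.com"},
    {"title": "IANA", "url": "https://www.iana.org"},
]

with open("scraped_links.csv", "w", newline="") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "url"])
    writer.writeheader()    # first line: title,url
    writer.writerows(rows)  # one CSV line per record
```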

8.2 Save to JSON

import json

data = {"title": "Example", "url": "https://example.com"}

with open("scraped_data.json", "w") as file:
    json.dump(data, file, indent=4)
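When scraping many pages, the usual pattern is to collect a list of dicts and dump it once at the end of the crawl. A minimal sketch with hypothetical records:

```python
import json

# Hypothetical scraped records -- one dict per page or link
records = [
    {"title": "Example Domain", "url": "https://example.com"},
    {"title": "IANA", "url": "https://www.iana.org"},
]

with open("scraped_records.json", "w") as file:
    json.dump(records, file, indent=4)

# Reading the data back for later processing
with open("scraped_records.json") as file:
    loaded = json.load(file)
print(len(loaded))  # → 2
```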

9. Best Practices for Web Scraping with Selenium

Use Explicit Waits – Prefer WebDriverWait over fixed time.sleep() calls.
Use Headless Mode Carefully – Headless browsers are easier for sites to detect as automation.
Rotate User Agents – Vary the browser's user-agent string to mimic real users.
Respect robots.txt – Follow each site's scraping policies.
Use Proxies – Spread requests across IPs to avoid bans in large-scale scraping.
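The robots.txt check above can be automated with the standard library's urllib.robotparser. Here the parser reads a hard-coded example policy from strings so the sketch runs offline; against a real site you would instead set the URL and call rp.read():

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt policy, parsed from strings so no network is needed
policy = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(policy)

# Paths under /private/ are disallowed; everything else is allowed
print(rp.can_fetch("*", "https://example.com/public/page"))   # → True
print(rp.can_fetch("*", "https://example.com/private/page"))  # → False
```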

Using Headless Mode for Faster Scraping

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()

Runs Selenium without opening a visible browser window, which is faster and works on servers without a display.
