Web scraping becomes challenging when websites load content dynamically using JavaScript. Selenium is a powerful tool that automates web browsers, allowing us to interact with web pages just like a human user. It enables us to scrape dynamic content that BeautifulSoup alone cannot handle.
What You’ll Learn:
✔ Setting up Selenium
✔ Navigating web pages
✔ Handling dynamic content
✔ Extracting data
✔ Best practices
2. Installing Selenium
To use Selenium, install it along with a web driver (e.g., ChromeDriver for Google Chrome).
2.1 Install Selenium in Python
pip install selenium
2.2 Download ChromeDriver
- Go to the ChromeDriver download page.
- Download the version matching your installed Chrome browser.
- Place chromedriver.exe (or chromedriver on macOS/Linux) in a known directory, or add it to your PATH.
Note: Selenium 4.6+ ships with Selenium Manager, which downloads a matching driver automatically, so in recent versions this manual step is often unnecessary.
3. Setting Up Selenium
3.1 Open a Web Page
from selenium import webdriver
# Set up Chrome WebDriver
driver = webdriver.Chrome()
# Open a website
driver.get("https://example.com")
# Print page title
print(driver.title)
# Close the browser
driver.quit()
✅ webdriver.Chrome() → Opens a Chrome browser.
✅ .get(url) → Navigates to the given URL.
✅ .quit() → Closes the browser after scraping.
4. Extracting Data from Dynamic Websites
Websites often load data dynamically. Selenium allows us to wait for elements before extracting data.
4.1 Locating Elements
Selenium locates elements with find_element (first match) and find_elements (all matches), paired with a By locator:
Method | Description |
---|---|
find_element(By.ID, "element_id") | Find element by ID |
find_element(By.CLASS_NAME, "class_name") | Find by class name |
find_element(By.TAG_NAME, "h1") | Find by tag name |
find_element(By.XPATH, "//h1") | Find using XPath |
(The older find_element_by_* helpers were removed in Selenium 4.)
Extract Page Title
from selenium.webdriver.common.by import By
title = driver.find_element(By.TAG_NAME, "h1").text
print(title)
Extract All Links on a Page
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute("href"))
5. Handling Dynamic Content (Wait for Elements)
Some elements take time to load. Selenium provides explicit waits.
5.1 Using Explicit Waits
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait for an element to load
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dynamic_element_id"))
)
print(element.text)
✅ WebDriverWait(driver, 10).until(...) → Waits up to 10 seconds for the element to appear, raising a TimeoutException if it never does.
6. Interacting with Web Pages
Selenium can interact with elements like buttons, forms, and dropdowns.
6.1 Clicking a Button
button = driver.find_element(By.ID, "submit_button")
button.click()
6.2 Filling Out a Form
input_box = driver.find_element(By.NAME, "username")
input_box.send_keys("my_username")
6.3 Handling Dropdowns
from selenium.webdriver.support.ui import Select
dropdown = Select(driver.find_element(By.ID, "dropdown_id"))
dropdown.select_by_visible_text("Option 1")
7. Handling Infinite Scrolling
Some websites load content continuously as you scroll.
7.1 Scroll to Bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
7.2 Scroll and Extract Data
import time
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give new content time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content appeared; stop scrolling
    last_height = new_height
8. Saving Scraped Data
8.1 Save to CSV
import csv
data = [["Title", "URL"], ["Example", "https://example.com"]]
with open("scraped_data.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)
8.2 Save to JSON
import json
data = {"title": "Example", "url": "https://example.com"}
with open("scraped_data.json", "w") as file:
    json.dump(data, file, indent=4)
9. Best Practices for Web Scraping with Selenium
✔ Use Explicit Waits – Prefer WebDriverWait over time.sleep() where possible.
✔ Watch for Bot Detection – Some sites detect headless or automated browsers.
✔ Rotate User Agents – Mimic real users with realistic request headers.
✔ Respect robots.txt – Follow each site's scraping policies.
✔ Use Proxies – Prevent IP bans during large-scale scraping.
Using Headless Mode for Faster Scraping
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
Runs Selenium without opening a visible browser.