Many modern websites use JavaScript to load content dynamically. Traditional scraping with requests and BeautifulSoup fails to extract such content because it only retrieves the initial HTML document, not the data that JavaScript renders afterwards.
Solution: Use tools that can render JavaScript before extracting data, such as:
✔ Selenium – Automates a real browser
✔ Playwright – Fast, headless browser automation
✔ Scrapy-Selenium – Integrates Selenium into Scrapy
✔ Undetected-Chromedriver – Avoids bot detection
2. Identifying JavaScript-Rendered Content
Before scraping, check if the content is JavaScript-rendered:
- Right-click > View Page Source – if the data is visible in the browser but missing from the source, it is loaded via JavaScript.
- Inspect Element (DevTools, F12) – check the Network > XHR/Fetch tab for API calls that return the data.
- Disable JavaScript – if the content disappears, it is JavaScript-dependent.
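The "View Page Source" check can also be automated: fetch the raw HTML and test whether a marker you can see in the browser is actually present. This sketch uses hand-written HTML snippets to stand in for a real fetch, so it runs without any network access:

```python
def content_in_source(html: str, marker: str) -> bool:
    # If `marker` is visible in the browser but absent from the raw HTML,
    # the content is almost certainly rendered by JavaScript.
    return marker in html

# Raw HTML as requests would see it: an empty container plus a script tag.
raw = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"
# The DOM after the browser has run the script and injected the quotes.
rendered = "<html><body><div id='app'><div class='quote'>...</div></div></body></html>"

print(content_in_source(raw, "class='quote'"))       # False
print(content_in_source(rendered, "class='quote'"))  # True
```

In practice you would pass `requests.get(url).text` as `html` and compare against what DevTools shows.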
3. Scraping JavaScript Content with Selenium
Selenium automates web browsers to load JavaScript-rendered content before scraping.
3.1. Install Selenium & WebDriver
pip install selenium
With Selenium 4.6+, Selenium Manager downloads a matching driver automatically. On older versions, download ChromeDriver manually and place it in your system’s PATH.
3.2. Basic Selenium Scraper
from selenium import webdriver
driver = webdriver.Chrome() # Launch Chrome
driver.get("https://example.com") # Load website
print(driver.page_source) # Print the fully rendered HTML
driver.quit() # Close browser
3.3. Extracting Specific Elements
Use find_elements to locate JavaScript-rendered content once the page has loaded.
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/js/")
quotes = driver.find_elements(By.CLASS_NAME, "quote")  # Find all quote elements
for quote in quotes:
    print(quote.text)  # Extract text from each quote
driver.quit()
Works for any JavaScript-rendered page; infinite scrolling pages additionally need the scrolling technique shown in section 5.
4. Waiting for JavaScript Execution
Some sites take time to load JavaScript content. Use WebDriverWait to wait until elements appear.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds until an element with ID "content" is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)
print(driver.page_source)  # Extract the fully rendered page content
driver.quit()
Why? It prevents NoSuchElementException errors when JavaScript content is slow to load; if the element never appears within the timeout, WebDriverWait raises a TimeoutException instead.
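Under the hood, WebDriverWait is essentially a poll-until-truthy loop. A standard-library sketch of the same idea (the delayed "element" below is simulated, not a real page):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    # Poll `condition` until it returns a truthy value, mirroring what
    # WebDriverWait(driver, timeout).until(...) does internally.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# Simulate an element that only "appears" after a one-second delay:
appears_at = time.monotonic() + 1.0
element = wait_until(lambda: time.monotonic() >= appears_at and "content",
                     timeout=5, poll=0.1)
print(element)  # content
```

The timeout ensures a scraper fails fast and loudly rather than hanging forever on a page that never loads.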
5. Handling Infinite Scrolling
Websites like Twitter & LinkedIn load data dynamically as you scroll. Use JavaScript to scroll down and load more content.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# Scroll to the bottom multiple times to trigger loading of new content
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for new content to load

print(driver.page_source)
driver.quit()
Useful for scraping social media feeds or product listings!
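A fixed scroll count can stop too early or waste time; a common refinement is to keep scrolling until document.body.scrollHeight stops growing. In this sketch, `get_height` and `do_scroll` are stand-in callables for the `driver.execute_script` calls, and `FakePage` simulates a feed that loads three more batches and then ends:

```python
import time

def scroll_until_stable(get_height, do_scroll, pause=2.0, max_rounds=50):
    # Scroll until the page height stops growing -- the usual
    # stopping rule for infinite-scroll feeds.
    last = get_height()
    for _ in range(max_rounds):
        do_scroll()
        time.sleep(pause)  # give the site time to append new content
        new = get_height()
        if new == last:
            break  # nothing new loaded; we've reached the end
        last = new
    return last

class FakePage:
    # Simulates a feed that grows three times, then stops.
    def __init__(self):
        self.height, self.scrolls = 1000, 0
    def get_height(self):
        return self.height
    def scroll(self):
        self.scrolls += 1
        if self.scrolls <= 3:
            self.height += 500

page = FakePage()
final = scroll_until_stable(page.get_height, page.scroll, pause=0.01)
print(final)  # 2500
```

With Selenium you would pass `lambda: driver.execute_script("return document.body.scrollHeight")` as `get_height` and `lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")` as `do_scroll`.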
6. Scraping JavaScript Content with Playwright
Playwright is faster than Selenium for headless browser automation.
6.1. Install Playwright
pip install playwright
playwright install
6.2. Playwright Scraper
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # Run browser in headless mode
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    quotes = page.locator(".quote").all()  # Get all quote elements
    for quote in quotes:
        print(quote.inner_text())  # Extract text
    browser.close()
Faster & more efficient than Selenium!
7. Extracting Data from JavaScript API Calls
Sometimes, JavaScript loads data from an API. Instead of using Selenium, directly call the API.
7.1. Find API Calls Using DevTools
- Open the Network tab (F12) > XHR/Fetch
- Identify the API request that fetches the data
- Copy the API URL and use requests to extract the data
7.2. Example: Extracting Data from an API
import requests

response = requests.get("https://api.example.com/data")
response.raise_for_status()  # Fail fast on HTTP errors
data = response.json()  # Parse the JSON body into a Python dict or list
print(data)  # Extract key information
Faster than using Selenium!
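APIs usually return nested JSON, so most of the remaining work is picking out the fields you need. A sketch with an illustrative payload (the `products` structure is made up for the example; real endpoints and field names will differ):

```python
import json

# Illustrative payload -- stands in for response.json() from a real API.
payload = json.loads("""
{"products": [{"name": "Widget", "price": 9.99},
              {"name": "Gadget", "price": 19.99}]}
""")

# Pull out just the fields you need instead of scraping rendered HTML.
names = [item["name"] for item in payload["products"]]
print(names)  # ['Widget', 'Gadget']
```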
8. Avoiding Bot Detection
Many JavaScript-rendered websites detect bots and block them.
🔹 Use random headers & user-agents:
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://example.com", headers=headers)
🔹 Use headless browsers carefully – Some sites detect them.
🔹 Use proxies & rotating IPs – Helps avoid getting blocked.
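Rotating the User-Agent per request is one concrete version of the "random headers" tip. This is a sketch: `random_headers` is a hypothetical helper, and the User-Agent strings are shortened examples, not current real ones.

```python
import random

# Illustrative User-Agent strings -- rotate real, current ones in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def random_headers():
    # Pick a different User-Agent per request so traffic looks less uniform.
    return {"User-Agent": random.choice(USER_AGENTS)}

print(random_headers()["User-Agent"].startswith("Mozilla/5.0"))  # True
```

Then call `requests.get(url, headers=random_headers())` so each request carries a different header.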
9. When to Use Which Tool?
| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Selenium | Interactive pages, infinite scrolling | Handles JavaScript well | Slower, high resource usage |
| Playwright | Fast headless scraping | Better performance than Selenium | Requires installation |
| API Requests | Extracting JSON from APIs | Fastest, avoids JavaScript | Not all sites expose APIs |
| Scrapy-Selenium | Large-scale scraping | Scales well with proxies | Complex setup |