Scraping JavaScript-Rendered Content

1. Why Traditional Scraping Fails

Many modern websites use JavaScript to load content dynamically. Traditional web scraping methods using BeautifulSoup or requests fail to extract such content because they only retrieve the initial HTML page, not the JavaScript-rendered data.

Solution: Use tools that can render JavaScript before extracting data, such as:
Selenium – Automates a real browser
Playwright – Fast, headless browser automation
Scrapy-Selenium – Integrates Selenium into Scrapy
Undetected-Chromedriver – Avoids bot detection


2. Identifying JavaScript-Rendered Content

Before scraping, check if the content is JavaScript-rendered:
Right-click > View Page Source – If data is missing but visible in the browser, it is loaded via JavaScript.
Inspect Element (DevTools, F12) – Check the Network > XHR/Fetch tab to find API calls returning data.
Disable JavaScript – If content disappears, it is JavaScript-dependent.
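The manual checks above can be automated with a small helper: if text that is visible in the rendered page is missing from the raw HTML you fetch without a browser, the content is almost certainly injected by JavaScript. A minimal sketch (the helper name is illustrative; in practice you would pass `requests.get(url).text` as `raw_html`):

```python
def looks_js_rendered(raw_html: str, visible_text: str) -> bool:
    """Return True if text that is visible in the rendered page is absent
    from the raw HTML fetched without a browser (e.g. via requests)."""
    return visible_text not in raw_html

# Raw HTML of a page whose quotes are injected client-side: the container
# div is present, but the quote text itself is not.
raw = '<html><body><div class="quotes"></div><script src="app.js"></script></body></html>'
print(looks_js_rendered(raw, "The world as we have created it"))  # True
```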


3. Scraping JavaScript Content with Selenium

Selenium automates web browsers to load JavaScript-rendered content before scraping.

3.1. Install Selenium & WebDriver

pip install selenium

With Selenium 4.6+, Selenium Manager downloads a matching driver automatically. On older versions, download ChromeDriver manually and place it in your system's PATH.

3.2. Basic Selenium Scraper

from selenium import webdriver

driver = webdriver.Chrome() # Launch Chrome
driver.get("https://example.com") # Load website

print(driver.page_source) # Print the fully rendered HTML
driver.quit() # Close browser

3.3. Extracting Specific Elements

Use find_elements to locate all matching JavaScript-rendered elements.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/js/")

quotes = driver.find_elements(By.CLASS_NAME, "quote") # Find all quote elements

for quote in quotes:
    print(quote.text)  # Extract text from each quote

driver.quit()

This approach also works for dynamic pages such as infinite-scroll listings.
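Each element's .text joins its child elements with newlines, so a small helper can split it into structured fields. The three-line layout assumed below (quote, "by <author>", "Tags: ...") is an illustration; adjust it to how the real page renders:

```python
def parse_quote_text(text: str) -> dict:
    """Split the .text of a rendered quote element into fields.
    The three-line layout (quote, "by <author>", "Tags: ...") is an
    assumption about how the page renders -- adjust to the real markup."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return {
        "quote": lines[0],
        "author": lines[1].removeprefix("by ").strip(),
        "tags": lines[2].removeprefix("Tags:").split(),
    }

sample = "“The world as we have created it...”\nby Albert Einstein\nTags: change deep-thoughts"
print(parse_quote_text(sample)["author"])  # Albert Einstein
```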


4. Waiting for JavaScript Execution

Some sites take time to load JavaScript content. Use WebDriverWait to wait until elements appear.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait until an element with ID "content" is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)

print(driver.page_source) # Extract page content
driver.quit()

Why? Prevents errors caused by missing elements when JavaScript content is slow to load.
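Under the hood, WebDriverWait is essentially a polling loop: it calls a condition repeatedly until it returns a truthy value or the timeout elapses. A simplified pure-Python sketch of the idea (no browser required):

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Simplified sketch of what WebDriverWait does internally: call
    `condition` repeatedly until it returns a truthy value, then return
    that value; raise TimeoutError if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Example with a condition that only succeeds on the third poll:
calls = {"n": 0}
def slow_condition():
    calls["n"] += 1
    return "ready" if calls["n"] >= 3 else None

print(wait_until(slow_condition, timeout=5, poll=0.01))  # ready
```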


5. Handling Infinite Scrolling

Websites like Twitter & LinkedIn load data dynamically as you scroll. Use JavaScript to scroll down and load more content.

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# Scroll multiple times to load new content
for _ in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for new content to load

print(driver.page_source)
driver.quit()

Useful for scraping social media feeds or product listings!
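A fixed number of scrolls is fragile; a more robust variant keeps scrolling until the page height stops growing. The stopping logic can be sketched independently of the browser by injecting the two driver calls as callables (with Selenium, `get_height` would be `lambda: driver.execute_script("return document.body.scrollHeight")` and `scroll_once` the scrollTo call above):

```python
def scroll_until_stable(get_height, scroll_once, max_rounds=20):
    """Scroll until the document height stops growing. The injected
    callables stand in for driver.execute_script calls; returns the
    number of scroll rounds performed."""
    last = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        new = get_height()
        if new == last:   # nothing new loaded -> stop
            return rounds
        last = new
    return max_rounds

# Simulated page: height grows for three scrolls, then stabilizes.
heights = iter([1000, 2000, 3000, 3500, 3500, 3500])
current = {"h": next(heights)}
def fake_height():
    return current["h"]
def fake_scroll():
    current["h"] = next(heights, current["h"])

print(scroll_until_stable(fake_height, fake_scroll))  # 4
```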


6. Scraping JavaScript Content with Playwright

Playwright is a newer browser-automation library that is generally faster and more resource-efficient than Selenium for headless scraping.

6.1. Install Playwright

pip install playwright
playwright install

6.2. Playwright Scraper

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # Run browser in headless mode
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")

    quotes = page.locator(".quote").all()  # Get all quote elements
    for quote in quotes:
        print(quote.inner_text())  # Extract text

    browser.close()

Faster & more efficient than Selenium!


7. Extracting Data from JavaScript API Calls

Sometimes, JavaScript loads data from an API. Instead of using Selenium, directly call the API.

7.1. Find API Calls Using DevTools

  1. Open Network Tab (F12) > XHR/Fetch
  2. Identify API request fetching data
  3. Copy the API URL and use requests to extract data

7.2. Example: Extracting Data from an API

import requests

response = requests.get("https://api.example.com/data")
data = response.json() # Convert to Python dictionary

print(data) # Extract key information

Faster than using Selenium!
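Real APIs are often paginated, so a loop that requests page after page until an empty result comes back is a common pattern. The pagination scheme below (a `page` number, empty list at the end) is an assumption; check the real API's parameters in DevTools. The fetcher is injected so the collection logic stays independent of requests:

```python
def fetch_all_pages(fetch_page, start=1):
    """Collect results from a paginated JSON API. `fetch_page(n)` should
    return the decoded list for page n (e.g. requests.get(...).json());
    iteration stops at the first empty page."""
    results = []
    page = start
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results

# Fake API standing in for requests: two pages of data, then empty.
pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
print(fetch_all_pages(lambda n: pages.get(n, [])))  # [{'id': 1}, {'id': 2}, {'id': 3}]
```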


8. Avoiding Bot Detection

Many JavaScript-rendered websites detect bots and block them.
🔹 Use random headers & user-agents:

headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get("https://example.com", headers=headers)

🔹 Use headless browsers carefully – Some sites detect them.
🔹 Use proxies & rotating IPs – Helps avoid getting blocked.
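Rotating the user agent per request is a simple first step. A minimal sketch (the UA strings are examples only; refresh them periodically, since stale strings can themselves look bot-like):

```python
import random

# A small pool of realistic desktop user agents (examples only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Build request headers with a randomly chosen user agent,
    e.g. requests.get(url, headers=random_headers())."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

print(random_headers()["User-Agent"] in USER_AGENTS)  # True
```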


9. When to Use Which Tool?

Method          | Best For                              | Pros                             | Cons
Selenium        | Interactive pages, infinite scrolling | Handles JavaScript well          | Slower, high resource usage
Playwright      | Fast headless scraping                | Better performance than Selenium | Requires installation
API Requests    | Extracting JSON from APIs             | Fastest, avoids JavaScript       | Not all sites expose APIs
Scrapy-Selenium | Large-scale scraping                  | Scales well with proxies         | Complex setup
