Handling CAPTCHAs in Scraping

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are challenges that websites use to block automated bots. They can be text-based, image-based, audio-based, or interactive puzzles.

1. Common CAPTCHA Types in Scraping

  1. Text-based CAPTCHAs (Distorted letters/numbers)
  2. Image-based CAPTCHAs (Select objects in images)
  3. reCAPTCHA v2 & v3 (Google’s advanced bot detection)
  4. hCaptcha (Similar to reCAPTCHA; previously Cloudflare's default challenge)
  5. Slider CAPTCHAs (Drag the slider to fit an image)
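
Before choosing a strategy, it helps to know which CAPTCHA a page is serving. The sketch below is a rough heuristic, not a reliable detector: the function name `detect_captcha` and the marker strings are assumptions chosen for illustration, and providers can change their markup at any time.

```python
# Heuristic markers; these strings are assumptions and may change over time.
CAPTCHA_MARKERS = {
    "reCAPTCHA": ("g-recaptcha", "google.com/recaptcha"),
    "hCaptcha": ("h-captcha", "hcaptcha.com"),
    "GeeTest": ("geetest",),
}


def detect_captcha(html):
    """Return the names of CAPTCHA providers whose markers appear in the HTML."""
    lowered = html.lower()
    return [
        name
        for name, markers in CAPTCHA_MARKERS.items()
        if any(marker in lowered for marker in markers)
    ]


sample = '<div class="g-recaptcha" data-sitekey="..."></div>'
print(detect_captcha(sample))
```

An empty list only means that none of the listed markers were found; the page may still be protected by a challenge that loads later via JavaScript.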

2. Strategies to Handle CAPTCHAs

There are multiple ways to bypass CAPTCHAs in web scraping:

Method                          Effectiveness   Difficulty       Usage
Using Anti-CAPTCHA Services     High            Medium           For large-scale scraping
Avoiding CAPTCHA Triggers       High            Easy             Prevents triggering CAPTCHAs
Manual CAPTCHA Solving          High            Time-consuming   For small-scale scraping
Browser Automation (Selenium)   Medium          Slow             Works for interactive CAPTCHAs
OCR-based Solving (Tesseract)   Low             Medium           For text-based CAPTCHAs

3. Using CAPTCHA Solving Services

Paid CAPTCHA-solving services, such as 2Captcha, can automate the process.

Example: Solving reCAPTCHA with 2Captcha

  1. Sign up at 2Captcha and get an API key.
  2. Install the Python library: pip install requests
  3. Solve CAPTCHA using API:
import requests
import time

API_KEY = "your_2captcha_api_key"
site_key = "6Lc_aX0UAAAAANNoYtspdmEN0TkWgxJ_Wy3s9N_v" # Example reCAPTCHA site key
url = "https://example.com"

# Step 1: Send CAPTCHA solving request (params are URL-encoded by requests)
captcha_request = requests.get(
    "https://2captcha.com/in.php",
    params={"key": API_KEY, "method": "userrecaptcha", "googlekey": site_key, "pageurl": url},
)
captcha_id = captcha_request.text.split('|')[1]  # Response looks like "OK|<id>"

# Step 2: Wait for the CAPTCHA to be solved
result_params = {"key": API_KEY, "action": "get", "id": captcha_id}
time.sleep(15)  # Give some time for solving
captcha_response = requests.get("https://2captcha.com/res.php", params=result_params)

while "CAPCHA_NOT_READY" in captcha_response.text:
    time.sleep(5)  # Poll every 5 seconds until the solution is ready
    captcha_response = requests.get("https://2captcha.com/res.php", params=result_params)

captcha_solution = captcha_response.text.split('|')[1]

# Step 3: Submit the CAPTCHA solution with your request
response = requests.post(url, data={"g-recaptcha-response": captcha_solution})
print(response.text)
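
The example above splits the reply on `|` and assumes it always starts with `OK`. A failed request (for instance, a wrong API key) returns an error code with no `|`, so the split raises an `IndexError`. The helper below is a minimal sketch of more defensive parsing; the name `parse_2captcha_reply` and the `(status, value)` return shape are illustrative choices, not part of any 2Captcha library.

```python
def parse_2captcha_reply(text):
    """Classify a plain-text 2Captcha reply.

    Returns (status, value), where status is "ok", "pending", or "error".
    value is the request ID or token for "ok", None for "pending",
    and the raw error code for "error".
    """
    text = text.strip()
    if text.startswith("OK|"):
        return "ok", text.split("|", 1)[1]
    if text == "CAPCHA_NOT_READY":
        return "pending", None
    return "error", text


print(parse_2captcha_reply("OK|2122988149"))
print(parse_2captcha_reply("CAPCHA_NOT_READY"))
print(parse_2captcha_reply("ERROR_WRONG_USER_KEY"))
```

With a helper like this, the polling loop can stop as soon as the status is "error" instead of waiting indefinitely or crashing on the split.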

4. Avoiding CAPTCHA Triggers

To prevent triggering CAPTCHAs, follow these best practices:

Use Rotating User-Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
]

# Pick a User-Agent at random for each request
headers = {"User-Agent": random.choice(USER_AGENTS)}

Use Proxy Rotation

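import random
import requests

PROXIES = ["http://proxy1:port", "http://proxy2:port"]

# Set both keys: an "http"-only mapping is ignored for https:// URLs
chosen = random.choice(PROXIES)
proxy = {"http": chosen, "https": chosen}
response = requests.get("https://example.com", proxies=proxy)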
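
A random choice can pick the same proxy several times in a row. If an even spread matters, a round-robin pool is one alternative. The sketch below uses `itertools.cycle`; the proxy addresses and the helper name `next_proxy` are placeholders for illustration.

```python
from itertools import cycle

# Placeholder proxy addresses (hypothetical); replace with real ones.
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]
_proxy_pool = cycle(PROXIES)


def next_proxy():
    """Return a requests-style proxies dict for the next proxy in the pool."""
    proxy = next(_proxy_pool)
    return {"http": proxy, "https": proxy}


for _ in range(3):
    print(next_proxy())
```

Each returned dict can be passed directly as the `proxies=` argument of `requests.get`.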

Limit Requests Per Second (Throttle Requests)

import time
time.sleep(3) # Wait 3 seconds between requests
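
A fixed 3-second pause produces a very regular request pattern. Adding random jitter is a common variation; the sketch below assumes a base delay plus a uniformly distributed extra. The function name `jittered_delay` and the default values are illustrative, and the demo loop uses very small delays only so that it runs quickly.

```python
import random
import time


def jittered_delay(base=3.0, jitter=1.5):
    """Return a delay of `base` seconds plus a random extra of up to `jitter` seconds."""
    return base + random.uniform(0, jitter)


for page in range(3):
    delay = jittered_delay(base=0.1, jitter=0.05)  # Small values so the demo is fast
    time.sleep(delay)
    print(f"page {page}: waited {delay:.2f}s")
```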

Use Session Persistence

import requests
session = requests.Session()
session.get("https://example.com", headers=headers)

5. Using Browser Automation (Selenium) for reCAPTCHA v2

For interactive CAPTCHAs, Selenium with Undetected ChromeDriver works well.

  1. Install Selenium: pip install selenium undetected-chromedriver
  2. Solve reCAPTCHA using Selenium:
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import time

# Start an undetectable browser session
driver = uc.Chrome()
driver.get("https://example.com")

# Find the CAPTCHA iframe and click the checkbox
iframe = driver.find_element(By.XPATH, "//iframe[contains(@src, 'recaptcha')]")
driver.switch_to.frame(iframe)
driver.find_element(By.CLASS_NAME, "recaptcha-checkbox").click()

time.sleep(15) # Give time for CAPTCHA to solve manually

driver.switch_to.default_content() # Switch back to main page

6. Solving Text-based CAPTCHAs with OCR (Tesseract OCR)

For simple text CAPTCHAs, use Tesseract OCR:

  1. Install dependencies: pip install pytesseract pillow
  2. Solve CAPTCHA from an image:
import pytesseract
from PIL import Image

# Load CAPTCHA image
image = Image.open("captcha.png")

# Convert image to text
captcha_text = pytesseract.image_to_string(image)
print("CAPTCHA Text:", captcha_text)

This works for simple, clean CAPTCHAs but usually fails on heavily distorted text.
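
OCR output often carries stray whitespace, line breaks, or punctuation that would make a submitted answer fail. A small cleanup step can help. The sketch below assumes the CAPTCHA contains only ASCII letters and digits; the function name `clean_captcha_text` and the `allowed` parameter are illustrative.

```python
import re


def clean_captcha_text(raw, allowed=r"A-Za-z0-9"):
    """Remove every character outside the allowed set (whitespace, stray punctuation)."""
    return re.sub(f"[^{allowed}]", "", raw)


print(clean_captcha_text(" aB 3x-9\n\x0c"))
```

If the target site uses a different alphabet (for example, digits only), narrowing `allowed` to that set also filters out some misread characters.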


7. Handling Slider CAPTCHAs (Cloudflare & GeeTest)

For slider CAPTCHAs, Selenium ActionChains can simulate dragging:

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

slider = driver.find_element(By.CLASS_NAME, "slider")
action = ActionChains(driver)
action.click_and_hold(slider).move_by_offset(200, 0).release().perform()

You may need to fine-tune the offset for each site.
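
A single 200-pixel jump looks nothing like a human drag. One common variation is to split the distance into several smaller, uneven moves. The helper below is a sketch of that idea only; `drag_steps` is a hypothetical name, and the weighting scheme is an arbitrary choice rather than a known-good recipe for any particular site.

```python
import random


def drag_steps(total_offset, steps=5, rng=None):
    """Split total_offset into `steps` uneven integer moves that sum to total_offset."""
    rng = rng or random.Random()
    weights = [rng.uniform(0.5, 1.5) for _ in range(steps)]
    scale = total_offset / sum(weights)
    moves = [int(w * scale) for w in weights[:-1]]
    moves.append(total_offset - sum(moves))  # Last move absorbs the rounding error
    return moves


moves = drag_steps(200, steps=5, rng=random.Random(0))
print(moves, sum(moves))
```

With Selenium, each value can be passed to `move_by_offset(dx, 0)` between `click_and_hold(slider)` and `release()`, optionally with a short `pause()` between moves.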


8. Cloudflare Bypass with Cloudscraper

Cloudflare often blocks bots. Cloudscraper helps bypass it:

  1. Install it: pip install cloudscraper
  2. Use it instead of requests:
import cloudscraper
scraper = cloudscraper.create_scraper()
response = scraper.get("https://example.com")
print(response.text)
