CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are challenge-response tests that websites use to block automated bots. They can be text-based, image-based, audio-based, or interactive puzzles.
Common CAPTCHA Types in Scraping
- Text-based CAPTCHAs (Distorted letters/numbers)
- Image-based CAPTCHAs (Select objects in images)
- reCAPTCHA v2 & v3 (Google’s bot detection; v2 shows a checkbox or image challenge, v3 scores visitors in the background)
- hCaptcha (Similar to reCAPTCHA; formerly the default challenge on Cloudflare)
- Slider CAPTCHAs (Drag the slider to fit an image)
2. Strategies to Handle CAPTCHAs
There are multiple ways to bypass CAPTCHAs in web scraping:
Method | Effectiveness | Effort | Usage |
---|---|---|---|
Using Anti-CAPTCHA Services | High | Medium | For large-scale scraping |
Avoiding CAPTCHA Triggers | High | Easy | Prevents triggering CAPTCHAs |
Manual CAPTCHA Solving | High | Time-consuming | For small-scale scraping |
Browser Automation (Selenium) | Medium | Slow | Works for interactive CAPTCHAs |
OCR-based Solving (Tesseract) | Low | Medium | For text-based CAPTCHAs |
3. Using CAPTCHA Solving Services
Paid CAPTCHA-solving services can automate the process. Some popular ones include:
- 2Captcha (https://2captcha.com/)
- Anti-Captcha (https://anti-captcha.com/)
- DeathByCaptcha (https://deathbycaptcha.com/)
Example: Solving reCAPTCHA with 2Captcha
- Sign up at 2Captcha and get an API key.
- Install the Python library:
pip install requests
- Solve CAPTCHA using API:
import time
import requests
API_KEY = "your_2captcha_api_key"
site_key = "6Lc_aX0UAAAAANNoYtspdmEN0TkWgxJ_Wy3s9N_v"  # Example reCAPTCHA site key
url = "https://example.com"
# Step 1: Send CAPTCHA solving request (params= takes care of URL encoding)
captcha_request = requests.get(
    "https://2captcha.com/in.php",
    params={"key": API_KEY, "method": "userrecaptcha", "googlekey": site_key, "pageurl": url},
    timeout=30,
)
# A successful reply looks like "OK|<id>"; anything else is an error code
if not captcha_request.text.startswith("OK|"):
    raise RuntimeError(f"2Captcha rejected the request: {captcha_request.text}")
captcha_id = captcha_request.text.split("|", 1)[1]
# Step 2: Poll until the CAPTCHA is solved, giving up after about two minutes
result_params = {"key": API_KEY, "action": "get", "id": captcha_id}
time.sleep(15)  # Give some time for solving
for _ in range(24):
    captcha_response = requests.get("https://2captcha.com/res.php", params=result_params, timeout=30)
    if "CAPCHA_NOT_READY" not in captcha_response.text:
        break
    time.sleep(5)
if not captcha_response.text.startswith("OK|"):
    raise RuntimeError(f"CAPTCHA was not solved: {captcha_response.text}")
captcha_solution = captcha_response.text.split("|", 1)[1]
# Step 3: Submit the CAPTCHA solution with your request
response = requests.post(url, data={"g-recaptcha-response": captcha_solution}, timeout=30)
print(response.text)
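The example above relies on a plain-text reply format: `OK|<value>` on success and `CAPCHA_NOT_READY` while the solver is still working. A minimal sketch of a parser for those replies follows. The helper name `parse_reply` is hypothetical, and treating every other reply as an error is an assumption made for illustration.

```python
from typing import Optional, Tuple

NOT_READY = "CAPCHA_NOT_READY"  # spelling as it appears in the example above


def parse_reply(text: str) -> Tuple[str, Optional[str]]:
    """Classify a reply as ("ok", value), ("pending", None) or ("error", text).

    parse_reply is a hypothetical helper, not a function of any library.
    """
    text = text.strip()
    if text.startswith("OK|"):
        # Split only once so a value that itself contains "|" is kept whole
        return "ok", text.split("|", 1)[1]
    if text == NOT_READY:
        return "pending", None
    # Assumption: anything else is an error code returned by the service
    return "error", text


print(parse_reply("OK|1234567890"))
print(parse_reply("CAPCHA_NOT_READY"))
print(parse_reply("ERROR_ZERO_BALANCE"))  # illustrative error string
```

Checking the status before indexing into the split result avoids an IndexError when the service returns an error code instead of `OK|<value>`.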
4. Avoiding CAPTCHA Triggers
To prevent triggering CAPTCHAs, follow these best practices:
✔ Use Rotating User-Agents
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
]
# Pick a different User-Agent string for each request
headers = {"User-Agent": random.choice(USER_AGENTS)}
✔ Use Proxy Rotation
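import random
import requests
PROXIES = ["http://proxy1:port", "http://proxy2:port"]
chosen = random.choice(PROXIES)
# requests selects the proxy by the scheme of the target URL, so an https:// target needs an "https" entry
proxy = {"http": chosen, "https": chosen}
response = requests.get("https://example.com", proxies=proxy)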
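Random choice can pick the same proxy several times in a row. A round-robin rotation is one alternative, sketched here under the assumption that every proxy should serve both `http` and `https` targets. The generator name `proxy_rotation` is hypothetical, and the proxy URLs are the placeholders from the snippet above.

```python
from itertools import cycle
from typing import Dict, Iterator, List

PROXIES = ["http://proxy1:port", "http://proxy2:port"]  # placeholders, not real proxies


def proxy_rotation(proxies: List[str]) -> Iterator[Dict[str, str]]:
    """Yield proxies mappings in round-robin order, covering both URL schemes."""
    for proxy_url in cycle(proxies):
        # The mapping is keyed by the scheme of the target URL
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXIES)
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first proxy again
print(first, second, third)
```

Each request would then take the next mapping, for example `requests.get(url, proxies=next(rotation))`.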
✔ Limit Requests Per Second (Throttle Requests)
import time
time.sleep(3) # Wait 3 seconds between requests
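A fixed pause like the one above also sleeps the full three seconds when the previous request was already slow. The sketch below waits only for the remaining part of the gap and adds a small random jitter. The class name `Throttle`, its parameters, and the injectable `clock` and `sleep` arguments are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum, slightly randomized gap between calls to wait()."""

    def __init__(self, min_delay: float = 3.0, jitter: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, None before the first

    def wait(self) -> float:
        """Sleep just long enough to honor the gap; return the seconds slept."""
        pause = 0.0
        if self._last is not None:
            target = self.min_delay + random.uniform(0.0, self.jitter)
            elapsed = self._clock() - self._last
            pause = max(0.0, target - elapsed)
        if pause > 0:
            self._sleep(pause)
        self._last = self._clock()
        return pause


# Short delays here only so the demonstration finishes quickly
throttle = Throttle(min_delay=0.05, jitter=0.01)
first_pause = throttle.wait()   # first call never sleeps
second_pause = throttle.wait()  # second call waits out the remaining gap
print(first_pause, second_pause)
```

Calling `throttle.wait()` immediately before each request keeps the spacing logic in one place instead of scattering `time.sleep` calls through the scraper.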
✔ Use Session Persistence
import requests
session = requests.Session()
session.get("https://example.com", headers=headers)
5. Using Browser Automation (Selenium) for reCAPTCHA v2
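For interactive CAPTCHAs such as the reCAPTCHA v2 checkbox, Selenium with undetected-chromedriver can open the page in a real browser, click the checkbox, and leave time for any follow-up challenge to be solved by hand.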
- Install Selenium:
pip install selenium undetected-chromedriver
- Solve reCAPTCHA using Selenium:
import time
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
# Start a browser session with undetected-chromedriver
driver = uc.Chrome()
driver.get("https://example.com")
wait = WebDriverWait(driver, 20)  # wait up to 20 seconds for each element
# Wait for the CAPTCHA iframe, switch into it, and click the checkbox
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[contains(@src, 'recaptcha')]")))
wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "recaptcha-checkbox"))).click()
time.sleep(15)  # Leave time to solve any image challenge by hand
driver.switch_to.default_content()  # Switch back to main page
6. Solving Text-based CAPTCHAs with OCR (Tesseract OCR)
For simple text CAPTCHAs, use Tesseract OCR:
- Install dependencies (pytesseract is only a wrapper, so the Tesseract binary must also be installed on the system):
pip install pytesseract pillow
- Solve CAPTCHA from an image:
import pytesseract
from PIL import Image
# Load CAPTCHA image
image = Image.open("captcha.png")
# Convert image to text
captcha_text = pytesseract.image_to_string(image).strip()  # drop surrounding whitespace
print("CAPTCHA Text:", captcha_text)
This works for clean, simple CAPTCHAs but usually fails on distorted text.
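Raw OCR output often carries spaces, line breaks, or stray punctuation that would make a submitted answer fail. The sketch below normalizes it before use. It assumes the CAPTCHA contains only ASCII letters and digits, and the helper name `clean_ocr_text` and its `expected_length` parameter are hypothetical.

```python
import re


def clean_ocr_text(raw: str, expected_length: int = 0) -> str:
    """Strip whitespace and non-alphanumeric characters from OCR output.

    Returns "" when expected_length is given and the cleaned text does not
    match it, so the caller can request a fresh CAPTCHA image instead.
    """
    # Assumption: valid answers consist of ASCII letters and digits only
    cleaned = re.sub(r"[^A-Za-z0-9]", "", raw)
    if expected_length and len(cleaned) != expected_length:
        return ""
    return cleaned


print(clean_ocr_text("  x7 Kp-2\n\x0c"))
print(clean_ocr_text("x7Kp2", expected_length=6) == "")
```

Rejecting results of the wrong length is a cheap way to avoid submitting an answer that is already known to be wrong.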
7. Handling Slider CAPTCHAs (Cloudflare & GeeTest)
For slider CAPTCHAs, Selenium ActionChains can simulate dragging:
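from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
# Reuses the driver session started in section 5
slider = driver.find_element(By.CLASS_NAME, "slider")  # class name varies by site
action = ActionChains(driver)
# Press the handle, drag it 200 px to the right, and release
action.click_and_hold(slider).move_by_offset(200, 0).release().perform()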
May require fine-tuning offsets for different sites.
8. Cloudflare Bypass with Cloudscraper
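Cloudflare often blocks bots with a JavaScript challenge page. Cloudscraper, a drop-in wrapper around requests, attempts to solve that challenge automatically, though it may not get past newer Cloudflare checks: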
- Install it:
pip install cloudscraper
- Use it instead of requests:
import cloudscraper
scraper = cloudscraper.create_scraper()
response = scraper.get("https://example.com")
print(response.text)