Handling Rate Limiting When Web Scraping

Rate limiting is how websites cap the number of requests you can send. This article shows how to handle rate limits effectively.

What Is Rate Limiting?

A limit on the number of requests a single IP address or account can make within a given period. Example: 100 requests per minute.
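
To make the "100 requests per minute" idea concrete, here is a minimal sketch of a client-side sliding-window limiter that keeps a scraper under such a cap. The class name `SlidingWindowLimiter` and its parameters are illustrative assumptions, not part of any library, and a real site may count requests differently.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: allow at most `max_requests` per `window` seconds."""

    def __init__(self, max_requests=100, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock         # injectable so the class can be tested without sleeping
        self.timestamps = deque()  # send times of requests still inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next request is allowed."""
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            return 0.0
        # The oldest request must age out before another one fits.
        return self.window - (now - self.timestamps[0])

    def record(self):
        """Record that a request was just sent."""
        self.timestamps.append(self.clock())
```

In a scraping loop, one would call `time.sleep(limiter.wait_time())` before each request and `limiter.record()` right after sending it.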

Signs You Are Being Rate Limited

  • HTTP 429: Too Many Requests
  • HTTP 503: Service Unavailable
  • Retry-After header: tells you how long to wait before retrying
  • CAPTCHA: a challenge page appears instead of the content
  • Empty responses: no data is returned
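
The signs above can be checked in code before a response is parsed. The following is a rough heuristic sketch; the function name `looks_rate_limited` and the plain substring match on "captcha" are assumptions made for illustration, and a 503 or an empty body can also have causes unrelated to rate limiting.

```python
def looks_rate_limited(status_code, headers, body):
    """Heuristic check; returns a short reason string, or None if nothing looks wrong."""
    if status_code == 429:
        return "HTTP 429"
    if status_code == 503:
        return "HTTP 503"
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower() for name in headers}
    if "retry-after" in lowered:
        return "Retry-After header"
    # Crude CAPTCHA detection: real challenge pages vary by site.
    if "captcha" in body.lower():
        return "CAPTCHA"
    if not body.strip():
        return "empty response"
    return None
```

With `requests`, this could be called as `looks_rate_limited(response.status_code, response.headers, response.text)`.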

Handling Strategies

1. Respectful Delays

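import random
import time

import requests

def scrape_with_delay(urls, process):
    """Fetch each URL in turn, pausing between requests."""
    for url in urls:
        response = requests.get(url, timeout=10)
        process(response)  # caller-supplied handler for each response

        # Random delay of 1-3 seconds so requests are not evenly spaced
        time.sleep(random.uniform(1, 3))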

2. Exponential Backoff

import time

import requests

def request_with_backoff(url, max_retries=5):
    """GET url, waiting exponentially longer after each HTTP 429."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)

        if response.status_code != 429:
            return response

        wait_time = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
        print(f"Rate limited. Waiting {wait_time}s...")
        time.sleep(wait_time)

    raise RuntimeError("Max retries exceeded")
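
A common refinement, not part of the code above, is to add random jitter and an upper cap to the backoff so that many workers do not all retry at the same instant. The sketch below uses the "full jitter" variant, where the wait is drawn uniformly between zero and the exponential ceiling; the function name `backoff_delay` and its defaults are assumptions for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based), with full jitter."""
    # Exponential ceiling: base, 2*base, 4*base, ... but never above `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Pick a random point below the ceiling (rng returns a value in [0, 1)).
    return rng() * ceiling
```

It could replace the fixed `2 ** attempt` wait, for example `time.sleep(backoff_delay(attempt))`.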

3. Respect Retry-After

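import time

import requests

response = requests.get(url, timeout=10)  # url: the page being requested

if response.status_code == 429:
    # Assumes the header holds a number of seconds; it may also be an HTTP date
    retry_after = int(response.headers.get('Retry-After', 60))
    print(f"Waiting {retry_after}s as requested...")
    time.sleep(retry_after)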
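
The Retry-After header can carry either a number of seconds or an HTTP date, and calling `int()` on a date raises `ValueError`. Here is a minimal sketch that accepts both forms using only the standard library; the helper name `parse_retry_after` and the 60-second fallback are assumptions chosen to match the snippet above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Convert a Retry-After header value to a non-negative number of seconds."""
    if value is None:
        return default
    value = value.strip()
    # Form 1: delay in seconds, e.g. "120".
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further wait is needed.
    return max(0.0, (when - now).total_seconds())
```

Typical use would be `time.sleep(parse_retry_after(response.headers.get('Retry-After')))`.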

4. IP Rotation

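import requests

proxies = [
    'http://proxy1.vinaproxy.com:8080',
    'http://proxy2.vinaproxy.com:8080',
    'http://proxy3.vinaproxy.com:8080'
]

# urls: the list of pages to scrape
for i, url in enumerate(urls):
    proxy_url = proxies[i % len(proxies)]  # round-robin over the pool
    # Set both schemes; with only 'http', HTTPS pages would bypass the proxy
    proxy = {'http': proxy_url, 'https': proxy_url}
    response = requests.get(url, proxies=proxy, timeout=10)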
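
Plain round-robin keeps sending traffic through a proxy that has just been rate limited. One possible extension, sketched here under assumptions, is a small pool that rests a proxy for a cooldown period after it hits a limit. The class `ProxyPool`, its method names, and the 60-second cooldown are hypothetical and not part of `requests` or any proxy provider's API.

```python
import time


class ProxyPool:
    """Round-robin proxy selection that skips proxies which are cooling down."""

    def __init__(self, proxy_urls, cooldown=60.0, clock=time.monotonic):
        self.proxy_urls = list(proxy_urls)
        self.cooldown = cooldown
        self.clock = clock        # injectable for testing
        self.blocked_until = {}   # proxy URL -> time when it may be used again
        self.index = 0

    def next_proxy(self):
        """Return the next usable proxy URL, or None if all are cooling down."""
        now = self.clock()
        for _ in range(len(self.proxy_urls)):
            candidate = self.proxy_urls[self.index % len(self.proxy_urls)]
            self.index += 1
            if self.blocked_until.get(candidate, 0.0) <= now:
                return candidate
        return None

    def mark_limited(self, proxy_url):
        """Rest a proxy for `cooldown` seconds after it was rate limited."""
        self.blocked_until[proxy_url] = self.clock() + self.cooldown
```

A caller would pass the returned URL as both the `http` and `https` entries of the `proxies` dict, and call `mark_limited` whenever a response comes back with status 429.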

5. Concurrent Limiting

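import asyncio

import aiohttp

MAX_CONCURRENT = 5  # Max 5 concurrent requests

async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as response:
            body = await response.text()
        await asyncio.sleep(0.5)  # hold the slot briefly to space requests out
        return body

async def fetch_all(urls):
    # Create the semaphore inside the running event loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))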

Rate Limits by Platform

The figures below are approximate.

Platform     Typical Limit   Penalty
Google       ~100/min        CAPTCHA, block
Amazon       ~50/min         CAPTCHA
LinkedIn     Very strict     Account ban
Twitter/X    API limits      Temporary ban

Best Practices

  • Start slow, increase gradually
  • Monitor success rate
  • Use rotating proxies
  • Cache results to reduce the number of requests
  • Scrape during off-peak hours
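
"Start slow, increase gradually" and "monitor success rate" can be combined in a small controller that shortens the delay after each success and lengthens it after each failure. This is a sketch under assumptions: the class name `AdaptiveDelay`, the 0.9 and 2.0 factors, and the default bounds are illustrative choices, not recommended values for any particular site.

```python
class AdaptiveDelay:
    """Track success rate and adapt the pause between requests."""

    def __init__(self, start=3.0, minimum=0.5, maximum=60.0):
        self.delay = start      # begin with a conservative (slow) delay
        self.minimum = minimum
        self.maximum = maximum
        self.total = 0
        self.successes = 0

    def record(self, success):
        """Update counters and the delay after one request."""
        self.total += 1
        if success:
            self.successes += 1
            # Speed up gently while things are going well.
            self.delay = max(self.minimum, self.delay * 0.9)
        else:
            # Back off sharply when a request is rejected.
            self.delay = min(self.maximum, self.delay * 2.0)

    def success_rate(self):
        """Fraction of recorded requests that succeeded (1.0 if none yet)."""
        return self.successes / self.total if self.total else 1.0
```

In a scraping loop, one would sleep for `controller.delay` seconds between requests and call `controller.record(response.status_code == 200)` after each one.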

VinaProxy: Getting Past Rate Limits

  • Large pool of IPs to rotate through
  • Trusted residential IPs
  • Only $0.5/GB
