Handling Rate Limiting When Web Scraping

Rate limiting is how websites cap the number of requests you can send. This article shows how to handle rate limits effectively.

What Is Rate Limiting?

A rate limit caps the number of requests a single IP address or account can send within a given time window. For example: 100 requests per minute.
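
To make the idea concrete, here is a minimal sketch of how a server might enforce a limit like 100 requests per minute with a sliding window. The class name `SlidingWindowLimiter`, its parameters, and the example IP are assumptions for illustration; real sites use a variety of schemes (fixed windows, token buckets, and others) and do not publish them.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each client key."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # client key (e.g. an IP address) -> deque of timestamps

    def allow(self, key, now):
        q = self.hits.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # a real server would answer with HTTP 429 here
        q.append(now)
        return True


# Hypothetical client sending one request every 0.1 seconds.
limiter = SlidingWindowLimiter(limit=100, window=60.0)
results = [limiter.allow("203.0.113.7", now=i * 0.1) for i in range(101)]
```

In this sketch the first 100 requests pass and the 101st is refused until the oldest request ages out of the window, which is why spacing requests out is more effective than sending them in bursts.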

Signs You Are Being Rate Limited

HTTP 429: Too Many Requests
HTTP 503: Service Unavailable
Retry-After header: how long to wait before retrying
CAPTCHA: a challenge page appears
Empty responses: no data is returned
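
The signs above can be checked in code before you parse a page. The sketch below is a rough heuristic, not a definitive detector: the function name `rate_limit_signal` and its labels are made up for this example, a 503 can also mean a genuine outage, and searching the body for the word "captcha" is a crude assumption that will miss some challenge pages and misfire on others.

```python
def rate_limit_signal(status_code, headers, body):
    """Return a short label for the first rate-limit sign found, or None."""
    header_names = {name.lower() for name in headers}
    if status_code == 429:
        return "too_many_requests"
    if status_code == 503:
        return "service_unavailable"
    if "retry-after" in header_names:
        return "retry_after"
    if "captcha" in body.lower():
        return "captcha"  # crude text match, assumed for illustration
    if not body.strip():
        return "empty_response"
    return None
```

With the requests library you would call it as `rate_limit_signal(response.status_code, response.headers, response.text)`.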

Handling Strategies

1. Respectful Delays

import random
import time

import requests

def scrape_with_delay(urls):
    for url in urls:
        response = requests.get(url)
        process(response)  # process() stands for your own parsing function
        
        # Random delay of 1-3 seconds between requests
        time.sleep(random.uniform(1, 3))

2. Exponential Backoff

import time

import requests

def request_with_backoff(url, max_retries=5):
    for attempt in range(max_retries):
        response = requests.get(url)
        
        if response.status_code == 429:
            wait_time = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
            continue
        
        return response
    
    raise Exception("Max retries exceeded")
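
The fixed 1, 2, 4, 8, 16 second schedule above works, but if several workers get limited at the same moment they all retry at the same moment too. A common refinement, not part of the original snippet, is to add random jitter and cap the maximum wait. The sketch below uses the "full jitter" variant; the name `backoff_delay` and the default `base` and `cap` values are assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay: a random value between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

To use it, replace `wait_time = 2 ** attempt` in the loop with `wait_time = backoff_delay(attempt)`.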

3. Respect Retry-After

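import time
import requests

response = requests.get(url)

if response.status_code == 429:
    # Retry-After may be a number of seconds or an HTTP date; int() handles
    # only the numeric form, so fall back to 60 seconds otherwise.
    value = response.headers.get('Retry-After', '')
    retry_after = int(value) if value.isdigit() else 60
    print(f"Waiting {retry_after}s as requested...")
    time.sleep(retry_after)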
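
The Retry-After header can carry either a number of seconds or an HTTP date such as `Wed, 21 Oct 2015 07:28:00 GMT`. A helper that accepts both forms might look like the sketch below. The name `parse_retry_after`, the 60-second default, and the `now` parameter (included so the function is easy to test) are assumptions for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Convert a Retry-After header value to a non-negative number of seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (target - now).total_seconds())
```

Call it as `time.sleep(parse_retry_after(response.headers.get('Retry-After')))`.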

4. IP Rotation

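import requests

proxies = [
    'http://proxy1.vinaproxy.com:8080',
    'http://proxy2.vinaproxy.com:8080',
    'http://proxy3.vinaproxy.com:8080'
]

for i, url in enumerate(urls):
    # Set both keys so HTTPS URLs also go through the chosen proxy
    proxy_url = proxies[i % len(proxies)]
    proxy = {'http': proxy_url, 'https': proxy_url}
    response = requests.get(url, proxies=proxy)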
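
Plain round-robin keeps sending traffic through a proxy even after the target has started limiting it. One possible refinement, sketched below, is to rest a proxy for a while after it gets a 429. The `ProxyPool` class, its method names, and the 60-second cooldown are assumptions for illustration, not a VinaProxy feature.

```python
class ProxyPool:
    """Round-robin over proxies, skipping any that are cooling down."""

    def __init__(self, proxy_urls, cooldown=60.0):
        self.proxy_urls = list(proxy_urls)
        self.cooldown = cooldown
        self.blocked_until = {}  # proxy URL -> time when it may be used again
        self._next = 0

    def mark_limited(self, proxy_url, now):
        """Record that this proxy was rate limited at time `now`."""
        self.blocked_until[proxy_url] = now + self.cooldown

    def get(self, now):
        """Return the next usable proxy URL, or None if all are cooling down."""
        for _ in range(len(self.proxy_urls)):
            proxy_url = self.proxy_urls[self._next]
            self._next = (self._next + 1) % len(self.proxy_urls)
            if self.blocked_until.get(proxy_url, 0.0) <= now:
                return proxy_url
        return None
```

In a scraper you would pass `time.monotonic()` as `now`, call `get()` before each request, and call `mark_limited()` whenever a response comes back with status 429.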

5. Concurrent Limiting

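import asyncio

import aiohttp

async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as response:
            body = await response.text()
        await asyncio.sleep(0.5)  # brief pause before freeing the slot
        return body

async def main(urls):
    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)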

Rate Limit Per Platform

Platform	Typical Limit	Penalty
Google	~100/min	CAPTCHA, block
Amazon	~50/min	CAPTCHA
LinkedIn	Very strict	Account ban
Twitter/X	API limits	Temp ban

Best Practices

Start slow, increase gradually
Monitor success rate
Use rotating proxies
Cache results to reduce the number of requests
Scrape off-peak hours
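
The first two practices, starting slow and watching the success rate, can be combined into a delay that adjusts itself. The sketch below shortens the delay a little after each success and doubles it after each failure. The class name `AdaptiveDelay` and the specific factors (0.9 and 2) are assumptions chosen for illustration, not tuned values.

```python
class AdaptiveDelay:
    """Shrink the delay slowly on success and grow it quickly on failure."""

    def __init__(self, start=3.0, minimum=0.5, maximum=60.0):
        self.delay = start
        self.minimum = minimum
        self.maximum = maximum

    def record(self, success):
        """Update the delay from the latest request outcome and return it."""
        if success:
            self.delay = max(self.minimum, self.delay * 0.9)
        else:
            self.delay = min(self.maximum, self.delay * 2)
        return self.delay
```

After each request, call `record(response.status_code == 200)` and sleep for the returned number of seconds before sending the next one.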

VinaProxy: Get Past Rate Limits

Large pool of IPs to rotate through
Trusted residential IPs
Only $0.5/GB
