Handling Rate Limiting When Web Scraping
Rate limiting is how websites cap the number of requests you can send. This article shows how to handle rate limits effectively.
What Is Rate Limiting?
A rate limit caps the number of requests a single IP address or account may send within a time window, for example 100 requests per minute.
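To make the "100 requests per minute" idea concrete, here is a minimal sketch of how a server might count requests per client. The class name `SlidingWindowLimiter` and its parameters are illustrative assumptions; real sites use their own, usually undisclosed, logic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any `window` seconds per client key."""

    def __init__(self, limit=100, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = {}  # client key (IP or account) -> deque of timestamps

    def allow(self, key, now):
        """Return True if a request from `key` at time `now` is within the limit."""
        timestamps = self.hits.setdefault(key, deque())
        # Drop timestamps that have fallen out of the window
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # a real server would typically answer with HTTP 429
        timestamps.append(now)
        return True
```

With `limit=3`, the fourth request from one IP inside the same minute is rejected, while a request from a different IP is still accepted. This is why both pacing and IP rotation affect how often a scraper hits the limit.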
Signs You Are Being Rate Limited
- HTTP 429: Too Many Requests
- HTTP 503: Service Unavailable
- Retry-After header: how long to wait before retrying
- CAPTCHA: a challenge page appears
- Empty responses: no data is returned
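The signs above can be checked in code before a response is parsed. The following is a rough heuristic sketch, not a definitive detector: the function name `looks_rate_limited` is hypothetical, and treating a 503 as rate limiting only when a Retry-After header is present is an assumption made here to avoid misreading ordinary outages.

```python
def looks_rate_limited(status_code, headers, body):
    """Heuristically decide whether a response indicates rate limiting."""
    # Header names are case-insensitive, so normalise them first
    header_names = {name.lower() for name in headers}
    if status_code == 429:
        return True
    # Assumption: a bare 503 may be a plain outage, so require Retry-After
    if status_code == 503 and "retry-after" in header_names:
        return True
    text = body.lower()
    if "captcha" in text:
        return True
    # A 200 response with no content can mean the data was withheld
    if status_code == 200 and not text.strip():
        return True
    return False
```

With the `requests` library, a call would look like `looks_rate_limited(response.status_code, response.headers, response.text)`.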
Handling Strategies
1. Respectful Delays

    import random
    import time

    import requests

    def scrape_with_delay(urls):
        for url in urls:
            response = requests.get(url, timeout=10)
            process(response)  # process() is your own parsing function
            # Random delay of 1-3 seconds between requests
            time.sleep(random.uniform(1, 3))

2. Exponential Backoff

    import time

    import requests

    def request_with_backoff(url, max_retries=5):
        for attempt in range(max_retries):
            response = requests.get(url, timeout=10)
            if response.status_code == 429:
                wait_time = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
                continue
            return response
        raise RuntimeError("Max retries exceeded")

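A common refinement of exponential backoff is to cap the wait and add random jitter, so that many workers do not all retry at the same moment. The sketch below only computes the wait times. The helper name `backoff_delays` and its `cap` and `jitter` parameters are assumptions for illustration and are not part of the function above.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=60.0, jitter=True, rng=random):
    """Return the list of waits (in seconds) to use before each retry."""
    delays = []
    for attempt in range(max_retries):
        # Doubling delay (1, 2, 4, 8, 16, ...) that never exceeds `cap`
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # "Full jitter": pick a random wait between 0 and the computed delay
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays
```

With `jitter=False` the result matches the 1, 2, 4, 8, 16 second schedule used above. With jitter enabled, each wait is a random value up to that schedule.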
3. Respect Retry-After
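
    response = requests.get(url, timeout=10)
    if response.status_code == 429:
        try:
            # Retry-After is usually a number of seconds
            retry_after = int(response.headers.get('Retry-After', 60))
        except ValueError:
            retry_after = 60  # header held an HTTP date instead; use the default
        print(f"Waiting {retry_after}s as requested...")
        time.sleep(retry_after)
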
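The Retry-After header can carry either a number of seconds or an HTTP date, so a plain `int()` conversion does not cover every case. Here is a minimal sketch of a parser that accepts both forms using only the standard library. The function name `parse_retry_after` and the 60-second default are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, default=60.0, now=None):
    """Convert a Retry-After header value into a wait time in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable value: fall back to the default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means there is no need to wait
    return max(0.0, (retry_at - now).total_seconds())
```

A typical call would be `time.sleep(parse_retry_after(response.headers.get("Retry-After")))`.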
4. IP Rotation
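
    import requests

    proxies = [
        'http://proxy1.vinaproxy.com:8080',
        'http://proxy2.vinaproxy.com:8080',
        'http://proxy3.vinaproxy.com:8080',
    ]

    for i, url in enumerate(urls):
        proxy_url = proxies[i % len(proxies)]  # round-robin over the pool
        # Set both schemes so HTTPS requests also go through the proxy
        proxy = {'http': proxy_url, 'https': proxy_url}
        response = requests.get(url, proxies=proxy, timeout=10)
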
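Round-robin rotation keeps sending traffic through a proxy even after that proxy has been blocked. One possible extension, sketched below, skips proxies that have been marked as failed. The `ProxyRotator` class and its method names are hypothetical, and the proxy URLs in the usage line are placeholders.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies round-robin, skipping any marked as failed."""

    def __init__(self, proxy_urls):
        self.proxy_urls = list(proxy_urls)
        self.failed = set()
        self._cycle = cycle(self.proxy_urls)

    def next_proxy(self):
        """Return a requests-style proxies dict for the next healthy proxy."""
        # Look at each proxy at most once per call
        for _ in range(len(self.proxy_urls)):
            url = next(self._cycle)
            if url not in self.failed:
                # Same proxy for both schemes so HTTPS traffic is proxied too
                return {"http": url, "https": url}
        raise RuntimeError("No healthy proxies left")

    def mark_failed(self, url):
        """Exclude a proxy from rotation, e.g. after repeated 429 responses."""
        self.failed.add(url)
```

For example, `rotator = ProxyRotator(["http://a:8080", "http://b:8080"])` followed by `requests.get(url, proxies=rotator.next_proxy())` sends each request through the next proxy that has not been marked as failed.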
5. Concurrent Limiting
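
    import asyncio

    import aiohttp

    semaphore = asyncio.Semaphore(5)  # Max 5 concurrent requests

    async def fetch(session, url):
        async with semaphore:
            async with session.get(url) as response:
                body = await response.text()  # read before the connection is released
            await asyncio.sleep(0.5)  # brief pause before freeing the slot
            return body

    async def fetch_all(urls):
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, url) for url in urls))
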
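To see the semaphore cap in action without touching the network, the following sketch replaces the HTTP call with a short sleep and records the highest number of tasks running at once. The names `run_limited` and `fake_fetch` are illustrative only.

```python
import asyncio


async def run_limited(n_tasks=20, limit=5):
    """Run n_tasks fake fetches and return the peak concurrency observed."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fake_fetch(i):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the network call
            active -= 1
        return i

    results = await asyncio.gather(*(fake_fetch(i) for i in range(n_tasks)))
    return peak, results
```

Calling `asyncio.run(run_limited(20, 5))` starts 20 tasks, but the recorded peak never exceeds 5, which is the behavior the semaphore is meant to guarantee.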
Rate Limits by Platform
| Platform | Typical Limit | Penalty |
|---|---|---|
| | ~100/min | CAPTCHA, block |
| Amazon | ~50/min | CAPTCHA |
| | Very strict | Account ban |
| Twitter/X | API limits | Temporary ban |
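A per-minute limit like those in the table translates directly into a minimum gap between requests. The sketch below does that arithmetic. The function name `min_interval` and the 0.8 safety factor are assumptions, chosen because the limits above are approximate.

```python
def min_interval(requests_per_minute, safety_factor=0.8):
    """Seconds to leave between requests to stay under a per-minute limit."""
    if requests_per_minute <= 0 or not 0 < safety_factor <= 1:
        raise ValueError("limit must be positive and safety_factor in (0, 1]")
    # Target only a fraction of the limit to leave some headroom
    return 60.0 / (requests_per_minute * safety_factor)
```

For a limit of roughly 50 requests per minute, this gives 60 / (50 × 0.8) = 1.5 seconds between requests. For roughly 100 per minute, it gives 0.75 seconds.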
Best Practices
- Start slow, increase gradually
- Monitor success rate
- Use rotating proxies
- Cache results to reduce the number of requests
- Scrape during off-peak hours
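The first two practices, starting slow and monitoring the success rate, can be combined into one small pacing helper. This is a sketch under assumed thresholds (a 95% success rate, doubling the delay on failure, shrinking it by 10% on success). The class name `AdaptiveDelay` and all defaults are hypothetical and would need tuning per site.

```python
from collections import deque


class AdaptiveDelay:
    """Adjust the delay between requests from the recent success rate."""

    def __init__(self, start_delay=3.0, min_delay=0.5, max_delay=60.0, window=20):
        self.delay = start_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.outcomes = deque(maxlen=window)  # True = success, False = blocked

    def success_rate(self):
        """Share of successful requests among the most recent `window` outcomes."""
        if not self.outcomes:
            return 1.0
        return sum(self.outcomes) / len(self.outcomes)

    def record(self, ok):
        """Record one request outcome and return the delay to use next."""
        self.outcomes.append(ok)
        if not ok:
            # Back off quickly after a blocked request
            self.delay = min(self.max_delay, self.delay * 2)
        elif self.success_rate() >= 0.95:
            # Speed up slowly while almost everything succeeds
            self.delay = max(self.min_delay, self.delay * 0.9)
        return self.delay
```

After each request, call `record()` with whether it succeeded and sleep for the returned number of seconds before sending the next one.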
VinaProxy – Getting Past Rate Limits
- Large pool of IPs to rotate through
- Trusted residential IPs
- Priced at just $0.5/GB
