# Async Web Scraping with Python: 10x Speedup

Async scraping can be 10-50x faster than synchronous scraping. This article shows how to use asyncio and aiohttp.
## Sync vs Async

| Sync | Async |
|---|---|
| 1 request at a time | Many concurrent requests |
| Waits for the response before continuing | Doesn't wait; does other work |
| 100 URLs = 100 x latency | 100 URLs ≈ 1 x latency |
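The latency math in the table can be demonstrated without any network at all. In this toy sketch, each "request" is just a short `asyncio.sleep`/`time.sleep` standing in for network latency; the page names and timings are illustrative only:

```python
import asyncio
import time

LATENCY = 0.1  # simulated per-request latency in seconds
N = 5

def fetch_sync(url):
    time.sleep(LATENCY)           # blocking wait: one request at a time
    return url

async def fetch_async(url):
    await asyncio.sleep(LATENCY)  # non-blocking wait: requests overlap
    return url

def run_sync(urls):
    start = time.perf_counter()
    for u in urls:
        fetch_sync(u)
    return time.perf_counter() - start

async def run_async(urls):
    start = time.perf_counter()
    await asyncio.gather(*(fetch_async(u) for u in urls))
    return time.perf_counter() - start

urls = [f"page{i}" for i in range(N)]
sync_time = run_sync(urls)                  # ≈ N x LATENCY
async_time = asyncio.run(run_async(urls))   # ≈ 1 x LATENCY
print(f"sync: {sync_time:.2f}s, async: {async_time:.2f}s")
```

With 5 simulated requests, the sync loop pays the latency 5 times while `asyncio.gather` pays it roughly once.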
## Installation

```bash
pip install aiohttp aiofiles
```
## Basic Async Scraping

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        for url, html in zip(urls, results):
            print(f"{url}: {len(html)} chars")

asyncio.run(main())
```
## Rate-Limited Async

```python
import asyncio
import aiohttp
from asyncio import Semaphore

async def fetch_limited(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            await asyncio.sleep(0.5)  # delay while holding the slot
            return await response.text()

async def main():
    semaphore = Semaphore(10)  # max 10 concurrent
    urls = [f'https://example.com/page{i}' for i in range(100)]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        success = sum(1 for r in results if not isinstance(r, Exception))
        print(f"Success: {success}/{len(urls)}")

asyncio.run(main())
```
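When scraping 100 URLs, `asyncio.gather` only reports at the very end. A hypothetical variant using `asyncio.as_completed` yields tasks as each one finishes, so you can report progress along the way. Here `fake_fetch` (a short sleep) stands in for a real aiohttp request so the sketch runs without a network:

```python
import asyncio

async def fake_fetch(url, semaphore):
    # Simulated request: a short sleep in place of session.get()
    async with semaphore:
        await asyncio.sleep(0.01)
        return url

async def main():
    semaphore = asyncio.Semaphore(10)
    urls = [f"page{i}" for i in range(25)]
    tasks = [asyncio.create_task(fake_fetch(u, semaphore)) for u in urls]
    done = 0
    # as_completed yields futures in completion order, not submission order
    for fut in asyncio.as_completed(tasks):
        await fut
        done += 1
        if done % 10 == 0:
            print(f"progress: {done}/{len(urls)}")
    return done

completed = asyncio.run(main())
print(f"completed {completed} requests")
```

The trade-off: `as_completed` loses the input ordering that `gather` preserves, so pair each result with its URL yourself if order matters.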
## Async with a Proxy

```python
import asyncio
import aiohttp

async def fetch_with_proxy(session, url, proxy):
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def main():
    url = 'https://example.com'
    proxy = "http://user:pass@proxy.vinaproxy.com:8080"
    async with aiohttp.ClientSession() as session:
        html = await fetch_with_proxy(session, url, proxy)
        print(html[:500])

asyncio.run(main())
```
## Error Handling

```python
import aiohttp

async def safe_fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None
```
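The same filter-out-failures pattern also works with `gather(..., return_exceptions=True)`: collect everything, then keep only successful results. In this self-contained sketch, `fake_fetch` (a short sleep that raises for URLs containing "bad") stands in for a real request; the names are illustrative:

```python
import asyncio

async def fake_fetch(url):
    # Simulated request: sleeps briefly, fails for "bad" URLs
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"simulated error for {url}")
    return f"<html>{url}</html>"

async def scrape_all(urls):
    results = await asyncio.gather(
        *(fake_fetch(u) for u in urls), return_exceptions=True
    )
    # gather preserves input order, so zip pairs each URL with its result;
    # drop anything that came back as an exception
    return {u: r for u, r in zip(urls, results) if not isinstance(r, Exception)}

urls = ["page1", "bad-page", "page2"]
pages = asyncio.run(scrape_all(urls))
print(f"fetched {len(pages)}/{len(urls)}")
```

Because `return_exceptions=True` puts the exception object in the results list instead of raising, one failed URL never cancels the rest of the batch.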
## Benchmark

```
# 100 URLs test
# Sync requests: ~60 seconds
# Async aiohttp: ~3 seconds
# Speedup: 20x
```
## Best Practices

- Use a Semaphore to limit concurrency
- Set a timeout on every request
- Handle exceptions with return_exceptions=True
- Reuse the ClientSession instead of creating one per request
## VinaProxy + Async Scraping

- Handles high concurrency
- Rotating IPs for parallel requests
- Only $0.5/GB
