Async Web Scraping with Python: 10x Speedup


Async scraping can be 10-50x faster than synchronous scraping. This article shows how to do it with asyncio and aiohttp.

Sync vs Async

Sync                          Async
One request at a time         Many requests concurrently
Waits for each response       Doesn't wait; moves on to other work
100 URLs = 100 x latency      100 URLs ≈ 1 x latency
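The table's claim can be checked without touching the network. The sketch below simulates request latency with asyncio.sleep (the fake_fetch helper and the 0.1 s delay are illustrative, not real HTTP calls): ten "requests" run back-to-back take about 1 second, while gathered concurrently they take about 0.1 seconds.

```python
import asyncio
import time

async def fake_fetch(url: str) -> str:
    # Simulate ~0.1s of network latency per request
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"

async def run_sequential(urls):
    # Await each request before starting the next (sync-style)
    return [await fake_fetch(u) for u in urls]

async def run_concurrent(urls):
    # Start all requests at once and wait for them together
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

urls = [f"page{i}" for i in range(10)]

start = time.perf_counter()
asyncio.run(run_sequential(urls))
sequential_time = time.perf_counter() - start

start = time.perf_counter()
asyncio.run(run_concurrent(urls))
concurrent_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.2f}s, Concurrent: {concurrent_time:.2f}s")
```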

Installation

pip install aiohttp aiofiles

Basic Async Scraping

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3',
    ]
    
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        
        for url, html in zip(urls, results):
            print(f"{url}: {len(html)} chars")

asyncio.run(main())

Rate Limited Async

import asyncio
import aiohttp
from asyncio import Semaphore

async def fetch_limited(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            await asyncio.sleep(0.5)  # Throttle: brief delay while holding the slot
            return await response.text()

async def main():
    semaphore = Semaphore(10)  # Max 10 concurrent
    urls = [f'https://example.com/page{i}' for i in range(100)]
    
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_limited(session, url, semaphore) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        success = sum(1 for r in results if not isinstance(r, Exception))
        print(f"Success: {success}/{len(urls)}")

asyncio.run(main())

Async with a Proxy

import asyncio
import aiohttp

async def fetch_with_proxy(session, url, proxy):
    async with session.get(url, proxy=proxy) as response:
        return await response.text()

async def main():
    url = "https://example.com"
    proxy = "http://user:pass@proxy.vinaproxy.com:8080"

    async with aiohttp.ClientSession() as session:
        html = await fetch_with_proxy(session, url, proxy)
        print(html[:500])

asyncio.run(main())

Error Handling

async def safe_fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as response:
            if response.status == 200:
                return await response.text()
            return None
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None
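Because safe_fetch returns None on failure instead of raising, it composes cleanly with a retry wrapper. A minimal sketch, assuming any async fetcher with that same contract (fetch_with_retry and the flaky_fetch demo are illustrative names, not part of aiohttp):

```python
import asyncio

async def fetch_with_retry(fetch_fn, url, retries=3, backoff=0.1):
    # Retry an async fetcher that returns None on failure,
    # waiting with exponential backoff between attempts.
    for attempt in range(retries):
        result = await fetch_fn(url)
        if result is not None:
            return result
        if attempt < retries - 1:
            await asyncio.sleep(backoff * (2 ** attempt))
    return None

# Demo with a fake fetcher that fails twice before succeeding
attempts = {"count": 0}

async def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        return None
    return "<html>ok</html>"

result = asyncio.run(fetch_with_retry(flaky_fetch, "https://example.com"))
print(result)
```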

Benchmark

# 100 URLs test
# Sync requests: ~60 seconds
# Async aiohttp: ~3 seconds
# Speed up: 20x!

Best Practices

  • Use a Semaphore to limit concurrency
  • Set a timeout on every request
  • Handle exceptions with return_exceptions=True
  • Reuse a single ClientSession
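The practices above can be combined into one pattern. A sketch, with a simulated fetcher standing in for a real aiohttp call (bounded_fetch and fake_fetch are assumed names for this example):

```python
import asyncio

async def bounded_fetch(fetch_fn, url, semaphore, timeout=30):
    # Concurrency limit + per-request timeout; return None on any
    # failure so asyncio.gather() never raises.
    async with semaphore:
        try:
            return await asyncio.wait_for(fetch_fn(url), timeout=timeout)
        except Exception:
            return None

async def fake_fetch(url):
    # Stand-in for a real aiohttp request; URLs ending in "3"
    # simulate failures (page3, page13)
    await asyncio.sleep(0.05)
    if url.endswith("3"):
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"

async def main():
    semaphore = asyncio.Semaphore(10)  # Max 10 concurrent
    urls = [f"https://example.com/page{i}" for i in range(20)]
    results = await asyncio.gather(
        *(bounded_fetch(fake_fetch, u, semaphore) for u in urls)
    )
    return sum(1 for r in results if r is not None)

ok_count = asyncio.run(main())
print(f"Success: {ok_count}/20")
```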

VinaProxy + Async Scraping

  • Handles high concurrency
  • Rotating IPs for parallel requests
  • Only $0.5/GB

Try It Now →