Scaling Web Scrapers: From Script to Production System
When scraping at scale, a single script is no longer enough. This article walks through scaling scrapers from a simple script up to an enterprise-level system.
The Levels of Scale
- Level 1: Single script, sequential
- Level 2: Async/threading
- Level 3: Multiple processes
- Level 4: Distributed (multiple machines)
- Level 5: Cloud auto-scaling
Level 1: Sequential (Baseline)
# Simple, slow, but works
import time

for url in urls:
    data = scrape(url)
    save(data)
    time.sleep(1)  # polite delay between requests

# ~1 request/second = 3,600/hour
Level 2: Async with aiohttp
import asyncio

import aiohttp

async def scrape(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    # Cap concurrent connections at the connector level
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [scrape(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# ~50 concurrent = 50,000+/hour
results = asyncio.run(main(urls))
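Raising the connector limit alone does not protect target sites: every URL still becomes a task at once. An asyncio.Semaphore caps how many requests are in flight at any moment. A minimal sketch of the pattern (the network call is simulated with asyncio.sleep; fetch and the URLs are placeholders, not the article's real scraper):

```python
import asyncio

async def fetch(url):
    # Placeholder for a real aiohttp request
    await asyncio.sleep(0.01)
    return f"body of {url}"

async def bounded_scrape(sem, url):
    async with sem:  # at most `limit` coroutines pass this point at once
        return await fetch(url)

async def main(urls, limit=10):
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(bounded_scrape(sem, u) for u in urls))

results = asyncio.run(main([f"https://example.com/{i}" for i in range(25)]))
print(len(results))  # 25
```

gather preserves input order, so results line up with urls even though completion order varies.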
Level 3: Multiprocessing
from multiprocessing import Pool

import requests

def scrape_url(url):
    response = requests.get(url)
    return parse(response.text)  # parse() is your HTML-parsing function

# Use all CPU cores (the __main__ guard is required on spawn-based platforms)
if __name__ == '__main__':
    with Pool(processes=8) as pool:
        results = pool.map(scrape_url, urls)

# Good for CPU-bound parsing
Level 4: Distributed with Celery
# tasks.py
from celery import Celery

app = Celery('scraper', broker='redis://localhost:6379/0')

@app.task
def scrape_task(url):
    data = scrape(url)
    save_to_db(data)
    return {'url': url, 'status': 'success'}

# Dispatch tasks
from tasks import scrape_task

for url in urls:
    scrape_task.delay(url)

# Run workers on multiple machines:
#   celery -A tasks worker --concurrency=10
Level 5: Kubernetes Auto-scaling
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper
  template:
    metadata:
      labels:
        app: scraper
    spec:
      containers:
        - name: scraper
          image: my-scraper:latest
          resources:
            requests:  # required for the HPA's CPU-utilization metric
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Queue-based Architecture
import json

import redis

r = redis.Redis()

# Producer: add URLs to the queue
def enqueue_urls(urls):
    for url in urls:
        r.lpush('scrape_queue', url)

# Consumer: process URLs
def worker():
    while True:
        item = r.brpop('scrape_queue', timeout=5)  # (queue_name, value) or None
        if item:
            data = scrape(item[1].decode())
            r.lpush('results_queue', json.dumps(data))
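The same producer/consumer pattern can be prototyped without a Redis server using the stdlib queue module, which is handy for testing worker logic locally. A sketch under that assumption (fake_scrape and run_pipeline are illustrative names, and the scrape step is a stand-in for a real HTTP request):

```python
import queue
import threading

def fake_scrape(url):
    # Placeholder for a real HTTP request + parse
    return {"url": url, "length": len(url)}

def run_pipeline(urls, n_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    for url in urls:
        tasks.put(url)  # producer: everything enqueued up front

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            results.put(fake_scrape(url))
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results.get() for _ in range(results.qsize())]

out = run_pipeline([f"https://example.com/{i}" for i in range(20)])
print(len(out))  # 20
```

Swapping queue.Queue for Redis lists (as above) is what lets the workers move onto separate machines.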
Throughput Comparison
| Level | Throughput | Complexity |
|---|---|---|
| Sequential | ~3,600/hour | Low |
| Async | ~50,000/hour | Medium |
| Multiprocess | ~100,000/hour | Medium |
| Distributed | ~1M+/hour | High |
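The table's figures follow from a simple relation: steady-state throughput ≈ concurrency / average request latency. This helper computes the theoretical upper bound; real-world overhead (parsing, retries, rate limits) pulls the numbers down toward the more conservative values in the table. The latency values here are illustrative assumptions:

```python
def requests_per_hour(concurrency, avg_latency_s):
    """Approximate upper-bound throughput for I/O-bound scraping."""
    return int(concurrency / avg_latency_s * 3600)

# 1 worker, 1 s per request (sequential baseline)
print(requests_per_hour(1, 1.0))   # 3600
# 50 concurrent requests, ~1 s latency each (theoretical ceiling)
print(requests_per_hour(50, 1.0))  # 180000
```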
VinaProxy + Scaling
- Unlimited concurrent connections
- Auto IP rotation
- Pricing from just $0.5/GB
