Scaling Web Scrapers: From Script to Production System

As scraping workloads grow, a single script stops being enough. This article walks through scaling scrapers from a simple sequential loop up to an enterprise-level distributed system.

The Scaling Levels

  • Level 1: Single script, sequential
  • Level 2: Async/threading
  • Level 3: Multiple processes
  • Level 4: Distributed (multiple machines)
  • Level 5: Cloud auto-scaling

Level 1: Sequential (Baseline)

import time

# Simple, slow, but works (scrape() and save() are your own functions)
for url in urls:
    data = scrape(url)
    save(data)
    time.sleep(1)  # be polite: ~1 request/second

# ~1 request/second = 3,600/hour
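
Even at this level, it pays to pull the fixed `time.sleep(1)` into a small helper so the pacing survives later refactors. A minimal sketch (the `RateLimiter` class and its `interval` parameter are my own naming, not from any library):

```python
import time

class RateLimiter:
    """Ensure at least `interval` seconds pass between successive calls."""
    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, not the full second
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(1.0)  # ~1 request/second, as in the loop above
```

Calling `limiter.wait()` at the top of the loop replaces the bare `time.sleep(1)` and stops you from sleeping after requests that were already slow.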

Level 2: Async with aiohttp

import aiohttp
import asyncio

async def scrape(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    connector = aiohttp.TCPConnector(limit=50)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [scrape(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return results

# ~50 concurrent = 50,000+/hour
asyncio.run(main(urls))
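
One caveat with the `gather()` call above: a single failed request raises and discards the whole batch. Passing `return_exceptions=True` keeps the good results. A runnable sketch with a stand-in coroutine (`fetch` here is hypothetical, not the `scrape()` above):

```python
import asyncio

async def fetch(i):
    # Stand-in for a real request; one input deliberately fails
    if i == 2:
        raise ValueError("bad page")
    return i * 10

async def main():
    results = await asyncio.gather(
        *(fetch(i) for i in range(4)),
        return_exceptions=True,  # exceptions come back as values, not raises
    )
    return [r for r in results if not isinstance(r, Exception)]

print(asyncio.run(main()))  # → [0, 10, 30]
```

In a real scraper you would log or re-queue the `Exception` entries instead of dropping them.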

Level 3: Multiprocessing

from multiprocessing import Pool
import requests

def scrape_url(url):
    response = requests.get(url)
    return parse(response.text)  # parse() is your own HTML parser

# Use all CPU cores (the __main__ guard is required for multiprocessing)
if __name__ == '__main__':
    with Pool(processes=8) as pool:
        results = pool.map(scrape_url, urls)

# Good for CPU-bound parsing

Level 4: Distributed with Celery

# tasks.py
from celery import Celery

app = Celery('scraper', broker='redis://localhost:6379/0')

@app.task
def scrape_task(url):
    data = scrape(url)
    save_to_db(data)
    return {'url': url, 'status': 'success'}

# Dispatch tasks
from tasks import scrape_task

for url in urls:
    scrape_task.delay(url)

# Run workers on multiple machines:
# celery -A tasks worker --concurrency=10
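
In production, tasks fail constantly, and Celery covers this with `@app.task(bind=True, max_retries=...)` plus `self.retry()`. The underlying pattern, sketched framework-free with an exponential backoff whose values are my own choice, not from the article:

```python
import time

def with_retries(func, attempts=3, base_delay=0.01):
    """Call func(); on failure, retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Usage: a stand-in task that fails twice, then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("timeout")
    return "ok"

print(with_retries(flaky))  # → ok
```

With Celery you get the same behavior plus persistence: the retry is re-enqueued on the broker, so it survives a worker crash.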

Level 5: Kubernetes Auto-scaling

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper
  template:
    metadata:
      labels:
        app: scraper
    spec:
      containers:
      - name: scraper
        image: my-scraper:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
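
Autoscaling cuts both ways: when the HPA scales down, Kubernetes sends SIGTERM to the pods it removes. A worker loop should catch it, finish the current item, and exit cleanly instead of dying mid-scrape. A minimal sketch (`drain` and its arguments are my own illustration):

```python
import signal

shutdown = False

def handle_sigterm(signum=None, frame=None):
    """Flag the loop to stop after the current item."""
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(items, process):
    """Process queued items until the queue empties or SIGTERM arrives."""
    done = []
    while items and not shutdown:
        done.append(process(items.pop(0)))
    return done
```

Kubernetes waits `terminationGracePeriodSeconds` (30s by default) after SIGTERM before sending SIGKILL, so keeping individual scrape tasks short makes scale-down lossless.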

Queue-based Architecture

import redis
import json

r = redis.Redis()

# Producer: Add URLs to queue
def enqueue_urls(urls):
    for url in urls:
        r.lpush('scrape_queue', url)

# Consumer: process URLs (run many workers in parallel)
def worker():
    while True:
        item = r.brpop('scrape_queue', timeout=5)  # (key, value) tuple, or None on timeout
        if item:
            data = scrape(item[1].decode())  # scrape() is your own function
            r.lpush('results_queue', json.dumps(data))
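
At scale, the same URL tends to get enqueued many times (retries, overlapping crawls). The usual fix is to guard the LPUSH with a Redis set, since SADD returns 1 only for members it actually added. The idea, sketched with an in-memory set and my own function name:

```python
def enqueue_unique(urls, queue, seen):
    """Push only URLs not seen before; returns how many were added."""
    added = 0
    for url in urls:
        if url not in seen:   # in Redis: if r.sadd('seen_urls', url)
            seen.add(url)
            queue.append(url)
            added += 1
    return added

queue, seen = [], set()
print(enqueue_unique(["/a", "/b", "/a"], queue, seen))  # → 2
```

The `seen` set grows forever in this form; in Redis you would put a TTL on it or use a key per crawl run.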

Throughput Comparison

Level         Throughput       Complexity
Sequential    ~3,600/hour      Low
Async         ~50,000/hour     Medium
Multiprocess  ~100,000/hour    Medium
Distributed   ~1M+/hour        High
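
The figures above all come from one formula: requests/hour ≈ concurrency ÷ seconds-per-request × 3600. The per-request latencies below are assumptions I picked to reproduce the table, not measurements:

```python
def hourly_throughput(concurrency: int, seconds_per_request: float) -> int:
    """requests/hour = concurrency / latency * 3600"""
    return round(concurrency / seconds_per_request * 3600)

print(hourly_throughput(1, 1.0))   # sequential baseline → 3600
print(hourly_throughput(50, 3.6))  # 50 concurrent, ~3.6 s/page → 50000
```

Plugging in your own measured latency tells you how much concurrency you need before reaching for the next level.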

VinaProxy + Scaling

  • Unlimited concurrent connections
  • Auto IP rotation
  • Only $0.5/GB

Try It Now →