Cloud Scraping: Chạy Scrapers Trên Cloud

Running scrapers in the cloud brings reliability and scale. This article walks through the main options for deploying scrapers to the cloud.

Why Use the Cloud?

  • 24/7 uptime: no dependence on your local machine
  • Scalability: scale resources up when needed
  • Global IPs: scrape from many locations
  • Cost-effective: pay-per-use

Option 1: AWS Lambda (Serverless)

# handler.py
import json
import requests
from bs4 import BeautifulSoup

def scrape_handler(event, context):
    url = event.get('url')
    if not url:
        return {'statusCode': 400, 'body': json.dumps({'error': 'missing url'})}

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    # Parse data (guard against a page with no <h1>)
    title_tag = soup.select_one('h1')
    data = {
        'title': title_tag.get_text(strip=True) if title_tag else None,
        'url': url
    }

    return {
        'statusCode': 200,
        'body': json.dumps(data)
    }

# serverless.yml
# service: scraper
# provider:
#   name: aws
#   runtime: python3.9
# functions:
#   scrape:
#     handler: handler.scrape_handler
#     timeout: 30
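
With the serverless.yml above, deploying and testing comes down to two commands (assuming the Serverless Framework CLI is installed and AWS credentials are configured):

```shell
# Deploy the stack, then invoke the function with a test payload
serverless deploy
serverless invoke -f scrape -d '{"url": "https://example.com"}'
```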

Option 2: Google Cloud Functions

# main.py
import functions_framework
import requests
from bs4 import BeautifulSoup

@functions_framework.http
def scrape(request):
    url = request.args.get('url')
    if not url:
        return {'error': 'missing url parameter'}, 400

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')

    # Guard against a page with no <h1>
    title_tag = soup.select_one('h1')
    return {
        'title': title_tag.get_text(strip=True) if title_tag else None,
        'status': 'success'
    }

# Deploy (--allow-unauthenticated makes the HTTP endpoint publicly callable):
# gcloud functions deploy scrape --runtime python39 --trigger-http --allow-unauthenticated
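
Once deployed, the function can be tested with curl (REGION and PROJECT_ID below are placeholders for your own values; gcloud prints the exact URL after deploying):

```shell
# Default HTTP trigger URL pattern for 1st-gen Cloud Functions
curl "https://REGION-PROJECT_ID.cloudfunctions.net/scrape?url=https://example.com"
```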

Option 3: DigitalOcean Droplet

# Setup VPS
ssh root@your-droplet-ip

# Install dependencies
apt update && apt install -y python3-pip
pip3 install requests beautifulsoup4 lxml

# Clone and run
git clone https://github.com/your/scraper
cd scraper
python3 scraper.py

# Use screen for background
screen -S scraper
python3 scraper.py
# Ctrl+A, D to detach
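
For scheduled rather than continuous runs, a cron entry is simpler than keeping a screen session alive. A sketch, where the paths are assumptions based on the clone step above:

```shell
# Open the crontab for editing
crontab -e

# Add a line like this to run the scraper daily at 6 AM and append output to a log:
# 0 6 * * * cd /root/scraper && /usr/bin/python3 scraper.py >> /var/log/scraper.log 2>&1
```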

Option 4: GitHub Actions (Free)

# .github/workflows/scrape.yml
name: Daily Scrape

on:
  schedule:
    - cron: '0 6 * * *'  # 6 AM UTC daily

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      
      - name: Install deps
        run: pip install requests beautifulsoup4
      
      - name: Run scraper
        run: python scraper.py
        env:
          PROXY_URL: ${{ secrets.PROXY_URL }}
      
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: scraped-data
          path: output/
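
The workflow above assumes a scraper.py that reads PROXY_URL from the environment and writes results into output/ for the upload-artifact step. A minimal sketch, where the target URL and output format are assumptions:

```python
# scraper.py -- minimal sketch matching the workflow above
import json
import os
import pathlib

import requests
from bs4 import BeautifulSoup

def parse_title(html):
    """Extract the first <h1> text, or None if the page has no <h1>."""
    soup = BeautifulSoup(html, 'html.parser')
    h1 = soup.select_one('h1')
    return h1.get_text(strip=True) if h1 else None

def main():
    # PROXY_URL is injected from repository secrets by the workflow
    proxy = os.environ.get('PROXY_URL')
    proxies = {'http': proxy, 'https': proxy} if proxy else None

    response = requests.get('https://example.com', proxies=proxies, timeout=30)

    # Write where the upload-artifact step expects to find results
    out_dir = pathlib.Path('output')
    out_dir.mkdir(exist_ok=True)
    (out_dir / 'data.json').write_text(
        json.dumps({'title': parse_title(response.text)})
    )

if __name__ == '__main__':
    main()
```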

Option 5: Render.com

# render.yaml
services:
  - type: worker
    name: scraper
    env: python
    buildCommand: pip install -r requirements.txt
    startCommand: python scraper.py
    plan: starter

# Cron job
  - type: cron
    name: daily-scrape
    schedule: "0 6 * * *"
    buildCommand: pip install -r requirements.txt
    startCommand: python scraper.py

Cost Comparison

| Service        | Cost                    | Best For       |
|----------------|-------------------------|----------------|
| AWS Lambda     | ~$0.20/1M requests      | Sporadic jobs  |
| GCP Functions  | ~$0.40/1M requests      | Quick deploys  |
| DigitalOcean   | $5-20/month             | 24/7 scraping  |
| GitHub Actions | Free (2,000 mins/month) | Daily jobs     |
| Render         | Free tier available     | Simple deploys |

VinaProxy + Cloud Scraping

  • Works with all cloud providers
  • API-based authentication
  • Pricing from just $0.5/GB
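
All of the options above can route traffic through a proxy by passing a proxies dict to requests. A sketch, where the credentials and endpoint are placeholders rather than real VinaProxy values:

```python
import requests

# Placeholder credentials -- substitute your VinaProxy username, password, and endpoint
PROXY = 'http://USERNAME:PASSWORD@proxy.example.com:8000'

# requests routes both plain and TLS traffic through the same proxy URL
proxies = {'http': PROXY, 'https': PROXY}

def fetch(url):
    """Fetch a page through the proxy, failing fast on HTTP errors."""
    response = requests.get(url, proxies=proxies, timeout=15)
    response.raise_for_status()
    return response.text
```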

Try It Now →