Cloud Scraping: Running Scrapers in the Cloud
Running scrapers in the cloud gives you reliability and scale. This article walks through five ways to deploy a scraper to the cloud.
Why Use the Cloud?
- 24/7 uptime: no dependence on your local machine
- Scalability: add resources on demand
- Global IPs: scrape from multiple locations
- Cost-effective: pay-per-use pricing
Option 1: AWS Lambda (Serverless)
# handler.py
import json
import requests
from bs4 import BeautifulSoup

def scrape_handler(event, context):
    url = event.get('url')
    response = requests.get(url, timeout=20)  # stay under the Lambda timeout
    soup = BeautifulSoup(response.text, 'lxml')
    # Parse data, guarding against a missing <h1>
    title = soup.select_one('h1')
    data = {
        'title': title.text.strip() if title else None,
        'url': url
    }
    return {
        'statusCode': 200,
        'body': json.dumps(data)
    }
# serverless.yml
# service: scraper
# provider:
#   name: aws
#   runtime: python3.9
# functions:
#   scrape:
#     handler: handler.scrape_handler
#     timeout: 30
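The parsing step in the handler can be exercised locally before deploying, by feeding BeautifulSoup a canned HTML string (the sample markup below is made up for illustration; `html.parser` is used here to avoid needing lxml installed locally):

```python
from bs4 import BeautifulSoup

# Canned page standing in for a live response body
sample_html = "<html><body><h1>Example Product</h1></body></html>"

soup = BeautifulSoup(sample_html, "html.parser")
h1 = soup.select_one("h1")
title = h1.text.strip() if h1 else None
print(title)  # Example Product
```

Testing the parse logic this way is much faster than a deploy-invoke-check-logs loop on Lambda.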
Option 2: Google Cloud Functions
# main.py
import functions_framework
import requests
from bs4 import BeautifulSoup

@functions_framework.http
def scrape(request):
    url = request.args.get('url')
    if not url:
        return {'status': 'error', 'message': 'missing url parameter'}, 400
    response = requests.get(url, timeout=20)
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.select_one('h1')
    return {
        'title': title.text.strip() if title else None,
        'status': 'success'
    }
# Deploy:
# gcloud functions deploy scrape --runtime python39 --trigger-http
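Once deployed, the function is called over HTTP with the target page in a query parameter; `urlencode` handles the escaping. The endpoint below is a placeholder — substitute your own region and project after deploying:

```python
from urllib.parse import urlencode

# Placeholder endpoint -- the real URL is printed by `gcloud functions deploy`
endpoint = "https://REGION-PROJECT.cloudfunctions.net/scrape"
query = urlencode({"url": "https://example.com/page"})
request_url = f"{endpoint}?{query}"
print(request_url)
```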
Option 3: DigitalOcean Droplet
# Set up the VPS
ssh root@your-droplet-ip
# Install dependencies
apt update && apt install -y python3-pip
pip3 install requests beautifulsoup4 lxml
# Clone and run
git clone https://github.com/your/scraper
cd scraper
python3 scraper.py
# Use screen for background
screen -S scraper
python3 scraper.py
# Ctrl+A, D to detach
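screen is fine for quick runs, but it won't restart the scraper after a crash or reboot. A systemd unit handles both; this is a minimal sketch, and the paths and service name (`scraper`) are assumptions to adapt:

```ini
# /etc/systemd/system/scraper.service
[Unit]
Description=Web scraper
After=network-online.target

[Service]
WorkingDirectory=/root/scraper
ExecStart=/usr/bin/python3 scraper.py
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now scraper` and follow logs with `journalctl -u scraper -f`.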
Option 4: GitHub Actions (Free)
# .github/workflows/scrape.yml
name: Daily Scrape

on:
  schedule:
    - cron: '0 6 * * *'  # 6 AM UTC daily

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install deps
        run: pip install requests beautifulsoup4
      - name: Run scraper
        run: python scraper.py
        env:
          PROXY_URL: ${{ secrets.PROXY_URL }}
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: scraped-data
          path: output/
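The upload step expects scraper.py to leave its results under output/. A minimal sketch of that contract (the record here is a placeholder, and the actual fetching is elided):

```python
import json
from pathlib import Path

def save_results(records, out_dir="output"):
    """Write scraped records where the upload-artifact step will find them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "data.json"
    path.write_text(json.dumps(records, ensure_ascii=False, indent=2))
    return path

# Placeholder record standing in for real scraped data
saved = save_results([{"title": "Example", "url": "https://example.com"}])
```

Anything written outside output/ is discarded when the runner shuts down, so the path in the workflow and the path in the script must agree.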
Option 5: Render.com
# render.yaml
services:
  - type: worker
    name: scraper
    env: python
    buildCommand: pip install -r requirements.txt
    startCommand: python scraper.py
    plan: starter

  # Cron job
  - type: cron
    name: daily-scrape
    schedule: "0 6 * * *"
    buildCommand: pip install -r requirements.txt
    startCommand: python scraper.py
Cost Comparison
| Service | Cost | Best For |
|---|---|---|
| AWS Lambda | ~$0.20/1M requests | Sporadic jobs |
| GCP Functions | ~$0.40/1M requests | Quick deploys |
| DigitalOcean | $5-20/month | 24/7 scraping |
| GitHub Actions | Free (2,000 mins/month) | Daily jobs |
| Render | Free tier available | Simple deploys |
VinaProxy + Cloud Scraping
- Works with all of the cloud providers above
- API-based authentication
- From just $0.5/GB
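Wiring a proxy into any of the deployments above comes down to requests' `proxies` setting, fed from an environment variable or secret (such as the PROXY_URL secret in the GitHub Actions example). The gateway URL format below is a generic user:pass example, not a specific VinaProxy endpoint:

```python
import os
import requests

def make_session() -> requests.Session:
    """Build a session that routes all traffic through the configured proxy."""
    session = requests.Session()
    proxy_url = os.environ.get("PROXY_URL")  # e.g. "http://user:pass@gateway.example:8080"
    if proxy_url:
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session

# session = make_session()
# response = session.get("https://example.com", timeout=20)
```

Keeping the proxy URL in an environment variable means the same code runs unchanged on Lambda, Cloud Functions, a Droplet, or GitHub Actions — only the secret configuration differs per platform.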
