Lên Lịch Web Scraping: Tự Động Hóa Với Cron Và Python


Manual scraping wastes time. This article shows how to automate scraping on a schedule.

Scheduling Options

  • Cron (Linux): Built-in, reliable
  • Task Scheduler (Windows): GUI-based
  • APScheduler (Python): In-app scheduling
  • Celery: Distributed task queue

Linux Cron

# Edit crontab
crontab -e

# Format: minute hour day month weekday command
# Run every hour, on the hour
0 * * * * /usr/bin/python3 /path/to/scraper.py

# Run every day at 6:00 AM
0 6 * * * /usr/bin/python3 /path/to/scraper.py

# Run every 15 minutes
*/15 * * * * /usr/bin/python3 /path/to/scraper.py

# Run Monday through Friday at 9:00 AM
0 9 * * 1-5 /usr/bin/python3 /path/to/scraper.py
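By default, cron either mails a job's output to the local user or discards it, so a failing scraper can die silently. A common pattern is to redirect both stdout and stderr into a log file; the log path below is just an example:

```shell
# Append stdout and stderr to a log file so failures are visible later
0 6 * * * /usr/bin/python3 /path/to/scraper.py >> /var/log/scraper.log 2>&1
```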

A Python Script for Cron

#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime

def scrape():
    url = "https://example.com/products"
    response = requests.get(url, timeout=30)  # never hang an unattended job
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    
    products = []
    for item in soup.select('.product'):
        products.append({
            'name': item.select_one('.name').text.strip(),
            'price': item.select_one('.price').text.strip(),
            'scraped_at': datetime.now().isoformat()
        })
    
    # Save with a timestamp in the filename
    filename = f"products_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
    with open(filename, 'w') as f:
        json.dump(products, f)
    
    print(f"Scraped {len(products)} products to {filename}")

if __name__ == '__main__':
    scrape()
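Scheduled jobs run unattended, so a single transient network error can waste an entire run. One way to harden the script above is to give `requests` automatic retries; a minimal sketch (the retry counts and status codes are illustrative choices, not requirements):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries=3, backoff=1.0):
    """Build a requests Session that retries transient failures.

    Retries connection errors and the listed HTTP status codes,
    waiting backoff * 2**n seconds between attempts.
    """
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry))
    session.mount('http://', HTTPAdapter(max_retries=retry))
    return session
```

In the script, replace `requests.get(url, timeout=30)` with `make_session().get(url, timeout=30)`.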

APScheduler (Python)

from apscheduler.schedulers.blocking import BlockingScheduler

def scrape_job():
    print("Running scraper...")
    # Your scraping code here

scheduler = BlockingScheduler()

# Every hour
scheduler.add_job(scrape_job, 'interval', hours=1)

# Every day at 6:00
scheduler.add_job(scrape_job, 'cron', hour=6, minute=0)

# Every 30 minutes
scheduler.add_job(scrape_job, 'interval', minutes=30)

scheduler.start()
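Conceptually, an interval trigger is just a sleep loop that accounts for how long the job itself takes. If adding a dependency is not an option, a stdlib version can be sketched like this (`max_runs` is added here only so the loop can terminate):

```python
import time

def run_every(interval_seconds, job, max_runs=None):
    """Call `job` every interval_seconds, compensating for job runtime."""
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Sleep only for the time remaining in this interval
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_seconds - elapsed))
```

Unlike APScheduler, this runs everything in one thread and offers no cron-style triggers, so it is only a fallback for the simplest cases.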

Celery (Distributed)

from celery import Celery
from celery.schedules import crontab

# Celery needs a message broker; Redis shown here as an example
app = Celery('scraper', broker='redis://localhost:6379/0')

@app.task
def scrape_products():
    # Scraping logic
    pass

# Beat schedule: run the task at minute 0 of every hour
app.conf.beat_schedule = {
    'scrape-every-hour': {
        'task': 'scraper.scrape_products',
        'schedule': crontab(minute=0),
    },
}
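The beat schedule only fires if a beat process is running alongside a worker: beat enqueues tasks on schedule, and the worker executes them. Assuming the module above is named `scraper`, the two processes are started like this:

```shell
# Start a worker that executes the tasks
celery -A scraper worker --loglevel=info

# Start the beat scheduler that enqueues them on schedule
celery -A scraper beat --loglevel=info
```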

Logging Scheduled Jobs

import logging

logging.basicConfig(
    filename='/var/log/scraper.log',  # /var/log usually needs root; use a user-writable path otherwise
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

def scrape():
    logging.info("Starting scheduled scrape")
    try:
        count = 0
        # Scrape logic goes here; update `count` as items are collected
        logging.info(f"Scraped {count} items")
    except Exception as e:
        logging.error(f"Scrape failed: {e}")
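A job that runs every 15 minutes will grow its log file without bound. The stdlib `RotatingFileHandler` caps it; a minimal sketch (the logger name and size limits here are arbitrary choices):

```python
import logging
from logging.handlers import RotatingFileHandler

def get_logger(path='scraper.log'):
    """Logger whose file rotates at ~1 MB, keeping 5 old copies."""
    logger = logging.getLogger('scraper')
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=5)
        handler.setFormatter(logging.Formatter('%(asctime)s - %(message)s'))
        logger.addHandler(handler)
    return logger
```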

Best Practices

  • Log every run so failures can be debugged
  • Handle errors gracefully
  • Set a timeout for each job
  • Monitor job success rates
  • Scrape during off-peak hours
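The per-job timeout deserves an example: a scraper stuck on a hung connection can otherwise pile up with the next scheduled run. One approach is to launch the scraper as a subprocess and kill it past a deadline, sketched here with the stdlib:

```python
import subprocess

def run_with_timeout(cmd, timeout_s=300):
    """Run the scraper as a subprocess; kill it if it exceeds timeout_s.

    Returns the exit code, or None if the job was killed.
    """
    try:
        done = subprocess.run(cmd, timeout=timeout_s)
        return done.returncode
    except subprocess.TimeoutExpired:
        return None
```

From cron or a scheduler you would call something like `run_with_timeout(['/usr/bin/python3', '/path/to/scraper.py'])`.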

VinaProxy + Scheduled Scraping

  • Reliable proxies for automated jobs
  • 24/7 uptime
  • Pricing from just $0.5/GB

Try It Now →