Scheduling Web Scraping: Automating With Cron And Python
Manual scraping is time-consuming. This article shows how to automate scraping on a schedule.
Scheduling Methods
- Cron (Linux): Built-in, reliable
- Task Scheduler (Windows): GUI-based
- APScheduler (Python): In-app scheduling
- Celery: Distributed task queue
Linux Cron
# Edit crontab
crontab -e
# Format: minute hour day month weekday command
# Run every hour
0 * * * * /usr/bin/python3 /path/to/scraper.py
# Run every day at 6:00 AM
0 6 * * * /usr/bin/python3 /path/to/scraper.py
# Run every 15 minutes
*/15 * * * * /usr/bin/python3 /path/to/scraper.py
# Run Monday through Friday at 9:00 AM
0 9 * * 1-5 /usr/bin/python3 /path/to/scraper.py
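Cron discards a job's stdout and stderr unless you redirect them (or rely on local mail delivery). A sketch of the hourly entry with its output appended to a log file, so failed runs leave a trace — the log path here is an arbitrary choice:

```shell
# Append both stdout and stderr to a log file
0 * * * * /usr/bin/python3 /path/to/scraper.py >> /home/user/scraper_cron.log 2>&1
```

Note that cron runs jobs with a minimal environment (no shell profile), so use absolute paths for the interpreter, the script, and any files it touches.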
Python Script For Cron
#!/usr/bin/env python3
import json
from datetime import datetime

import requests
from bs4 import BeautifulSoup

def scrape():
    url = "https://example.com/products"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    products = []
    for item in soup.select('.product'):
        products.append({
            'name': item.select_one('.name').text.strip(),
            'price': item.select_one('.price').text.strip(),
            'scraped_at': datetime.now().isoformat()
        })
    # Save with a timestamp in the filename
    filename = f"products_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
    with open(filename, 'w') as f:
        json.dump(products, f)
    print(f"Scraped {len(products)} products to {filename}")

if __name__ == '__main__':
    scrape()
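If a run takes longer than the schedule interval, cron will happily start a second copy alongside the first. One common guard is a non-blocking lock file — a Unix-only sketch using `fcntl.flock`, with an arbitrary lock path:

```python
import fcntl

def acquire_lock(path):
    """Return an open, exclusively locked file handle, or None if another
    process already holds the lock. The lock is released automatically
    when the handle is closed (e.g. when the process exits)."""
    f = open(path, 'w')
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```

At the top of the cron-run script, call `lock = acquire_lock('/tmp/scraper.lock')` and exit early if it returns `None` — the previous run is still in progress.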
APScheduler (Python)
from apscheduler.schedulers.blocking import BlockingScheduler

def scrape_job():
    print("Running scraper...")
    # Your scraping code here

scheduler = BlockingScheduler()

# Every hour
scheduler.add_job(scrape_job, 'interval', hours=1)

# Every day at 6:00
scheduler.add_job(scrape_job, 'cron', hour=6, minute=0)

# Every 30 minutes
scheduler.add_job(scrape_job, 'interval', minutes=30)

scheduler.start()
Celery (Distributed)
from celery import Celery
from celery.schedules import crontab

app = Celery('scraper')

@app.task
def scrape_products():
    # Scraping logic
    pass

# Beat schedule (e.g. in celeryconfig.py)
app.conf.beat_schedule = {
    'scrape-every-hour': {
        'task': 'scraper.scrape_products',
        'schedule': crontab(minute=0),
    },
}
Logging Scheduled Jobs
import logging

logging.basicConfig(
    filename='/var/log/scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

def scrape():
    logging.info("Starting scheduled scrape")
    try:
        items = []  # scrape logic goes here and fills `items`
        logging.info(f"Scraped {len(items)} items")
    except Exception as e:
        logging.error(f"Scrape failed: {e}")
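Writing to /var/log/scraper.log usually requires root, and an ever-growing log file eventually fills the disk. A sketch using the standard library's RotatingFileHandler with a user-writable path — the file name and size limits here are arbitrary choices:

```python
import logging
from logging.handlers import RotatingFileHandler

# Keep at most ~1 MB per file and 5 rotated backups
handler = RotatingFileHandler('scraper.log', maxBytes=1_000_000, backupCount=5)
handler.setFormatter(logging.Formatter('%(asctime)s - %(message)s'))

logger = logging.getLogger('scraper')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Starting scheduled scrape")
```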
Best Practices
- Log every run for easier debugging
- Handle errors gracefully
- Set a timeout for each job
- Monitor job success rates
- Scrape during off-peak hours
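"Handle errors gracefully" can be made concrete with a small retry wrapper — a sketch (the function name and defaults are ours) that retries a failing job a few times before giving up:

```python
import logging
import time

def run_with_retries(job, retries=3, delay=5):
    """Call job(); on exception, wait `delay` seconds and retry,
    up to `retries` attempts in total."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except Exception as e:
            logging.error(f"Attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"Job failed after {retries} attempts")
```

The cron entry point would call `run_with_retries(scrape, retries=3, delay=60)`; per-request timeouts (`requests.get(url, timeout=30)`) still belong inside the job itself.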
VinaProxy + Scheduled Scraping
- Reliable proxies for automated jobs
- 24/7 uptime
- Pricing from just $0.5/GB
