Web Scraping Best Practices: The Complete Checklist
All the best practices consolidated into a single checklist. Use it for review before deploying scrapers.
🔧 Setup
- ☐ A dedicated virtual environment per project
- ☐ requirements.txt với pinned versions
- ☐ .env file for credentials (never commit it!)
- ☐ Proper logging configuration
- ☐ Error handling framework
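The logging item above can be sketched in a few lines. This is a minimal setup, assuming the log level comes from a `LOG_LEVEL` environment variable (a hypothetical name; adjust to your own `.env` conventions):

```python
import logging
import os

def configure_logging(default_level: str = "INFO") -> logging.Logger:
    """Configure a shared scraper logger; level is read from LOG_LEVEL."""
    level_name = os.environ.get("LOG_LEVEL", default_level).upper()
    level = getattr(logging, level_name, logging.INFO)
    logger = logging.getLogger("scraper")
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers on re-calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

logger = configure_logging()
logger.info("scraper starting")
```

Reading the level from the environment keeps the same code usable in dev (DEBUG) and production (INFO) without edits.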
🌐 Requests
- ☐ Realistic User-Agent
- ☐ Accept, Accept-Language headers
- ☐ Referer header when needed
- ☐ A Session to maintain cookies
- ☐ Timeouts on every request
- ☐ Retry logic with exponential backoff
⏱️ Rate Limiting
- ☐ Delay between requests (1-5 seconds)
- ☐ Random delays to avoid predictable patterns
- ☐ Respect the Crawl-delay directive in robots.txt
- ☐ Limit concurrent connections
- ☐ Back off when you get 429/503 responses
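The three middle items (random delay, backoff on 429/503, no fixed pattern) combine naturally into one small helper. A minimal sketch (class and parameter names are my own, not from any library):

```python
import random
import time

class RateLimiter:
    """Random delay between requests plus exponential backoff on 429/503."""

    def __init__(self, min_delay=1.0, max_delay=5.0, max_backoff=60.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_backoff = max_backoff
        self.backoff = 0.0  # grows after throttling responses, resets on success

    def wait(self):
        """Sleep a randomized delay, plus any accumulated backoff."""
        time.sleep(random.uniform(self.min_delay, self.max_delay) + self.backoff)

    def record(self, status_code):
        """Double the backoff on 429/503; reset it on any other status."""
        if status_code in (429, 503):
            self.backoff = min(max(self.backoff, 1.0) * 2, self.max_backoff)
        else:
            self.backoff = 0.0
```

Call `wait()` before each request and `record(response.status_code)` after it; the randomized interval keeps the traffic pattern from being trivially fingerprintable.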
🔄 Proxy Management
- ☐ Rotating proxy pool
- ☐ Health checks for proxies
- ☐ Fallback when a proxy fails
- ☐ Appropriate geo-targeting
- ☐ Monitor proxy success rate
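Rotation plus failure-based fallback can be sketched like this. The class is illustrative (not a real library API), and the example proxy URLs are placeholders:

```python
import itertools

class ProxyPool:
    """Round-robin rotation over a proxy pool, with failure tracking.

    A proxy is retired after `max_failures` consecutive failures;
    a success resets its counter. `get()` raises when the pool is empty.
    """

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(list(proxies))

    def get(self):
        healthy = [p for p, n in self.failures.items() if n < self.max_failures]
        if not healthy:
            raise RuntimeError("no healthy proxies left")
        while True:
            proxy = next(self._cycle)
            if proxy in healthy:
                return proxy

    def mark_failure(self, proxy):
        self.failures[proxy] += 1

    def mark_success(self, proxy):
        self.failures[proxy] = 0
```

In a real deployment the failure counters double as the success-rate metric from the last item, and a periodic health check can call `mark_success`/`mark_failure` against a known-good URL.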
🛡️ Anti-Detection
- ☐ Browser fingerprint spoofing
- ☐ Headless detection bypass
- ☐ Consistent browser profile
- ☐ Human-like behavior (mouse, scroll)
- ☐ CAPTCHA handling strategy
📊 Data Quality
- ☐ Validate data before saving
- ☐ Handle missing fields gracefully
- ☐ Normalize and clean data
- ☐ Deduplicate records
- ☐ Type checking and conversion
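All five data-quality items fit in one pipeline function. A minimal sketch, assuming records with hypothetical `title` and `price` fields:

```python
def clean_records(raw_records, required=("title", "price")):
    """Validate, normalize, deduplicate, and type-convert scraped records.

    - drops records missing any required field (graceful, no crash)
    - strips whitespace; converts price strings like "$1,234.50" to float
    - deduplicates on the normalized (title, price) pair
    """
    seen = set()
    cleaned = []
    for rec in raw_records:
        if any(not rec.get(field) for field in required):
            continue  # missing field -> skip the record
        title = str(rec["title"]).strip()
        try:
            price = float(str(rec["price"]).replace("$", "").replace(",", ""))
        except ValueError:
            continue  # unparseable price -> drop rather than store garbage
        key = (title, price)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        cleaned.append({"title": title, "price": price})
    return cleaned
```

Dropping bad records instead of raising keeps one malformed page from killing a long crawl; log the drops so they show up in monitoring.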
💾 Storage
- ☐ Incremental saves (no data loss on a crash)
- ☐ Proper encoding (UTF-8)
- ☐ Timestamps on every record
- ☐ Source URL tracking
- ☐ Backup strategy
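Append-only JSON Lines covers the first four storage items in one function: each record is written and flushed immediately (crash-safe), in UTF-8, with a timestamp and source URL attached. A minimal sketch (field names `source_url`/`scraped_at` are my own choice):

```python
import json
from datetime import datetime, timezone

def append_record(path, record, source_url):
    """Append one record as a UTF-8 JSON line, stamped with time and source.

    Appending per record means a crash loses at most the line in flight,
    instead of an entire in-memory batch waiting for a final dump.
    """
    row = dict(record)
    row["source_url"] = source_url
    row["scraped_at"] = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

`ensure_ascii=False` keeps Vietnamese text readable in the file instead of escaping it to `\uXXXX` sequences.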
🔍 Monitoring
- ☐ Success/failure rate tracking
- ☐ Alerts when the error rate is high
- ☐ Website change detection
- ☐ Performance metrics
- ☐ A dashboard for visibility
📝 Logging
- ☐ Log levels (DEBUG, INFO, ERROR)
- ☐ Structured logging (JSON)
- ☐ Log rotation
- ☐ Request/response details
- ☐ Timing information
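Structured JSON logging plus rotation, using only the standard library. A minimal sketch; the field names in the JSON payload are my own choice:

```python
import json
import logging
from logging.handlers import RotatingFileHandler

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }, ensure_ascii=False)

def get_json_logger(path, name="scraper"):
    """File logger with rotation: 5 MB per file, 3 rotated backups kept."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(path, maxBytes=5_000_000, backupCount=3,
                                  encoding="utf-8")
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

One JSON object per line makes the logs directly queryable with `jq` or ingestible by a log aggregator; extend the dict with request URL, status, and timing to cover the last two checklist items.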
⚖️ Legal & Ethics
- ☐ Check robots.txt
- ☐ Read Terms of Service
- ☐ Respect copyright
- ☐ Handle PII carefully
- ☐ Don’t overload servers
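The robots.txt check (and the Crawl-delay item from the rate-limiting section) is covered by the standard library's `urllib.robotparser`. A minimal sketch, parsing already-fetched robots.txt content; the sample rules and user-agent string are made up:

```python
from urllib.robotparser import RobotFileParser

def parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content (normally fetched from <site>/robots.txt)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = parse_robots(SAMPLE)
# rp.can_fetch(user_agent, url) answers the allow/deny question;
# rp.crawl_delay(user_agent) returns the requested delay, or None if unset.
```

Checking `crawl_delay()` here and feeding it into your rate limiter keeps the legal checklist and the rate-limiting checklist consistent with each other.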
🧪 Testing
- ☐ Unit tests for parsing functions
- ☐ Mock HTTP responses
- ☐ Integration tests
- ☐ Selector validation tests
- ☐ CI/CD pipeline
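Unit-testing a parser against a mocked HTTP response looks like this. The sketch uses only the standard library; the `<h2 class="title">` selector and the injected `fetch` callable are hypothetical examples of the pattern, not a fixed API:

```python
import unittest
from unittest import mock
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data.strip())

def scrape_titles(fetch):
    """`fetch` is injected so tests can supply canned HTML, not real HTTP."""
    parser = TitleExtractor()
    parser.feed(fetch("https://example.com/products"))
    return parser.titles

class ScrapeTitlesTest(unittest.TestCase):
    def test_parses_titles_from_mocked_response(self):
        fake_fetch = mock.Mock(return_value='<h2 class="title">Widget</h2>')
        self.assertEqual(scrape_titles(fake_fetch), ["Widget"])
        fake_fetch.assert_called_once()
```

Injecting the fetch function is what makes the selector-validation tests fast and deterministic enough to run in CI on every commit.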
🚀 Production
- ☐ Docker container
- ☐ Environment-based config
- ☐ Graceful shutdown handling
- ☐ Health check endpoint
- ☐ Auto-restart on failure
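Graceful shutdown is the checklist item most often skipped; a minimal sketch of the flag-based pattern (class and function names are my own):

```python
import signal

class GracefulShutdown:
    """Set a flag on SIGINT/SIGTERM so the scrape loop can finish the
    current item and flush state before exiting."""

    def __init__(self):
        self.requested = False
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.requested = True

def run(urls, process, shutdown):
    """Process URLs until done or until shutdown is requested."""
    done = []
    for url in urls:
        if shutdown.requested:
            break  # stop cleanly between items, never mid-request
        done.append(process(url))
    return done
```

Docker sends SIGTERM on `docker stop`, so handling it is what turns "Docker container" plus "auto-restart on failure" into a loop that never loses in-flight data.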
Quick Reference Code

```python
# Complete request template (url and proxy below are placeholders --
# substitute your own target and proxy endpoint)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://example.com/products"        # placeholder target URL
proxy = "http://user:pass@proxy-host:8080"  # placeholder proxy endpoint

session = requests.Session()
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

response = session.get(
    url,
    headers={
        'User-Agent': 'Mozilla/5.0...',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'vi-VN,vi;q=0.9,en;q=0.8'
    },
    proxies={'http': proxy, 'https': proxy},
    timeout=30
)
```
VinaProxy + Best Practices
- Follow the checklist with reliable proxies
- Professional scraping infrastructure
- Only $0.5/GB
