10 Common Web Scraping Mistakes and How to Avoid Them
Learn from other people's mistakes. This article lists 10 common scraping mistakes and how to fix each one.
1. No Error Handling
# ❌ Bad
response = requests.get(url)
data = response.json()  # Crashes if not JSON

# ✅ Good
import json
import logging

import requests

try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()
except json.JSONDecodeError:
    # Catch this first: in requests >= 2.27 JSON errors also subclass RequestException
    logging.error("Invalid JSON response")
    data = None
except requests.RequestException as e:
    logging.error(f"Request failed: {e}")
    data = None
2. Hardcoded Selectors
# ❌ Bad - Breaks when class changes
price = soup.select_one('.sc-fzXfQW.bGYYPQ').text

# ✅ Good - More robust selectors
price = soup.select_one('[data-testid="price"]').text

# Or use multiple fallbacks
price = (soup.select_one('.price') or
         soup.select_one('[itemprop="price"]') or
         soup.select_one('.product-price'))
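One step further: a small helper (select_first here is an illustrative name, not a library function) tries selectors in priority order and logs when the primary one stops matching, so you notice site changes before the last fallback also breaks.

import logging

def select_first(soup, selectors):
    # Try each CSS selector in priority order, return the first match
    for css in selectors:
        el = soup.select_one(css)
        if el is not None:
            if css != selectors[0]:
                logging.warning(f"Primary selector failed, matched fallback: {css}")
            return el
    return None

price_el = select_first(soup, ['[data-testid="price"]', '[itemprop="price"]', '.price'])
price = price_el.get_text(strip=True) if price_el else None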
3. Not Setting a Timeout
# ❌ Bad - Can hang forever
response = requests.get(url)

# ✅ Good - Always set a timeout
response = requests.get(url, timeout=30)

# Or with separate connect and read timeouts
response = requests.get(url, timeout=(5, 30))
4. Ignoring robots.txt
# ❌ Bad - Scraping everything without checking

# ✅ Good - Check robots.txt first
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):
    scrape(url)
else:
    print(f"Blocked by robots.txt: {url}")
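On a multi-page crawl you don't want to re-download robots.txt for every URL. A minimal sketch of a per-domain cache (the RobotsCache class is illustrative, not from the original) that also exposes the site's Crawl-delay directive:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

class RobotsCache:
    # Fetch and parse robots.txt once per domain, then reuse the parser
    def __init__(self, user_agent='*'):
        self.user_agent = user_agent
        self._parsers = {}

    def _parser_for(self, url):
        domain = urlparse(url).netloc
        if domain not in self._parsers:
            rp = RobotFileParser()
            rp.set_url(f'https://{domain}/robots.txt')
            rp.read()
            self._parsers[domain] = rp
        return self._parsers[domain]

    def can_fetch(self, url):
        return self._parser_for(url).can_fetch(self.user_agent, url)

    def crawl_delay(self, url):
        # Returns the site's Crawl-delay in seconds, or None if unset
        return self._parser_for(url).crawl_delay(self.user_agent)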
5. Scraping Too Fast
# ❌ Bad - No delay, will get blocked
for url in urls:
    scrape(url)

# ✅ Good - Add delays
import time
import random

for url in urls:
    scrape(url)
    time.sleep(random.uniform(1, 3))  # Random 1-3 seconds
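If the loop mixes URLs from several sites, a flat sleep slows everything down and can still hit one domain back-to-back. A sketch of a per-domain minimum interval (DomainThrottle is an illustrative name):

import time
import random
from urllib.parse import urlparse

class DomainThrottle:
    # Enforce a randomized minimum interval between hits to the same domain
    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last_hit = {}

    def wait(self, url):
        domain = urlparse(url).netloc
        delay = random.uniform(self.min_delay, self.max_delay)
        elapsed = time.time() - self._last_hit.get(domain, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        self._last_hit[domain] = time.time()

throttle = DomainThrottle()
for url in urls:
    throttle.wait(url)  # sleeps only if this domain was hit too recently
    scrape(url)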
6. Not Using a Session
# ❌ Bad - New connection for every request
for url in urls:
    response = requests.get(url)

# ✅ Good - Reuse a session (connection pooling)
session = requests.Session()
for url in urls:
    response = session.get(url)
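A Session is also the place to attach retry behavior. A minimal sketch, assuming you want exponential backoff on transient errors, using requests' HTTPAdapter with urllib3's Retry; it pairs well with the timeouts from mistake 3:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                                     # at most 3 retries per request
    backoff_factor=1,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry on these status codes
)
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

response = session.get(url, timeout=(5, 30))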
7. Not Validating Data
# ❌ Bad - Save whatever we get
data = {'price': price_text}
save(data)

# ✅ Good - Validate before saving
def validate_product(data):
    required = ['name', 'price', 'url']
    for field in required:
        if not data.get(field):
            return False
    if data['price'] <= 0:
        return False
    return True

if validate_product(data):
    save(data)
else:
    logging.warning(f"Invalid data: {data}")
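Validation works best on typed values, not the raw selector text. A sketch of a tolerant price parser (parse_price is an illustrative helper; it assumes US-style separators like "$1,299.00"):

import re

def parse_price(price_text):
    # Strip currency symbols and whitespace, keep digits, dots and commas
    if not price_text:
        return None
    cleaned = re.sub(r'[^\d.,]', '', price_text)
    cleaned = cleaned.replace(',', '')  # assumes "," is a thousands separator
    try:
        return float(cleaned)
    except ValueError:
        return None

data = {'name': name, 'price': parse_price(price_text), 'url': url}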
8. Not Logging Enough
# ❌ Bad - No logging
scrape(url)

# ✅ Good - Comprehensive logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraper.log'
)

logging.info(f"Starting scrape: {url}")
try:
    data = scrape(url)
    logging.info(f"Success: {len(data)} items from {url}")
except Exception as e:
    logging.error(f"Failed: {url} - {e}")
9. Saving Data Only at the End
# ❌ Bad - Lose all data on a crash
all_data = []
for url in urls:
    data = scrape(url)
    all_data.append(data)
save(all_data)  # Crash before this line = lose everything

# ✅ Good - Save incrementally
for url in urls:
    data = scrape(url)
    save_to_db(data)  # Save immediately
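A minimal sketch of save_to_db with sqlite3 (the table name and schema are assumptions): keying on the URL makes the insert idempotent, so re-running the scraper after a crash simply resumes.

import sqlite3

conn = sqlite3.connect('products.db')
conn.execute('''CREATE TABLE IF NOT EXISTS products (
    url   TEXT PRIMARY KEY,
    name  TEXT,
    price REAL
)''')

def save_to_db(data):
    # INSERT OR REPLACE keyed on url: duplicates overwrite instead of piling up
    conn.execute(
        'INSERT OR REPLACE INTO products (url, name, price) VALUES (?, ?, ?)',
        (data['url'], data['name'], data['price']),
    )
    conn.commit()  # commit per item so a crash loses at most one record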
10. Not Testing Selectors
# ❌ Bad - Deploy and hope for the best

# ✅ Good - Test selectors first
import requests
from bs4 import BeautifulSoup

test_url = 'https://example.com/products'  # a known page to test against

def test_selectors():
    response = requests.get(test_url, timeout=30)
    soup = BeautifulSoup(response.text, 'lxml')
    assert soup.select('.product'), "Products not found"
    assert soup.select('.price'), "Prices not found"
    assert soup.select('.name'), "Names not found"
    print("All selectors working!")

test_selectors()
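Live tests fail whenever the network or the site hiccups. A variant that parses a saved copy of the page (the fixture path is illustrative) catches selector breakage offline and can run in CI:

from bs4 import BeautifulSoup

def test_selectors_offline(fixture_path='tests/fixtures/product_page.html'):
    # Parse a saved snapshot of the page so the test needs no network access
    with open(fixture_path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    assert soup.select('.product'), "Products not found"
    assert soup.select('.price'), "Prices not found"
    assert soup.select('.name'), "Names not found"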
Summary
| Mistake | Fix |
|---|---|
| No error handling | try/except everywhere |
| Fragile selectors | Use data-testid, fallbacks |
| No timeout | Always set a timeout |
| Ignoring robots.txt | Check before fetching |
| Too fast | Add random delays |
| No session reuse | requests.Session for pooling |
| No validation | Validate before saving |
| Not enough logging | Log every success and failure |
| Saving only at the end | Save each item immediately |
| Untested selectors | Test selectors before deploying |
VinaProxy + Avoiding These Mistakes
- Reliable proxies to help you avoid blocks
- A professional scraping setup
- From just $0.5/GB
