10 Lỗi Web Scraping Phổ Biến Và Cách Tránh


Learn from other people's mistakes. This article lists 10 common scraping mistakes and how to fix them.

1. No Error Handling

# ❌ Bad
response = requests.get(url)
data = response.json()  # Crashes if the response is not JSON

# ✅ Good
import json
import logging

import requests

try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()
# Catch JSONDecodeError first: in recent requests it also subclasses RequestException
except json.JSONDecodeError:
    logging.error("Invalid JSON response")
    data = None
except requests.RequestException as e:
    logging.error(f"Request failed: {e}")
    data = None

2. Hardcoded Selectors

# ❌ Bad - Breaks when an auto-generated class name changes
price = soup.select_one('.sc-fzXfQW.bGYYPQ').text

# ✅ Good - More robust selectors
price = soup.select_one('[data-testid="price"]').text

# Or use multiple fallbacks (select_one returns the element, or None)
price_el = (soup.select_one('.price') or
            soup.select_one('[itemprop="price"]') or
            soup.select_one('.product-price'))
price = price_el.text if price_el else None
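The fallback chain can be wrapped in a small helper. A sketch, assuming a `first_match` name of my own; it works with any object exposing `select_one()`, such as a BeautifulSoup soup:

```python
def first_match(soup, selectors, default=None):
    """Return the stripped text of the first matching selector, else default.

    Works with any object exposing select_one(), e.g. a BeautifulSoup soup.
    """
    for sel in selectors:
        el = soup.select_one(sel)
        if el is not None:
            return el.text.strip()
    return default

# Usage:
# price = first_match(soup, ['[data-testid="price"]', '.price', '.product-price'])
```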

3. Not Setting a Timeout

# ❌ Bad - Can hang forever
response = requests.get(url)

# ✅ Good - Always set timeout
response = requests.get(url, timeout=30)

# Or with separate connect and read timeouts
response = requests.get(url, timeout=(5, 30))

4. Ignoring robots.txt

# ❌ Bad - Scraping everything without checking

# ✅ Good - Check robots.txt first
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):
    scrape(url)
else:
    print(f"Blocked by robots.txt: {url}")
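`RobotFileParser` can also parse rules you already have in memory, and it exposes any `Crawl-delay` directive the site declares. A sketch with made-up sample rules:

```python
from urllib.robotparser import RobotFileParser

# Sample rules, parsed from a string instead of fetching robots.txt
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'https://example.com/page'))       # allowed path
print(rp.can_fetch('*', 'https://example.com/private/x'))  # disallowed path
print(rp.crawl_delay('*'))  # seconds the site asks you to wait
```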

5. Scraping Too Fast

# ❌ Bad - No delay, will get blocked
for url in urls:
    scrape(url)

# ✅ Good - Add delays
import time
import random

for url in urls:
    scrape(url)
    time.sleep(random.uniform(1, 3))  # Random 1-3 seconds
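The sleep pattern above can be packaged into a small reusable class; a sketch, where the class name and default delays are my own assumptions:

```python
import random
import time

class RateLimiter:
    """Enforce a random delay between consecutive calls to wait()."""

    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        if self._last is not None:
            target = random.uniform(self.min_delay, self.max_delay)
            elapsed = time.monotonic() - self._last
            if elapsed < target:
                time.sleep(target - elapsed)
        self._last = time.monotonic()

# Usage:
# limiter = RateLimiter(1, 3)
# for url in urls:
#     limiter.wait()
#     scrape(url)
```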

6. Not Using a Session

# ❌ Bad - Opens a new connection for every request
for url in urls:
    response = requests.get(url)

# ✅ Good - Reuse session (connection pooling)
session = requests.Session()
for url in urls:
    response = session.get(url)
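A session can also carry automatic retries via `requests`' transport adapters. A sketch, assuming a `make_session` helper name and illustrative retry settings:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff_factor=1.0):
    """Session with connection pooling plus automatic retries on 429/5xx."""
    session = requests.Session()
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Usage:
# session = make_session()
# for url in urls:
#     response = session.get(url, timeout=30)
```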

7. Not Validating Data

# ❌ Bad - Save whatever we get
data = {'price': price_text}
save(data)

# ✅ Good - Validate before saving
def validate_product(data):
    required = ['name', 'price', 'url']
    for field in required:
        if not data.get(field):
            return False
    # Check the type too: comparing a string price to 0 would raise TypeError
    if not isinstance(data['price'], (int, float)) or data['price'] <= 0:
        return False
    return True

if validate_product(data):
    save(data)
else:
    logging.warning(f"Invalid data: {data}")
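Scraped prices usually arrive as strings, so normalizing them before validation helps. A minimal sketch, assuming a `parse_price` name and US-style formatting only:

```python
import re

def parse_price(text):
    """Convert a price string like '$1,299.99' to a float, or None.

    Assumes US-style formatting (comma thousands separator, dot decimal);
    other locales would need their own handling.
    """
    if not text:
        return None
    cleaned = re.sub(r'[^\d.]', '', text)  # drop currency symbols and commas
    try:
        return float(cleaned)
    except ValueError:
        return None  # nothing numeric left, e.g. 'N/A'
```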

8. Not Logging Enough

# ❌ Bad - No logging
scrape(url)

# ✅ Good - Comprehensive logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraper.log'
)

logging.info(f"Starting scrape: {url}")
try:
    data = scrape(url)
    logging.info(f"Success: {len(data)} items from {url}")
except Exception as e:
    logging.error(f"Failed: {url} - {e}")

9. Saving Data Only at the End

# ❌ Bad - Lose all data on a crash
all_data = []
for url in urls:
    data = scrape(url)
    all_data.append(data)
save(all_data)  # Crash here = lose everything

# ✅ Good - Save incrementally
for url in urls:
    data = scrape(url)
    save_to_db(data)  # Save immediately
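One simple incremental sink is a JSON Lines file opened in append mode; a minimal sketch, where the function name and default path are my own assumptions:

```python
import json

def save_item(item, path='results.jsonl'):
    """Append one record per line; a crash only loses the current item."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Usage:
# for url in urls:
#     save_item(scrape(url))
```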

10. Not Testing Selectors

# ❌ Bad - Deploy and hope for the best

# ✅ Good - Test selectors first
import requests
from bs4 import BeautifulSoup

def test_selectors():
    response = requests.get(test_url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    assert soup.select('.product'), "Products not found"
    assert soup.select('.price'), "Prices not found"
    assert soup.select('.name'), "Names not found"
    
    print("All selectors working!")

test_selectors()

Summary

Mistake             Fix
No error handling   try/except everywhere
Fragile selectors   Use data-testid, fallbacks
No timeout          Always set a timeout
Too fast            Add random delays
No validation       Validate before saving

VinaProxy Helps You Avoid These Mistakes

  • Reliable proxies to avoid blocks
  • Professional scraping setup
  • Only $0.5/GB

Try It Now →