10 Lỗi Web Scraping Phổ Biến Và Cách Tránh


Learn from other people's mistakes. This article lists 10 common scraping mistakes and how to fix them.

1. No Error Handling

# ❌ Bad
response = requests.get(url)
data = response.json()  # Crashes if the response is not JSON

# ✅ Good
import json
import logging

import requests

try:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()
# Catch JSONDecodeError first: in recent requests it also subclasses RequestException
except json.JSONDecodeError:
    logging.error("Invalid JSON response")
    data = None
except requests.RequestException as e:
    logging.error(f"Request failed: {e}")
    data = None

2. Hardcoded Selectors

# ❌ Bad - Breaks when an auto-generated class name changes
price = soup.select_one('.sc-fzXfQW.bGYYPQ').text

# ✅ Good - More robust selectors
price = soup.select_one('[data-testid="price"]').text

# Or use multiple fallbacks (select_one returns the element, or None)
price_el = (soup.select_one('.price') or
            soup.select_one('[itemprop="price"]') or
            soup.select_one('.product-price'))
price = price_el.text if price_el else None
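The fallback chain can be wrapped in a small helper. A sketch, assuming a `first_match` name of my own; it works with any object exposing `select_one()`, such as a BeautifulSoup soup:

```python
def first_match(soup, selectors, default=None):
    """Return the stripped text of the first matching selector, else default.

    Works with any object exposing select_one(), e.g. a BeautifulSoup soup.
    """
    for sel in selectors:
        el = soup.select_one(sel)
        if el is not None:
            return el.text.strip()
    return default

# Usage:
# price = first_match(soup, ['[data-testid="price"]', '.price', '.product-price'])
```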

3. Not Setting a Timeout

# ❌ Bad - Can hang forever
response = requests.get(url)

# ✅ Good - Always set timeout
response = requests.get(url, timeout=30)

# Or with separate connect and read timeouts
response = requests.get(url, timeout=(5, 30))

4. Ignoring robots.txt

# ❌ Bad - Scraping everything without checking

# ✅ Good - Check robots.txt first
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

if rp.can_fetch('*', url):
    scrape(url)
else:
    print(f"Blocked by robots.txt: {url}")
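`RobotFileParser` can also parse rules you already have in memory, and it exposes any `Crawl-delay` directive the site declares. A sketch with made-up sample rules:

```python
from urllib.robotparser import RobotFileParser

# Sample rules, parsed from a string instead of fetching robots.txt
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch('*', 'https://example.com/page'))       # allowed path
print(rp.can_fetch('*', 'https://example.com/private/x'))  # disallowed path
print(rp.crawl_delay('*'))  # seconds the site asks you to wait
```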

5. Scraping Too Fast

# ❌ Bad - No delay, will get blocked
for url in urls:
    scrape(url)

# ✅ Good - Add delays
import time
import random

for url in urls:
    scrape(url)
    time.sleep(random.uniform(1, 3))  # Random 1-3 seconds
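The sleep pattern above can be packaged into a small reusable class; a sketch, where the class name and default delays are my own assumptions:

```python
import random
import time

class RateLimiter:
    """Enforce a random delay between consecutive calls to wait()."""

    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        if self._last is not None:
            target = random.uniform(self.min_delay, self.max_delay)
            elapsed = time.monotonic() - self._last
            if elapsed < target:
                time.sleep(target - elapsed)
        self._last = time.monotonic()

# Usage:
# limiter = RateLimiter(1, 3)
# for url in urls:
#     limiter.wait()
#     scrape(url)
```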

6. Not Using a Session

# ❌ Bad - Opens a new connection for every request
for url in urls:
    response = requests.get(url)

# ✅ Good - Reuse session (connection pooling)
session = requests.Session()
for url in urls:
    response = session.get(url)
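A session can also carry automatic retries via `requests`' transport adapters. A sketch, assuming a `make_session` helper name and illustrative retry settings:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff_factor=1.0):
    """Session with connection pooling plus automatic retries on 429/5xx."""
    session = requests.Session()
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# Usage:
# session = make_session()
# for url in urls:
#     response = session.get(url, timeout=30)
```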

7. Not Validating Data

# ❌ Bad - Save whatever we get
data = {'price': price_text}
save(data)

# ✅ Good - Validate before saving
def validate_product(data):
    required = ['name', 'price', 'url']
    for field in required:
        if not data.get(field):
            return False
    # Check the type too: comparing a string price to 0 would raise TypeError
    if not isinstance(data['price'], (int, float)) or data['price'] <= 0:
        return False
    return True

if validate_product(data):
    save(data)
else:
    logging.warning(f"Invalid data: {data}")
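Scraped prices usually arrive as strings, so normalizing them before validation helps. A minimal sketch, assuming a `parse_price` name and US-style formatting only:

```python
import re

def parse_price(text):
    """Convert a price string like '$1,299.99' to a float, or None.

    Assumes US-style formatting (comma thousands separator, dot decimal);
    other locales would need their own handling.
    """
    if not text:
        return None
    cleaned = re.sub(r'[^\d.]', '', text)  # drop currency symbols and commas
    try:
        return float(cleaned)
    except ValueError:
        return None  # nothing numeric left, e.g. 'N/A'
```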

8. Not Logging Enough

# ❌ Bad - No logging
scrape(url)

# ✅ Good - Comprehensive logging
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    filename='scraper.log'
)

logging.info(f"Starting scrape: {url}")
try:
    data = scrape(url)
    logging.info(f"Success: {len(data)} items from {url}")
except Exception as e:
    logging.error(f"Failed: {url} - {e}")

9. Saving Data Only at the End

# ❌ Bad - Lose all data on a crash
all_data = []
for url in urls:
    data = scrape(url)
    all_data.append(data)
save(all_data)  # Crash here = lose everything

# ✅ Good - Save incrementally
for url in urls:
    data = scrape(url)
    save_to_db(data)  # Save immediately
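One simple incremental sink is a JSON Lines file opened in append mode; a minimal sketch, where the function name and default path are my own assumptions:

```python
import json

def save_item(item, path='results.jsonl'):
    """Append one record per line; a crash only loses the current item."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Usage:
# for url in urls:
#     save_item(scrape(url))
```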

10. Not Testing Selectors

# ❌ Bad - Deploy and hope for the best

# ✅ Good - Test selectors first
import requests
from bs4 import BeautifulSoup

def test_selectors():
    response = requests.get(test_url)
    soup = BeautifulSoup(response.text, 'lxml')
    
    assert soup.select('.product'), "Products not found"
    assert soup.select('.price'), "Prices not found"
    assert soup.select('.name'), "Names not found"
    
    print("All selectors working!")

test_selectors()

Summary

Mistake             Fix
No error handling   try/except everywhere
Fragile selectors   Use data-testid, fallbacks
No timeout          Always set a timeout
Too fast            Add random delays
No validation       Validate before saving

VinaProxy Helps You Avoid These Mistakes

  • Reliable proxies to avoid blocks
  • Professional scraping setup
  • Only $0.5/GB

Try It Now →