Error Handling in Web Scraping: Handling Every Situation

Production scraping needs robust error handling. This article covers the most common failures and how to handle each one.

Common Errors

Network Errors

  • ConnectionError: could not connect to the server
  • Timeout: the request took too long
  • SSLError: certificate problems

HTTP Errors

  • 403: Forbidden (you are being blocked)
  • 404: Page not found
  • 429: Rate limited
  • 500: Server error
  • 503: Service unavailable

Parsing Errors

  • Element not found: wrong selector
  • AttributeError: calling .text on None
  • JSONDecodeError: Invalid JSON

Basic Error Handling

import time

import requests
from requests.exceptions import RequestException

def safe_request(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

Comprehensive Handler

import requests
from requests.exceptions import (
    ConnectionError, Timeout, SSLError,
    HTTPError, RequestException
)

def fetch_with_handling(url):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response
    
    except ConnectionError:
        print("Cannot connect to server")
    except Timeout:
        print("Request timed out")
    except SSLError:
        print("SSL certificate error")
    except HTTPError as e:
        if e.response.status_code == 403:
            print("Blocked! Try different proxy")
        elif e.response.status_code == 429:
            print("Rate limited! Slow down")
        elif e.response.status_code == 404:
            print("Page not found")
        else:
            print(f"HTTP error: {e.response.status_code}")
    except RequestException as e:
        print(f"Request failed: {e}")
    
    return None

Safe Parsing

from bs4 import BeautifulSoup

def safe_extract(soup, selector, attr=None, default=''):
    element = soup.select_one(selector)
    if not element:
        return default
    
    if attr:
        return element.get(attr, default)
    return element.text.strip()

# Usage
soup = BeautifulSoup(html, 'lxml')
title = safe_extract(soup, '.product-title')
price = safe_extract(soup, '.price', default='N/A')
url = safe_extract(soup, 'a.link', attr='href')

Logging Errors

import logging

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def scrape_url(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        logging.info(f"Success: {url}")
        return response
    except Exception as e:
        logging.error(f"Failed: {url} - {e}")
        return None

Best Practices

  • Always set timeouts
  • Log errors with context
  • Retry with exponential backoff
  • Save progress so you can resume
  • Alert when the error rate is high
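The "save progress" point can be sketched with a simple checkpoint file. PROGRESS_FILE, the helper names, and the URL list are all illustrative, not a fixed convention:

```python
import json
import os

PROGRESS_FILE = 'progress.json'  # illustrative checkpoint file name

def load_done():
    # Return the set of URLs finished in earlier runs, if any
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return set(json.load(f))
    return set()

def mark_done(done, url):
    # Record a finished URL so a restart can skip it
    done.add(url)
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(sorted(done), f)

done = load_done()
for url in ['https://example.com/p1', 'https://example.com/p2']:
    if url in done:
        continue  # already scraped in a previous run
    # ... scrape url here ...
    mark_done(done, url)
```

Writing the checkpoint after every URL is the simplest choice; for large jobs you could batch writes to cut disk I/O.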

VinaProxy – Fewer Errors

  • Reliable proxy infrastructure
  • Auto-retry on failures
  • Only $0.5/GB

Try It Now →