Error Handling in Web Scraping: Handling Every Situation
Production scraping needs robust error handling. This article covers the most common errors and how to handle each one.
Các Lỗi Thường Gặp
Network Errors
- ConnectionError: cannot reach the server
- Timeout: the request took too long
- SSLError: Certificate issues
HTTP Errors
- 403: Forbidden (you're blocked)
- 404: Page not found
- 429: Rate limited
- 500: Server error
- 503: Service unavailable
Parsing Errors
- Element not found: wrong selector
- AttributeError: calling .text on None
- JSONDecodeError: Invalid JSON
Basic Error Handling
import time

import requests
from requests.exceptions import RequestException

def safe_request(url, max_retries=3):
    """Fetch a URL, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller decide
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
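A quick usage sketch (the URL is just a placeholder). Since safe_request re-raises after the final attempt, callers should be ready to catch RequestException:

try:
    response = safe_request('https://example.com')  # placeholder URL
    print(response.status_code)
except RequestException as e:
    print(f"Giving up: {e}")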
Comprehensive Handler
import requests
from requests.exceptions import (
    ConnectionError, Timeout, SSLError,
    HTTPError, RequestException,
)

def fetch_with_handling(url):
    """Fetch a URL and report each failure mode separately."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response
    except HTTPError as e:
        if e.response.status_code == 403:
            print("Blocked! Try a different proxy")
        elif e.response.status_code == 429:
            print("Rate limited! Slow down")
        elif e.response.status_code == 404:
            print("Page not found")
        else:
            print(f"HTTP error: {e.response.status_code}")
    # SSLError subclasses ConnectionError in requests, so catch it first
    except SSLError:
        print("SSL certificate error")
    except Timeout:
        print("Request timed out")
    except ConnectionError:
        print("Cannot connect to server")
    except RequestException as e:
        print(f"Request failed: {e}")
    return None
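On a 429, many servers also send a Retry-After header saying how long to wait before retrying. A minimal sketch of honoring it; the 60-second fallback is an assumption for when the header is missing or is an HTTP date rather than a number of seconds:

import time

def wait_if_rate_limited(response, default_delay=60):
    """Sleep for the server-suggested delay on HTTP 429.
    default_delay is an assumed fallback, not part of any standard."""
    if response is not None and response.status_code == 429:
        retry_after = response.headers.get('Retry-After', '')
        delay = int(retry_after) if retry_after.isdigit() else default_delay
        print(f"Rate limited, sleeping {delay}s")
        time.sleep(delay)
        return True
    return False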
Safe Parsing
from bs4 import BeautifulSoup

def safe_extract(soup, selector, attr=None, default=''):
    """Return the first match's text (or an attribute), or a default."""
    element = soup.select_one(selector)
    if not element:
        return default
    if attr:
        return element.get(attr, default)
    return element.text.strip()

# Usage (html is the page source fetched earlier)
soup = BeautifulSoup(html, 'lxml')
title = safe_extract(soup, '.product-title')
price = safe_extract(soup, '.price', default='N/A')
url = safe_extract(soup, 'a.link', attr='href')
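The same defensive pattern covers the JSONDecodeError listed earlier; a minimal sketch for API responses. requests' JSONDecodeError subclasses ValueError, so one except clause covers both:

def safe_json(response, default=None):
    """Return the decoded JSON body, or default if it is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # includes requests.exceptions.JSONDecodeError
        return default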
Logging Errors
import logging

import requests

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def scrape_url(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # 4xx/5xx should count as failures too
        logging.info(f"Success: {url}")
        return response
    except Exception as e:
        logging.error(f"Failed: {url} - {e}")
        return None
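When you need the full traceback (for instance, to see exactly where a parse step blew up), logging.exception records it automatically; a small variant of the function above:

def scrape_url_verbose(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response
    except Exception:
        # Logs at ERROR level and appends the current traceback
        logging.exception(f"Failed: {url}")
        return None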
Best Practices
- Always set timeouts
- Log errors with context
- Retry with exponential backoff
- Save progress so you can resume
- Alert when the error rate is high (the sketch below combines this with progress saving)
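A minimal sketch of the last two practices combined; done_urls.txt, the 20% threshold, and the reuse of scrape_url from the logging section are all assumptions for illustration:

import logging
import os

DONE_FILE = 'done_urls.txt'   # assumed checkpoint file
ERROR_RATE_LIMIT = 0.2        # assumed alert threshold: 20% failures

def load_done():
    """Read the set of URLs already scraped in a previous run."""
    if not os.path.exists(DONE_FILE):
        return set()
    with open(DONE_FILE) as f:
        return set(line.strip() for line in f)

def scrape_all(urls):
    done = load_done()  # skip already-scraped URLs after a restart
    errors = attempts = 0
    for url in urls:
        if url in done:
            continue
        attempts += 1
        if scrape_url(url) is None:  # scrape_url from the logging section
            errors += 1
        else:
            with open(DONE_FILE, 'a') as f:  # record progress immediately
                f.write(url + '\n')
        if attempts >= 10 and errors / attempts > ERROR_RATE_LIMIT:
            logging.warning("Error rate above 20% - check for blocking")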
VinaProxy – Fewer Errors
- Reliable proxy infrastructure
- Auto-retry on failures
- From just $0.5/GB
