Debugging a Web Scraper: Find and Fix Errors Fast
Scraper not working? This article walks through how to debug and fix the most common errors.
Common Errors
1. Empty Results
# Problem: the selector matches nothing
products = soup.select('.product-item')
print(len(products)) # 0
# Debug steps:
# 1. Print raw HTML
print(response.text[:2000])
# 2. Check whether the page redirected
print(response.url)
# 3. Check status code
print(response.status_code)
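The three checks above can be folded into one helper that returns hints instead of printing. This is a minimal sketch using only the standard library; `diagnose_empty_result` and `ClassCounter` are hypothetical names, and it only looks for a single class name rather than a full CSS selector.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts elements whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute arrives as None, so fall back to "".
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def diagnose_empty_result(requested_url, final_url, status_code, html, class_name):
    """Return a list of human-readable hints about why nothing matched."""
    hints = []
    if status_code != 200:
        hints.append(f"unexpected status code {status_code}")
    if final_url != requested_url:
        hints.append(f"redirected to {final_url}")
    counter = ClassCounter(class_name)
    counter.feed(html)
    if counter.count == 0:
        hints.append(f"class '{class_name}' not present in the raw HTML")
    return hints
```

With requests, a call could look like `diagnose_empty_result(url, response.url, response.status_code, response.text, 'product-item')`. An empty list means the class is in the raw HTML, so the problem is more likely in the selector itself.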
2. Blocked/403 Error
# Solution 1: Add headers
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html...',
}
# Solution 2: Use proxy
proxies = {
    'http': 'http://proxy.vinaproxy.com:8080',
    'https': 'http://proxy.vinaproxy.com:8080',  # without this key, https:// URLs bypass the proxy
}
# Solution 3: Add delays
import time
time.sleep(2)
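A fixed delay can be replaced by retries with increasing delays. The sketch below assumes that 403 and 429 are the status codes signalling a block, which will not hold for every site. `fetch_with_backoff` is a hypothetical helper; it takes the fetch function as a parameter so it works with any HTTP client and can be tested without network access.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-blocked status or attempts run out.

    fetch is any callable returning an object with a status_code attribute,
    for example: lambda u: requests.get(u, headers=headers, timeout=10)
    """
    blocked = {403, 429}  # assumption: these codes mean "blocked or rate limited"
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in blocked:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: about 1s, 2s, 4s with the defaults.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    # Still blocked after the last attempt: hand the response back to the caller.
    return response
```

The jitter keeps several workers from retrying at the same instant. If the last response is still blocked, switching headers or proxy is the next thing to try.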
3. JavaScript Content Missing
# Problem: Data rendered by JS
# requests only fetches the static HTML
# Solution: use Selenium or Playwright
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector('.product')  # wait until JS has rendered the products
    html = page.content()
    browser.close()
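Before reaching for a browser, it helps to check whether the raw HTML really is an empty JS shell. The following is a rough heuristic, not a reliable detector: it flags pages that contain `<script>` tags but very little visible text. `looks_js_rendered` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class TextVsScript(HTMLParser):
    """Tallies visible text length and the number of <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_skipped = 0   # depth inside <script>/<style>
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self.in_skipped += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_skipped:
            self.in_skipped -= 1

    def handle_data(self, data):
        # Only count text that is outside script and style blocks.
        if not self.in_skipped:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Return True if the page has scripts but almost no visible text."""
    parser = TextVsScript()
    parser.feed(html)
    parser.close()
    return parser.script_tags > 0 and parser.text_chars < min_text_chars
```

If `looks_js_rendered(response.text)` returns True, the data is probably injected by JavaScript, and a browser-based fetch is worth trying.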
Debug Tools
Print Intermediate Results
import requests
response = requests.get(url)
print(f"Status: {response.status_code}")
print(f"URL: {response.url}")
print(f"Headers: {response.headers}")
print(f"Content length: {len(response.text)}")
print(f"First 500 chars:\n{response.text[:500]}")
Save HTML For Inspection
with open('debug.html', 'w', encoding='utf-8') as f:
    f.write(response.text)
# Open it in a browser to inspect
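When debugging several pages or runs, a single `debug.html` gets overwritten each time. A small sketch that writes each snapshot to its own timestamped file instead; `save_debug_html`, the `debug_html` directory, and the `label` parameter are hypothetical names.

```python
from datetime import datetime
from pathlib import Path


def save_debug_html(html, directory="debug_html", label="page"):
    """Write html to a timestamped file and return its path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Microseconds in the stamp keep rapid successive saves from colliding.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    path = out_dir / f"{label}-{stamp}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Calling `save_debug_html(response.text, label='listing')` keeps every snapshot, so a working response can be diffed against a failing one later.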
Breakpoints
import pdb
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
pdb.set_trace()  # Pause here to inspect response (breakpoint() also works on Python 3.7+)
soup = BeautifulSoup(response.text, 'lxml')
Logging
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.debug(f"Fetching: {url}")
logger.info(f"Found {len(products)} products")
logger.warning(f"No price found for {product_id}")
logger.error(f"Failed to fetch: {url}")
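For longer runs, it can be useful to keep the full DEBUG output in a file while showing only INFO and above on the console. This sketch uses the standard `logging` handlers; `build_scraper_logger` and the format string are assumptions, not part of the setup above.

```python
import logging


def build_scraper_logger(name, log_file, console_level=logging.INFO):
    """Return a logger that writes DEBUG+ to log_file and console_level+ to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate lines via the root logger
    if logger.handlers:       # already configured, so do not add handlers twice
        return logger

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(fmt)

    console = logging.StreamHandler()  # defaults to stderr
    console.setLevel(console_level)
    console.setFormatter(fmt)

    logger.addHandler(file_handler)
    logger.addHandler(console)
    return logger
```

Usage would be along the lines of `logger = build_scraper_logger('scraper', 'scraper.log')`, after which the `logger.debug(...)` and `logger.info(...)` calls work as before.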
Common Fixes Checklist
- ☐ Check the selector in browser DevTools
- ☐ Verify the URL is correct
- ☐ Add proper headers
- ☐ Check for redirects
- ☐ Try with different proxy
- ☐ Use browser automation if the page needs JS
VinaProxy – Debug Faster
- Test with many different IPs
- Isolate IP issues
- Priced at just $0.5/GB
