Handling Pagination in Web Scraping: Key Techniques
Most websites split their data across multiple pages. This article walks through handling pagination with the most common patterns.
Types of Pagination
1. Page Numbers
https://example.com/products?page=1
https://example.com/products?page=2
2. Offset-based
https://example.com/api/items?offset=0&limit=20
https://example.com/api/items?offset=20&limit=20
3. Cursor-based
https://example.com/api/items?cursor=abc123
https://example.com/api/items?cursor=def456
4. Infinite Scroll
New content loads as you scroll down – requires JavaScript handling.
Scrape Page Numbers
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
all_products = []

for page in range(1, 11):  # pages 1-10
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    products = soup.select('.product')
    if not products:
        break  # no more products
    all_products.extend(products)
    print(f"Page {page}: {len(products)} products")

print(f"Total: {len(all_products)}")
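When the total page count is not known in advance, the loop above can be factored into a reusable generator that stops at the first empty page. `scrape_pages` and its `fetch` parameter are illustrative names, not part of any library; in a real crawl, `fetch` would wrap the requests + BeautifulSoup call shown above.

```python
import time

def scrape_pages(fetch, delay=0.0, max_pages=1000):
    """Yield items page by page until a page comes back empty.

    `fetch` is any callable taking a page number and returning a list
    of items (e.g. a requests + BeautifulSoup wrapper).
    """
    for page in range(1, max_pages + 1):
        items = fetch(page)
        if not items:
            return  # empty page marks the end
        yield from items
        if delay:
            time.sleep(delay)  # be polite between page requests

# Demo with a fake fetcher standing in for the real HTTP call:
fake_site = {1: ['a', 'b'], 2: ['c']}
print(list(scrape_pages(lambda p: fake_site.get(p, []))))  # ['a', 'b', 'c']
```

Separating fetching from pagination logic also makes the loop easy to unit-test without hitting the network.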
Scrape Offset API
import requests

url = "https://example.com/api/products"
limit = 50
offset = 0
all_items = []

while True:
    response = requests.get(url, params={
        'offset': offset,
        'limit': limit
    })
    data = response.json()
    items = data.get('items', [])
    if not items:
        break
    all_items.extend(items)
    offset += limit
    print(f"Fetched {len(all_items)} items...")

print(f"Total: {len(all_items)}")
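A small refinement: if the API returns fewer items than `limit`, you already know it was the last page and can skip the extra request that would come back empty. The sketch below uses an injected `fake_fetch` in place of the real requests call; `paginate_offset` is an illustrative name, not a library function.

```python
def paginate_offset(fetch, limit=50):
    """Generator over an offset/limit API.

    `fetch(offset, limit)` stands in for the requests call above and
    must return the list of items for that window.
    """
    offset = 0
    while True:
        items = fetch(offset, limit)
        yield from items
        if len(items) < limit:
            return  # a short (or empty) page means nothing is left
        offset += limit

# Demo: 5 items, limit 2 -> windows of 2, 2, 1 (short page ends the loop)
data = list(range(5))
fetched = []
def fake_fetch(offset, limit):
    fetched.append(offset)
    return data[offset:offset + limit]

print(list(paginate_offset(fake_fetch, limit=2)))  # [0, 1, 2, 3, 4]
print(fetched)  # [0, 2, 4] -- one request fewer than stopping on empty
```

The empty-page check in the main example is still needed when the last page happens to be exactly `limit` items long; the short-page check simply saves a round trip in every other case.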
Scrape Cursor Pagination
import requests

url = "https://example.com/api/products"
cursor = None
all_items = []

while True:
    params = {'limit': 50}
    if cursor:
        params['cursor'] = cursor
    response = requests.get(url, params=params)
    data = response.json()
    all_items.extend(data['items'])
    cursor = data.get('next_cursor')
    if not cursor:
        break

print(f"Total: {len(all_items)}")
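One hazard with cursor pagination is a buggy or adversarial API that keeps handing back a cursor it has already issued, turning the loop above into an infinite one. A minimal sketch of a loop guard, with the HTTP call replaced by an injected `fetch` callable (names here are illustrative, not from any library):

```python
def paginate_cursor(fetch, limit=50):
    """Follow `next_cursor` tokens; guard against servers that loop.

    `fetch(cursor, limit)` stands in for the requests call above and
    must return a dict shaped like {'items': [...], 'next_cursor': ...}.
    """
    cursor, seen = None, set()
    while True:
        data = fetch(cursor, limit)
        yield from data.get('items', [])
        cursor = data.get('next_cursor')
        if not cursor or cursor in seen:
            return  # finished, or the API repeated a cursor we already used
        seen.add(cursor)

# Demo with a fake API whose last page points back at 'abc' forever:
pages = {
    None: {'items': [1, 2], 'next_cursor': 'abc'},
    'abc': {'items': [3], 'next_cursor': 'abc'},
}
print(list(paginate_cursor(lambda c, l: pages[c])))  # [1, 2, 3]
```

The `seen` set costs a little memory on very long crawls, but it turns a silent infinite loop into a clean stop.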
Handle Infinite Scroll
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")

    # Scroll until no new content loads
    prev_height = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # give lazy-loaded content time to appear
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == prev_height:
            break
        prev_height = new_height

    # Now extract all loaded products
    products = page.query_selector_all('.product')
    browser.close()
Tips
- Detect the end condition (empty results, repeated content)
- Add delays between page requests
- Save progress so you can resume if the crawl is interrupted
- Fetch pages in parallel if rate limits allow
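The "save progress" tip can be sketched with a small JSON checkpoint file that records the last finished page and the items collected so far. The file layout and helper names below are illustrative, not a standard:

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return (last_finished_page, items) from an earlier run, or a fresh start."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        return state['page'], state['items']
    return 0, []

def save_checkpoint(path, page, items):
    with open(path, 'w') as f:
        json.dump({'page': page, 'items': items}, f)

# Usage sketch: resume the page-number loop from where it stopped
path = os.path.join(tempfile.mkdtemp(), 'progress.json')
page, items = load_checkpoint(path)
print(page, items)              # 0 [] on a fresh run
save_checkpoint(path, 3, ['a', 'b'])
page, items = load_checkpoint(path)
print(page, items)              # 3 ['a', 'b'] after a restart
```

Call `save_checkpoint` at the end of each page iteration; on restart, begin the loop at `page + 1` instead of page 1.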
VinaProxy + Pagination
- Scrape many pages without getting blocked
- Rotate IPs for large crawls
- From just $0.5/GB
