Handling Pagination in Web Scraping: Techniques


Most websites split their data across multiple pages. This article walks through handling pagination with the most common patterns.

Types of Pagination

1. Page Numbers

https://example.com/products?page=1
https://example.com/products?page=2

2. Offset-based

https://example.com/api/items?offset=0&limit=20
https://example.com/api/items?offset=20&limit=20

3. Cursor-based

https://example.com/api/items?cursor=abc123
https://example.com/api/items?cursor=def456

4. Infinite Scroll

Content loads as you scroll down – requires JavaScript handling (a headless browser).

Scrape Page Numbers

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
all_products = []

for page in range(1, 11):  # Pages 1-10
    url = base_url.format(page)
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    
    products = soup.select('.product')
    if not products:
        break  # No more products
    
    all_products.extend(products)
    print(f"Page {page}: {len(products)} products")

print(f"Total: {len(all_products)}")
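If the page range is known up front and the site tolerates concurrent requests, the loop above can also be parallelized with a thread pool. A minimal sketch; `fetch_page` here is a hypothetical stand-in for the real request-and-parse step:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    # Placeholder for requests.get(...) + BeautifulSoup parsing,
    # so the sketch runs without network access.
    return [f"product-{page}-{i}" for i in range(3)]

# map() preserves page order even though fetches run concurrently
with ThreadPoolExecutor(max_workers=5) as pool:
    results = pool.map(fetch_page, range(1, 11))

all_products = [p for page_items in results for p in page_items]
print(f"Total: {len(all_products)}")
```

Keep `max_workers` small – hammering a site with many concurrent connections is the fastest way to get rate-limited or blocked.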

Scrape Offset API

import requests

url = "https://example.com/api/products"
limit = 50
offset = 0
all_items = []

while True:
    response = requests.get(url, params={
        'offset': offset,
        'limit': limit
    }, timeout=10)
    data = response.json()
    
    items = data.get('items', [])
    if not items:
        break
    
    all_items.extend(items)
    offset += limit
    
    print(f"Fetched {len(all_items)} items...")

print(f"Total: {len(all_items)}")

Scrape Cursor Pagination

import requests

url = "https://example.com/api/products"
cursor = None
all_items = []

while True:
    params = {'limit': 50}
    if cursor:
        params['cursor'] = cursor
    
    response = requests.get(url, params=params, timeout=10)
    data = response.json()
    
    all_items.extend(data['items'])
    
    cursor = data.get('next_cursor')
    if not cursor:
        break

print(f"Total: {len(all_items)}")

Handle Infinite Scroll

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/products")
    
    # Scroll until no new content
    prev_height = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == prev_height:
            break
        prev_height = new_height
    
    # Now extract all loaded products
    products = page.query_selector_all('.product')
    print(f"Loaded {len(products)} products")
    browser.close()

Tips

  • Detect the end condition (empty results, repeated content)
  • Add delays between page requests
  • Save progress so you can resume if interrupted
  • Fetch pages in parallel if you won't hit rate limits
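The "save progress" and "add delays" tips can be sketched with a small JSON checkpoint file. The file name `progress.json` and the placeholder per-page scraper are assumptions for illustration:

```python
import json
import os
import time

CHECKPOINT = "progress.json"  # hypothetical checkpoint file

def load_checkpoint():
    """Return (last completed page, items) from a previous run, if any."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
        return state["page"], state["items"]
    return 0, []

def save_checkpoint(page, items):
    with open(CHECKPOINT, "w") as f:
        json.dump({"page": page, "items": items}, f)

start_page, all_items = load_checkpoint()
for page in range(start_page + 1, 11):
    items = [f"item-{page}"]  # placeholder for the real per-page scrape
    all_items.extend(items)
    save_checkpoint(page, all_items)  # persist after every page
    time.sleep(0.2)  # polite delay between page requests

print(f"Resumed at page {start_page + 1}, total {len(all_items)} items")
```

If the script is killed mid-crawl, the next run picks up from the last saved page instead of refetching everything.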

VinaProxy + Pagination

  • Scrape many pages without getting blocked
  • Rotate IPs for large crawls
  • From just $0.5/GB

Try It Now →