Scrapy Framework: Large-Scale Web Scraping With Python

Scrapy is a mature, widely used web scraping framework for Python. It is built to crawl large sites, up to millions of pages, efficiently.

Why Use Scrapy?

Built-in async: crawls many pages concurrently
Middlewares: handle proxies, headers, and retries
Pipelines: clean and store data automatically
Shell: test selectors quickly
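To make the pipeline idea concrete, here is a minimal sketch of an item pipeline that cleans scraped fields. The class name PriceCleanerPipeline and the helper parse_price are hypothetical, and the price format (comma as thousands separator, dot as decimal separator) is an assumption rather than something a real site guarantees.

```python
import re


def parse_price(text):
    """Convert price text such as '$1,299.50' to a float, or None if unparseable.

    Assumes a comma thousands separator and a dot decimal separator.
    """
    if not text:
        return None
    # Keep only digits and dots, which drops currency symbols and commas.
    cleaned = re.sub(r"[^\d.]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


class PriceCleanerPipeline:
    """Hypothetical item pipeline: trims the name and normalizes the price."""

    def process_item(self, item, spider):
        # Scrapy calls process_item once for every item a spider yields.
        name = item.get("name")
        if name is not None:
            item["name"] = name.strip()
        item["price"] = parse_price(item.get("price"))
        return item
```

In a real project, a pipeline like this would be enabled through the ITEM_PIPELINES setting in settings.py, for example with an entry such as 'myproject.pipelines.PriceCleanerPipeline': 300 (the module path here is assumed).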

Installation

pip install scrapy

Creating a Project

scrapy startproject myproject
cd myproject
scrapy genspider products example.com

A Basic Spider

# spiders/products.py
import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']
    
    def parse(self, response):
        for product in response.css('.product-item'):
            yield {
                'name': product.css('.title::text').get(),
                'price': product.css('.price::text').get(),
                'url': product.css('a::attr(href)').get()
            }
        
        # Follow pagination
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Running the Spider

# Write items to a JSON file (-o appends to an existing file; use -O to overwrite it)
scrapy crawl products -o products.json

# Write items to a CSV file
scrapy crawl products -o products.csv

Configuring a Proxy

# settings.py
# HttpProxyMiddleware is enabled by default; listing it here only changes its order.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# In the spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            meta={'proxy': 'http://user:pass@proxy.vinaproxy.com:8080'}
        )
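Hardcoding credentials in a spider is easy to leak through version control. A possible alternative, sketched below, builds the proxy URL from environment variables and percent-encodes the credentials so that characters such as @ or : do not break the URL. The function build_proxy_url and the variable names PROXY_USER and PROXY_PASS are hypothetical.

```python
import os
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Return a proxy URL, percent-encoding credentials if both are given."""
    if user and password:
        credentials = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    else:
        credentials = ""
    return f"{scheme}://{credentials}{host}:{port}"


# PROXY_USER and PROXY_PASS are hypothetical environment variable names.
proxy_url = build_proxy_url(
    host="proxy.vinaproxy.com",
    port=8080,
    user=os.environ.get("PROXY_USER"),
    password=os.environ.get("PROXY_PASS"),
)
```

The resulting proxy_url could then be passed as meta={'proxy': proxy_url} in place of the literal string in the request above.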

Rotating Proxies

# Install scrapy-rotating-proxies
pip install scrapy-rotating-proxies

# settings.py
ROTATING_PROXY_LIST = [
    'http://proxy1.vinaproxy.com:8080',
    'http://proxy2.vinaproxy.com:8080',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
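To illustrate what rotation means at its simplest, the sketch below is a hand-written downloader middleware that assigns proxies in round-robin order. It is not how scrapy-rotating-proxies is implemented (that package also tracks banned and dead proxies); the class name RoundRobinProxyMiddleware is hypothetical, and the stand-in request objects exist only so the logic can be exercised without running a crawl.

```python
from itertools import cycle
from types import SimpleNamespace


class RoundRobinProxyMiddleware:
    """Hypothetical downloader middleware that cycles through a fixed proxy list."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Reuses the ROTATING_PROXY_LIST setting name from the example above.
        return cls(crawler.settings.getlist("ROTATING_PROXY_LIST"))

    def process_request(self, request, spider):
        # Respect a proxy set explicitly on the request; otherwise take the next one.
        if "proxy" not in request.meta:
            request.meta["proxy"] = next(self._proxies)
        return None  # returning None lets Scrapy continue handling the request


# Stand-in objects so the sketch can be exercised without running a crawl.
middleware = RoundRobinProxyMiddleware([
    "http://proxy1.vinaproxy.com:8080",
    "http://proxy2.vinaproxy.com:8080",
])
fake_requests = [SimpleNamespace(meta={}) for _ in range(3)]
for req in fake_requests:
    middleware.process_request(req, spider=None)
assigned = [req.meta["proxy"] for req in fake_requests]
print(assigned)
```

If a middleware like this were used in a project, it would need an order below 750 in DOWNLOADER_MIDDLEWARES so that it runs before the built-in HttpProxyMiddleware reads request.meta['proxy'].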

Scrapy vs Requests+BeautifulSoup

Feature	Scrapy	Requests+BS4
Scale	Millions of pages	Hundreds of pages
Speed	Async/parallel	Sequential (by default)
Setup	More involved	Simple
Learning curve	Takes time	Quick

Tips

Use scrapy shell to test selectors
Set DOWNLOAD_DELAY to reduce the risk of being banned
Use AutoThrottle for dynamic delays
Export data with Item Pipelines
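The delay-related tips above map to a handful of settings. The fragment below is a sketch with illustrative values, not recommendations; suitable numbers depend on the target site. With AutoThrottle enabled, DOWNLOAD_DELAY acts as the lower bound for the dynamically adjusted delay.

```python
# settings.py (illustrative values; tune them per target site)

# Fixed delay in seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Cap on simultaneous requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# AutoThrottle adjusts the delay based on observed response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
```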

VinaProxy + Scrapy

Rotating proxy pool for large crawls
Residential IPs to avoid detection
Priced at just $0.5/GB

