Playwright for Web Scraping: Better Than Selenium?
Playwright is a newer browser automation framework from Microsoft. This post compares it with Selenium and shows how to use it for web scraping.
Playwright vs Selenium
| Feature | Playwright | Selenium |
|---|---|---|
| Speed | Generally faster | Generally slower |
| Auto-wait | Yes (built in) | Manual (explicit waits) |
| Multi-browser | Chromium, Firefox, WebKit | Separate driver per browser |
| Headless | Default | Requires configuration |
| API | Modern, sync and async | Older, synchronous |
Installation
# Python
pip install playwright
playwright install  # downloads the browser binaries
# Node.js
npm install playwright
Python Example
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")

    # The locator auto-waits for the element before reading it
    title = page.locator("h1").text_content()
    print(title)

    # Screenshot
    page.screenshot(path="screenshot.png")
    browser.close()
Async Version
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")

        # Multiple elements
        links = await page.locator("a").all()
        for link in links:
            href = await link.get_attribute("href")
            print(href)

        await browser.close()

asyncio.run(main())
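Because the async API lets several pages load at the same time, the loop above can be extended to scrape a batch of URLs concurrently. The following is a minimal sketch, not a definitive implementation: the helper names (`absolute_links`, `scrape_links`, `scrape_many`) and the limit of three open pages are assumptions made for illustration.

```python
import asyncio
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping empty values and duplicates."""
    seen = []
    for href in hrefs:
        if not href:
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.append(url)
    return seen


async def scrape_links(browser, url, limit):
    """Open one page, collect its link targets, and close the page again."""
    async with limit:  # cap the number of pages open at once
        page = await browser.new_page()
        try:
            await page.goto(url)
            hrefs = []
            for link in await page.locator("a").all():
                hrefs.append(await link.get_attribute("href"))
            return absolute_links(url, hrefs)
        finally:
            await page.close()


async def scrape_many(urls, max_pages=3):
    # Imported here so absolute_links() stays usable without Playwright installed.
    from playwright.async_api import async_playwright

    limit = asyncio.Semaphore(max_pages)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            tasks = [scrape_links(browser, url, limit) for url in urls]
            return await asyncio.gather(*tasks)
        finally:
            await browser.close()
```

Calling `asyncio.run(scrape_many(["https://example.com"]))` returns one list of absolute links per input URL, in the same order as the input. The semaphore keeps memory use predictable when the URL list is long.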
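Playwright With a Proxy
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Route all browser traffic through an authenticated proxy
    browser = p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://proxy.vinaproxy.com:8080",
            "username": "user",  # placeholder credentials
            "password": "pass",
        },
    )
    browser.close()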
Stealth Mode
# Install playwright-stealth
pip install playwright-stealth

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # 1.x API; newer releases may expose a different entry point

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    stealth_sync(page)  # Apply stealth before navigating
    page.goto("https://protected-site.com")
    browser.close()
Playwright Advantages
- Auto-wait: explicit waits are rarely needed
- Network interception: block images, modify requests
- Multiple contexts: parallel browsing in isolated sessions
- Tracing: debug with the trace viewer
- Video recording: record sessions
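Two of the features listed above, network interception and tracing, can be combined in a short sketch. The set of blocked resource types, the helper names (`should_block`, `handle_route`, `scrape_title`), and the `trace.zip` path are assumptions chosen for illustration; adjust them to the target site.

```python
BLOCKED_TYPES = {"image", "media", "font"}


def should_block(resource_type, blocked=BLOCKED_TYPES):
    """Return True for resource types the scraper does not need."""
    return resource_type in blocked


def handle_route(route):
    """Abort unwanted requests and let everything else through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()


def scrape_title(url, trace_path="trace.zip"):
    # Imported here so should_block() stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        # Record screenshots and DOM snapshots for later debugging
        context.tracing.start(screenshots=True, snapshots=True)
        page = context.new_page()
        page.route("**/*", handle_route)  # intercept every request
        try:
            page.goto(url)
            return page.locator("h1").text_content()
        finally:
            # Inspect afterwards with: playwright show-trace trace.zip
            context.tracing.stop(path=trace_path)
            browser.close()
```

Skipping images, media, and fonts usually reduces bandwidth, which matters when proxy traffic is billed per GB. The trace is written even if navigation fails, so a failed run can still be inspected.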
When to Use Playwright?
- JavaScript-heavy websites
- SPAs (React, Vue, Angular)
- Sites that require login or interaction
- When Selenium is too slow
VinaProxy + Playwright
- Bypass geo-restrictions
- Rotate IPs for large-scale scraping
- Priced at just $0.5/GB
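IP rotation can be sketched as a round-robin over a list of proxy URLs, launching one browser per request with the next proxy. This is only an illustration under stated assumptions: the helper names are hypothetical, the proxy URLs are supplied by the caller in `scheme://user:pass@host:port` form, and nothing here reflects how VinaProxy itself exposes rotating endpoints.

```python
from itertools import cycle
from urllib.parse import urlsplit


def to_playwright_proxy(proxy_url):
    """Turn 'http://user:pass@host:port' into Playwright's proxy dict."""
    parts = urlsplit(proxy_url)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    settings = {"server": server}
    if parts.username:
        settings["username"] = parts.username
    if parts.password:
        settings["password"] = parts.password
    return settings


def assign_proxies(urls, proxy_urls):
    """Pair each target URL with the next proxy in round-robin order."""
    return list(zip(urls, cycle(proxy_urls)))


def scrape_with_rotation(urls, proxy_urls):
    # Imported here so the helpers above stay usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    titles = {}
    with sync_playwright() as p:
        for url, proxy_url in assign_proxies(urls, proxy_urls):
            # A fresh browser per URL so each request uses its assigned proxy
            browser = p.chromium.launch(proxy=to_playwright_proxy(proxy_url))
            try:
                page = browser.new_page()
                page.goto(url)
                titles[url] = page.title()
            finally:
                browser.close()
    return titles
```

Launching a browser per URL is simple but slow; for larger jobs, reusing one browser per proxy and visiting several URLs through it is a reasonable refinement.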
