Cookies and Sessions in Web Scraping
Many websites need cookies or sessions to work properly. This article shows how to manage cookies while scraping.
Why Do Cookies Matter?
- Authentication: keep the login state across requests
- Anti-bot: sites use cookies to track visitor behavior
- Personalization: location, language
- Cart/Wishlist: e-commerce sessions
Python Requests Sessions
import requests
# Create a session; it manages cookies automatically
session = requests.Session()
# First request: the server sets cookies
session.get('https://example.com')
# Later requests: the stored cookies are sent automatically
response = session.get('https://example.com/products')
# Inspect the cookies
print(session.cookies.get_dict())
Logging In With a Session
import requests
session = requests.Session()
# Get login page (CSRF token)
login_page = session.get('https://example.com/login')
# Extract CSRF token if needed
# csrf = extract_csrf(login_page.text)
# Login
login_data = {
    'username': 'user@email.com',
    'password': 'password123',
    # 'csrf_token': csrf
}
response = session.post('https://example.com/login', data=login_data)
# Protected pages are now accessible
profile = session.get('https://example.com/profile')
print(profile.text)
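The login snippet above leaves `extract_csrf` commented out. A minimal sketch of such a helper, using only the standard library, might look like the following. It assumes the token sits in a hidden `<input>` whose name is `csrf_token` (the field name used above); real sites often use a different name or put the token in a meta tag or a cookie, so treat the field name and the sample HTML as placeholders.

```python
from html.parser import HTMLParser


class _HiddenInputParser(HTMLParser):
    """Collects the name/value pairs of every <input type="hidden"> tag."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def extract_csrf(html, field_name="csrf_token"):
    """Return the value of the hidden input named field_name, or None if absent."""
    parser = _HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields.get(field_name)


# Hand-written sample page (assumption: the token is a hidden form field)
sample_html = '<form method="post"><input type="hidden" name="csrf_token" value="t0k3n"></form>'
print(extract_csrf(sample_html))
```

With this in place, `csrf = extract_csrf(login_page.text)` can be uncommented and the `'csrf_token': csrf` entry added to `login_data`.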
Setting Cookies Manually
import requests
session = requests.Session()
# Set specific cookies
session.cookies.set('session_id', 'abc123')
session.cookies.set('user_token', 'xyz789')
# Or pass cookies from a dict for a single request
cookies = {
    'session_id': 'abc123',
    'language': 'vi',
    'currency': 'VND'
}
url = 'https://example.com'  # placeholder target URL
response = session.get(url, cookies=cookies)
Export/Import Cookies
import requests
import pickle
session = requests.Session()
session.get('https://example.com')
# Save cookies
with open('cookies.pkl', 'wb') as f:
    pickle.dump(session.cookies, f)
# Load cookies later
with open('cookies.pkl', 'rb') as f:
    session.cookies.update(pickle.load(f))
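Pickle works, but loading a pickle file from an untrusted source can execute arbitrary code, and the file is not human-readable. One possible alternative is to store a plain name-to-value mapping as JSON. The sketch below is stdlib-only; the helper names `save_cookie_dict` and `load_cookie_dict` and the file name are made up for illustration. Note the trade-off: a flat dict drops each cookie's domain, path, and expiry.

```python
import json
import os
import tempfile


def save_cookie_dict(cookie_dict, path):
    """Write a {name: value} cookie mapping to path as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cookie_dict, f)


def load_cookie_dict(path):
    """Read a {name: value} cookie mapping back; return {} if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round trip with placeholder values
cookie_file = os.path.join(tempfile.gettempdir(), "cookies_example.json")
save_cookie_dict({"session_id": "abc123", "language": "vi"}, cookie_file)
restored = load_cookie_dict(cookie_file)
print(restored)
```

With a requests session, this would be used as `save_cookie_dict(session.cookies.get_dict(), cookie_file)` to save and `session.cookies.update(load_cookie_dict(cookie_file))` to restore.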
Selenium Cookies
from selenium import webdriver
import json
driver = webdriver.Chrome()
driver.get('https://example.com')
# Get all cookies
cookies = driver.get_cookies()
print(cookies)
# Save cookies
with open('cookies.json', 'w') as f:
    json.dump(cookies, f)
# Load cookies (the browser must already be on the cookies' domain)
with open('cookies.json', 'r') as f:
    cookies = json.load(f)
for cookie in cookies:
    driver.add_cookie(cookie)
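A common follow-up is to log in with Selenium and then continue with a lighter requests session. The sketch below converts the list returned by `driver.get_cookies()` into a name-to-value dict and skips cookies whose `expiry` timestamp has passed. The function name `selenium_cookies_to_dict` is hypothetical, and the sample list is hand-written in the shape Selenium returns rather than captured from a real browser.

```python
import time


def selenium_cookies_to_dict(selenium_cookies, now=None):
    """Convert driver.get_cookies() output into {name: value}, skipping expired cookies."""
    now = time.time() if now is None else now
    result = {}
    for cookie in selenium_cookies:
        expiry = cookie.get("expiry")  # Unix timestamp; absent for session cookies
        if expiry is not None and expiry <= now:
            continue
        result[cookie["name"]] = cookie["value"]
    return result


# Hand-written sample (values are placeholders)
sample = [
    {"name": "session_id", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "old_token", "value": "expired", "domain": "example.com", "path": "/", "expiry": 1},
]
print(selenium_cookies_to_dict(sample))
```

The result can then be loaded into a requests session with `session.cookies.update(selenium_cookies_to_dict(driver.get_cookies()))`.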
Cookie Security Notes
- HttpOnly: not accessible from JavaScript
- Secure: sent only over HTTPS
- SameSite: restricts cross-site sending, which helps against CSRF
- Expiry: check the cookie lifetime
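To see which of these attributes a site actually sets, the `Set-Cookie` header can be parsed with the standard library's `http.cookies.SimpleCookie`. This is a rough sketch, assuming Python 3.8 or newer (needed for `samesite` support); the function name `describe_set_cookie` and the sample header are invented for illustration.

```python
from http.cookies import SimpleCookie


def describe_set_cookie(header_value):
    """Parse one Set-Cookie header value and report security attributes per cookie."""
    jar = SimpleCookie()
    jar.load(header_value)
    report = {}
    for name, morsel in jar.items():
        report[name] = {
            "http_only": bool(morsel["httponly"]),
            "secure": bool(morsel["secure"]),
            "same_site": morsel["samesite"] or None,
            "max_age": int(morsel["max-age"]) if morsel["max-age"] else None,
        }
    return report


# Sample header value (placeholder, not taken from a real site)
header = "session_id=abc123; Path=/; HttpOnly; Secure; SameSite=Lax; Max-Age=3600"
print(describe_set_cookie(header))
```

A scraper only sees HttpOnly cookies at the HTTP layer, which is why a requests session can still send them even though page JavaScript cannot read them.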
Best Practices
- Use a Session object instead of standalone requests
- Save cookies so sessions can be resumed
- Respect cookie expiry
- Rotate sessions when blocked
VinaProxy + Session Management
- Sticky sessions that keep the same IP
- Rotating sessions when needed
- Priced at just $0.5/GB
