Proxy Security: Best Practices for Safe Web Scraping
Using a proxy safely takes deliberate attention to security. This article walks through ten practices for protecting proxy credentials, traffic, and scraped data.
1. Protect Credentials
# ❌ Bad: hardcoded in source
proxy = 'http://user123:secretpass@proxy.vinaproxy.com:8080'
# ✅ Good: environment variables
import os
proxy = os.getenv('PROXY_URL')
# .env file (do NOT commit to git)
PROXY_URL=http://user123:secretpass@proxy.vinaproxy.com:8080
# .gitignore
.env
*.env
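Since `os.getenv` returns `None` silently when the variable is missing, it can help to fail fast at startup. The following is a minimal sketch, assuming the variable is named `PROXY_URL` as above; the helper name `load_proxies` and the list of accepted schemes are illustrative assumptions, not part of any library.

```python
import os
from urllib.parse import urlsplit


def load_proxies(env_var="PROXY_URL"):
    """Read the proxy URL from the environment and fail fast if it is missing."""
    proxy_url = os.environ.get(env_var)
    if not proxy_url:
        raise RuntimeError(f"{env_var} is not set; refusing to start without a proxy")

    parts = urlsplit(proxy_url)
    # The accepted schemes are an assumption; adjust to what your provider offers.
    if parts.scheme not in ("http", "https", "socks5", "socks5h") or not parts.hostname:
        raise ValueError(f"{env_var} does not look like a proxy URL")

    # requests expects one entry per target scheme.
    return {"http": proxy_url, "https": proxy_url}
```

The returned dictionary can be passed directly as the `proxies` argument of `requests.get`. The error messages name only the variable, never its value, so the credentials do not leak into tracebacks.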
2. Rotate Credentials Regularly
- Change passwords every 30-90 days
- Revoke credentials that are no longer in use
- Create separate credentials for each project
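The rotation schedule above can be checked automatically. This is a small sketch, assuming you record when each credential was issued; `needs_rotation` and its `created_at` parameter are hypothetical names, and the 90-day default is simply the upper bound from the list.

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(created_at, max_age_days=90, now=None):
    """Return True when a credential is older than max_age_days.

    created_at must be a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return now - created_at > timedelta(days=max_age_days)
```

A scheduled job could call this for every stored credential and raise an alert for each one that is overdue.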
3. Restrict Access with an IP Whitelist
# Many proxy providers support IP whitelisting
# Only your own server IPs can then use the proxy
# Example: allow only
# - 203.0.113.10 (production server)
# - 203.0.113.20 (staging server)
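The whitelist itself is configured on the provider's side, but the same list can be mirrored in code, for example to check a deployment host before it starts scraping. This is a sketch using the two example addresses above; `ALLOWED_IPS` and `is_whitelisted` are hypothetical names.

```python
import ipaddress

# Example addresses from the whitelist above (documentation range).
ALLOWED_IPS = {
    ipaddress.ip_address("203.0.113.10"),  # production server
    ipaddress.ip_address("203.0.113.20"),  # staging server
}


def is_whitelisted(ip):
    """Return True if ip is one of the allowed server addresses."""
    try:
        return ipaddress.ip_address(ip) in ALLOWED_IPS
    except ValueError:
        # Not a valid IP address at all.
        return False
```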
4. HTTPS Everywhere
import requests
proxies = {'http': proxy, 'https': proxy}
# ✅ Always use HTTPS for target URLs
response = requests.get('https://example.com', proxies=proxies)
# The connection to the proxy itself should be encrypted too
# Many providers offer an HTTPS proxy endpoint
5. Do Not Log Sensitive Data
import logging
import re
# ❌ Bad: log the full proxy URL
logging.info(f"Using proxy: {proxy_url}")
# ✅ Good: mask credentials
def mask_proxy(url):
    # Replace the user:password part of the URL with asterisks
    return re.sub(r'://[^:]+:[^@]+@', '://***:***@', url)
logging.info(f"Using proxy: {mask_proxy(proxy_url)}")
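Masking at every call site is easy to forget, so the same substitution can be applied centrally with a `logging.Filter`. This is a sketch under the assumption that credentials only appear in the `user:password@` form; the class name `RedactProxyCredentials` is hypothetical.

```python
import logging
import re

# Matches the "user:password@" part that follows "://" in a URL.
_CREDENTIALS = re.compile(r"://[^/\s:@]+:[^/\s@]+@")


class RedactProxyCredentials(logging.Filter):
    """Mask user:password pairs in any URL that reaches the log."""

    def filter(self, record):
        # Format first so credentials passed as arguments are covered too.
        record.msg = _CREDENTIALS.sub("://***:***@", record.getMessage())
        record.args = None
        return True
```

Attaching the filter to a handler, for example `handler.addFilter(RedactProxyCredentials())`, covers every logger that writes through that handler, including third-party libraries.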
6. Validate SSL Certificates
# ❌ Bad: disable SSL verification
response = requests.get(url, verify=False) # DANGEROUS
# ✅ Good: proper certificate handling
response = requests.get(url, verify=True) # Default
# If you need a custom CA:
response = requests.get(url, verify='/path/to/ca-bundle.crt')
7. Handle Data Securely
# Encrypt scraped data at rest
import os
from cryptography.fernet import Fernet
# The key must be a url-safe base64-encoded 32-byte key, e.g. from Fernet.generate_key()
key = os.getenv('ENCRYPTION_KEY')
cipher = Fernet(key)
# Encrypt before saving
encrypted = cipher.encrypt(scraped_data.encode())
save_to_db(encrypted)
# Decrypt when needed
decrypted = cipher.decrypt(encrypted).decode()
8. Network Isolation
# Docker with a dedicated network
# docker-compose.yml
services:
  scraper:
    networks:
      - scraper-net
    environment:
      - PROXY_URL=${PROXY_URL}
networks:
  scraper-net:
    driver: bridge
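9. Audit Logging
import logging
from datetime import datetime, timezone
def audit_log(action, url, status):
    # datetime.utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
    logging.info(f"{datetime.now(timezone.utc).isoformat()} | {action} | {url} | {status}")
# Track every request
audit_log('SCRAPE', url, response.status_code)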
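Audit entries can themselves leak secrets when the logged URL carries credentials or tokens in its query string. The sketch below strips those parts and emits one JSON line per request so the log is easy to parse later; the logger name `scraper.audit` and the helpers `sanitize_url` and `audit_entry` are hypothetical.

```python
import json
import logging
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit

audit_logger = logging.getLogger("scraper.audit")


def sanitize_url(url):
    """Drop userinfo, query string, and fragment before the URL is logged."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


def audit_entry(action, url, status):
    """Build one JSON line describing a request."""
    return json.dumps({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "url": sanitize_url(url),
        "status": status,
    })


def audit_log(action, url, status):
    """Write the audit entry through the dedicated audit logger."""
    audit_logger.info(audit_entry(action, url, status))
```

Using a dedicated logger lets the audit trail be routed to its own file or handler, separate from debug output.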
10. Rate Limiting
import requests
from ratelimit import limits, sleep_and_retry
# Max 100 requests per minute
@sleep_and_retry
@limits(calls=100, period=60)
def safe_request(url, proxies):
    # A timeout keeps a stalled proxy from blocking the scraper indefinitely
    return requests.get(url, proxies=proxies, timeout=30)
# Avoid abusing the proxy and the target site
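If adding a dependency is not an option, a similar limit can be sketched with the standard library. The class below is a sliding-window limiter, which is a swapped-in technique and not a description of how the `ratelimit` package works internally; `SlidingWindowLimiter` is a hypothetical name, and the sketch is not thread-safe.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `calls` acquisitions in any `period`-second window."""

    def __init__(self, calls=100, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.period = period
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._stamps = deque() # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.calls:
                self._stamps.append(now)
                return
            # Wait until the oldest recorded call expires.
            self._sleep(self.period - (now - self._stamps[0]))
```

Calling `limiter.acquire()` immediately before each `requests.get` keeps the scraper under the configured rate.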
Security Checklist
□ Credentials in env vars, not hardcoded
□ .env listed in .gitignore
□ IP whitelist enabled (if available)
□ HTTPS for all requests
□ SSL verification ON
□ Logs contain no credentials
□ Data encrypted at rest
□ Regular credential rotation
□ Audit logging enabled
□ Rate limiting configured
Handling a Breach
- Revoke the credentials immediately
- Generate new credentials
- Review access logs
- Update all applications
- Check for unauthorized usage
VinaProxy Security
- Secure authentication
- IP whitelist support
- Usage monitoring
- Priced at just $0.5/GB
