Optimizing Proxy Bandwidth: Cutting Scraping Costs
Proxies are billed per GB. This article walks through bandwidth optimizations that directly reduce your scraping bill.
Why Does Bandwidth Matter?
- Residential proxies are billed per GB
- 100 GB × $0.5 = $50/month
- Cutting usage by 50% saves $25/month
1. Fetch Only the Data You Need
# ❌ Bad: download the whole page just for one field
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
price = soup.select_one('.price').text

# ✅ Good: use an API when one exists (far less data)
response = requests.get(api_url)
price = response.json()['price']
2. Block Resources You Don't Need
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Block images, CSS, and fonts
    page.route('**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}',
               lambda route: route.abort())

    # Block tracking scripts
    page.route('**/analytics*', lambda route: route.abort())
    page.route('**/tracking*', lambda route: route.abort())

    page.goto(url)
    # Only the HTML and the JS you actually need get downloaded
3. HEAD Request First
# Check whether the page has changed before downloading it
def fetch_if_changed(url, last_etag=None):
    # A HEAD request transfers almost no data
    head = requests.head(url, proxies=proxies)
    current_etag = head.headers.get('ETag')
    if current_etag == last_etag:
        print("Unchanged, skipping")
        return None, current_etag
    # Only GET when the content is actually new
    response = requests.get(url, proxies=proxies)
    return response, current_etag
4. Compression
# Request a compressed response
headers = {
    'Accept-Encoding': 'gzip, deflate, br'
}
response = requests.get(url, headers=headers, proxies=proxies)
# requests decompresses automatically, but only the compressed
# bytes cross the wire, and that is what counts as bandwidth
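As a quick illustration of why this matters, the snippet below (standard library only, no network) gzips a repetitive HTML fragment and prints the size ratio; real HTML typically compresses to a fraction of its original size.

```python
import gzip

# A repetitive HTML fragment, standing in for a real product page
html = ('<div class="product"><span class="price">19.99</span></div>\n' * 500).encode()

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(f'Original: {len(html)} bytes, gzipped: {len(compressed)} bytes '
      f'({ratio:.0%} of original)')
```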
5. Caching
import hashlib
import os
import time

CACHE_DIR = 'cache'
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_request(url, proxies, max_age=3600):
    # Check the cache first
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = f"{CACHE_DIR}/{cache_key}"
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, 'r') as f:
                return f.read()
    # Only hit the proxy on a cache miss
    response = requests.get(url, proxies=proxies)
    with open(cache_file, 'w') as f:
        f.write(response.text)
    return response.text
6. Sitemap Instead of Crawling
# ❌ Bad: crawl the entire site to discover URLs
# (wastes bandwidth on navigation pages)

# ✅ Good: parse the sitemap (a single request)
sitemap = requests.get('https://site.com/sitemap.xml', proxies=proxies)
urls = extract_urls(sitemap.text)
# Then scrape only the product pages
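The `extract_urls` helper above is assumed; a minimal standard-library sketch of it could look like this (sitemap documents use the sitemaps.org XML namespace):

```python
# A minimal sketch of an extract_urls helper using only the stdlib.
import xml.etree.ElementTree as ET

# Standard sitemap namespace, in ElementTree's {uri}tag notation
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_urls(sitemap_xml):
    """Return all <loc> values from a sitemap document, in order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f'{SITEMAP_NS}loc')]
```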
7. Selective Field Extraction
# Some APIs let you choose exactly which fields to return
# GraphQL example:
query = '''
query {
  products {
    name
    price
  }
}
'''
# Fetches only name and price, not the long description fields
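A GraphQL query like this is usually sent as a JSON POST body. A minimal sketch, where `api_url` and the exact response shape are assumptions about the target API:

```python
# Build the JSON body for a GraphQL POST request (hypothetical endpoint).
def build_graphql_payload(query, variables=None):
    """Wrap a GraphQL query (and optional variables) in the standard body."""
    payload = {'query': query}
    if variables:
        payload['variables'] = variables
    return payload

# Sending it (api_url is an assumption):
# response = requests.post(api_url, json=build_graphql_payload(query),
#                          proxies=proxies)
# products = response.json()['data']['products']
```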
8. Skip What You Already Scraped
scraped_urls = load_scraped_list()

for url in urls:
    if url in scraped_urls:
        continue  # zero bandwidth
    response = requests.get(url, proxies=proxies)
    save_scraped_url(url)
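`load_scraped_list` and `save_scraped_url` above are assumed helpers; one minimal sketch persists URLs one per line in a plain-text file (the filename is an assumption):

```python
# One possible implementation of the helpers above: a plain-text file
# with one URL per line, appended to as pages are scraped.
import os

SCRAPED_FILE = 'scraped_urls.txt'  # assumed path

def load_scraped_list():
    """Return the set of already-scraped URLs (empty set on first run)."""
    if not os.path.exists(SCRAPED_FILE):
        return set()
    with open(SCRAPED_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def save_scraped_url(url):
    """Append a freshly scraped URL to the persistent list."""
    with open(SCRAPED_FILE, 'a') as f:
        f.write(url + '\n')
```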
Bandwidth Calculator
# Estimate bandwidth usage and cost
def estimate_bandwidth(url_count, avg_page_size_kb):
    total_mb = url_count * avg_page_size_kb / 1024
    total_gb = total_mb / 1024
    cost = total_gb * 0.5  # VinaProxy rate ($/GB)
    print(f"URLs: {url_count}")
    print(f"Est. bandwidth: {total_gb:.2f} GB")
    print(f"Est. cost: ${cost:.2f}")

# Example: 10,000 product pages × 100 KB each
estimate_bandwidth(10000, 100)
# URLs: 10000
# Est. bandwidth: 0.95 GB
# Est. cost: $0.48
Monitoring Usage
total_bytes = 0

def tracked_request(url, proxies):
    global total_bytes
    response = requests.get(url, proxies=proxies)
    # Note: len(response.content) is the decompressed size, so this
    # slightly overstates proxy bandwidth for compressed responses
    total_bytes += len(response.content)
    return response

# Check usage
print(f"Used: {total_bytes / 1024 / 1024:.2f} MB")
Summary
| Technique | Savings |
|---|---|
| Block images/CSS | 50-80% |
| Compression | 60-70% |
| Caching | Variable |
| Skip scraped | 100% per skip |
VinaProxy Bandwidth
- Pay only for what you use
- Real-time usage tracking in the dashboard
- Just $0.5/GB
