Optimizing Proxy Bandwidth: Cutting Scraping Costs
Proxies are billed per GB. This article walks through bandwidth optimizations that directly reduce your scraping bill.
Why Does Bandwidth Matter?
- Residential proxies are billed per GB
- 100 GB × $0.5 = $50/month
- Cutting usage by 50% saves $25/month
1. Fetch Only the Data You Need
# ❌ Bad: download the whole page just for one field
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
price = soup.select_one('.price').text

# ✅ Good: use an API when one exists (far less data)
response = requests.get(api_url)
price = response.json()['price']
2. Block Resources You Don't Need
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Block images, CSS, and fonts
    page.route('**/*.{png,jpg,jpeg,gif,svg,css,woff,woff2}',
               lambda route: route.abort())

    # Block tracking scripts
    page.route('**/analytics*', lambda route: route.abort())
    page.route('**/tracking*', lambda route: route.abort())

    page.goto(url)
    # Only the HTML and the JS you actually need get downloaded
3. HEAD Request First
# Check whether the page has changed before downloading it
def fetch_if_changed(url, last_etag=None):
    # A HEAD request transfers almost no data
    head = requests.head(url, proxies=proxies)
    current_etag = head.headers.get('ETag')
    if current_etag == last_etag:
        print("Unchanged, skipping")
        return None, current_etag
    # Only GET when the content is actually new
    response = requests.get(url, proxies=proxies)
    return response, current_etag
4. Compression
# Request a compressed response
headers = {
    'Accept-Encoding': 'gzip, deflate, br'
}
response = requests.get(url, headers=headers, proxies=proxies)
# requests decompresses automatically, but only the compressed
# bytes cross the wire, and that is what counts as bandwidth
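As a quick illustration of why this matters, the snippet below (standard library only, no network) gzips a repetitive HTML fragment and prints the size ratio; real HTML typically compresses to a fraction of its original size.

```python
import gzip

# A repetitive HTML fragment, standing in for a real product page
html = ('<div class="product"><span class="price">19.99</span></div>\n' * 500).encode()

compressed = gzip.compress(html)
ratio = len(compressed) / len(html)
print(f'Original: {len(html)} bytes, gzipped: {len(compressed)} bytes '
      f'({ratio:.0%} of original)')
```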
5. Caching
import hashlib
import os
import time

CACHE_DIR = 'cache'
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_request(url, proxies, max_age=3600):
    # Check the cache first
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = f"{CACHE_DIR}/{cache_key}"
    if os.path.exists(cache_file):
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, 'r') as f:
                return f.read()
    # Only hit the proxy on a cache miss
    response = requests.get(url, proxies=proxies)
    with open(cache_file, 'w') as f:
        f.write(response.text)
    return response.text
6. Sitemap Instead of Crawling
# ❌ Bad: crawl the entire site to discover URLs
# (wastes bandwidth on navigation pages)

# ✅ Good: parse the sitemap (a single request)
sitemap = requests.get('https://site.com/sitemap.xml', proxies=proxies)
urls = extract_urls(sitemap.text)
# Then scrape only the product pages
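The `extract_urls` helper above is assumed; a minimal standard-library sketch of it could look like this (sitemap documents use the sitemaps.org XML namespace):

```python
# A minimal sketch of an extract_urls helper using only the stdlib.
import xml.etree.ElementTree as ET

# Standard sitemap namespace, in ElementTree's {uri}tag notation
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_urls(sitemap_xml):
    """Return all <loc> values from a sitemap document, in order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f'{SITEMAP_NS}loc')]
```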
7. Selective Field Extraction
# Some APIs let you choose exactly which fields to return
# GraphQL example:
query = '''
query {
  products {
    name
    price
  }
}
'''
# Fetches only name and price, not the long description fields
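A GraphQL query like this is usually sent as a JSON POST body. A minimal sketch, where `api_url` and the exact response shape are assumptions about the target API:

```python
# Build the JSON body for a GraphQL POST request (hypothetical endpoint).
def build_graphql_payload(query, variables=None):
    """Wrap a GraphQL query (and optional variables) in the standard body."""
    payload = {'query': query}
    if variables:
        payload['variables'] = variables
    return payload

# Sending it (api_url is an assumption):
# response = requests.post(api_url, json=build_graphql_payload(query),
#                          proxies=proxies)
# products = response.json()['data']['products']
```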
8. Skip What You Already Scraped
scraped_urls = load_scraped_list()

for url in urls:
    if url in scraped_urls:
        continue  # zero bandwidth
    response = requests.get(url, proxies=proxies)
    save_scraped_url(url)
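`load_scraped_list` and `save_scraped_url` above are assumed helpers; one minimal sketch persists URLs one per line in a plain-text file (the filename is an assumption):

```python
# One possible implementation of the helpers above: a plain-text file
# with one URL per line, appended to as pages are scraped.
import os

SCRAPED_FILE = 'scraped_urls.txt'  # assumed path

def load_scraped_list():
    """Return the set of already-scraped URLs (empty set on first run)."""
    if not os.path.exists(SCRAPED_FILE):
        return set()
    with open(SCRAPED_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def save_scraped_url(url):
    """Append a freshly scraped URL to the persistent list."""
    with open(SCRAPED_FILE, 'a') as f:
        f.write(url + '\n')
```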
Bandwidth Calculator
# Estimate bandwidth usage and cost
def estimate_bandwidth(url_count, avg_page_size_kb):
    total_mb = url_count * avg_page_size_kb / 1024
    total_gb = total_mb / 1024
    cost = total_gb * 0.5  # VinaProxy rate ($/GB)
    print(f"URLs: {url_count}")
    print(f"Est. bandwidth: {total_gb:.2f} GB")
    print(f"Est. cost: ${cost:.2f}")

# Example: 10,000 product pages × 100 KB each
estimate_bandwidth(10000, 100)
# URLs: 10000
# Est. bandwidth: 0.95 GB
# Est. cost: $0.48
Monitoring Usage
total_bytes = 0

def tracked_request(url, proxies):
    global total_bytes
    response = requests.get(url, proxies=proxies)
    # Note: len(response.content) is the decompressed size, so this
    # slightly overstates proxy bandwidth for compressed responses
    total_bytes += len(response.content)
    return response

# Check usage
print(f"Used: {total_bytes / 1024 / 1024:.2f} MB")
Summary
| Technique | Savings |
|---|---|
| Block images/CSS | 50-80% |
| Compression | 60-70% |
| Caching | Variable |
| Skip scraped | 100% per skip |
VinaProxy Bandwidth
- Pay only for what you use
- Real-time usage tracking in the dashboard
- Just $0.5/GB
