Lưu Trữ Dữ Liệu Web Scraping: JSON, CSV, Database
Sau khi scrape xong, dữ liệu cần được lưu trữ hợp lý. Bài viết so sánh các phương pháp storage phổ biến.
So Sánh Các Phương Pháp
| Format | Best For | Size | Query |
|---|---|---|---|
| JSON | Nested data | Lớn | Khó |
| CSV | Tabular data | Nhỏ | OK |
| SQLite | Local DB | Trung bình | Tốt |
| PostgreSQL | Production | Tối ưu | Tốt nhất |
| MongoDB | Flexible schema | Lớn | Tốt |
1. Lưu JSON
import json
data = [
{'name': 'Product 1', 'price': 100},
{'name': 'Product 2', 'price': 200}
]
# Ghi file
with open('products.json', 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
# Đọc file
with open('products.json', 'r', encoding='utf-8') as f:
loaded = json.load(f)
2. Lưu CSV
import csv
data = [
{'name': 'Product 1', 'price': 100},
{'name': 'Product 2', 'price': 200}
]
# Ghi CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=['name', 'price'])
writer.writeheader()
writer.writerows(data)
# Đọc CSV
with open('products.csv', 'r', encoding='utf-8') as f:
reader = csv.DictReader(f)
for row in reader:
print(row)
3. SQLite (Đơn Giản)
import sqlite3
conn = sqlite3.connect('products.db')
cursor = conn.cursor()
# Tạo table
cursor.execute('''
CREATE TABLE IF NOT EXISTS products (
id INTEGER PRIMARY KEY,
name TEXT,
price REAL,
url TEXT
)
''')
# Insert data
cursor.execute('INSERT INTO products (name, price, url) VALUES (?, ?, ?)',
('Product 1', 100, 'https://example.com/1'))
conn.commit()
conn.close()
4. PostgreSQL (Production)
import psycopg2
conn = psycopg2.connect(
host="localhost",
database="scraping",
user="user",
password="pass"
)
cursor = conn.cursor()
cursor.execute('''
INSERT INTO products (name, price, scraped_at)
VALUES (%s, %s, NOW())
''', ('Product 1', 100))
conn.commit()
conn.close()
5. MongoDB (Flexible)
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['scraping']
collection = db['products']
# Insert
collection.insert_one({
'name': 'Product 1',
'price': 100,
'variants': ['red', 'blue']
})
# Query
products = collection.find({'price': {'$gt': 50}})
Khi Nào Dùng Gì?
- JSON: Quick export, nested data, APIs
- CSV: Excel analysis, simple data
- SQLite: Local projects, single user
- PostgreSQL: Production, multi-user, scale
- MongoDB: Varied schemas, documents
VinaProxy + Data Pipeline
- Scrape → Clean → Store workflow
- Proxy rotation cho large datasets
- Giá chỉ $0.5/GB
