Exporting Scraped Data: JSON, CSV, Excel, Database
After scraping, you need to export the data so it can actually be used. This article walks through exporting scraped data to the most common formats.
Common Formats
- JSON: Flexible, nested data
- CSV: Simple, spreadsheet-friendly
- Excel: Formatted tables, charts
- SQLite: Local database
- PostgreSQL/MySQL: Production database
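The formats above can be wrapped behind a single export helper that picks a writer from the file extension. The `export` function below is an illustrative sketch, not a standard API; it covers only JSON and CSV:

```python
import csv
import json

def export(data, path):
    # Illustrative helper: choose a writer based on the file extension
    if path.endswith('.json'):
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
    elif path.endswith('.csv'):
        with open(path, 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.DictWriter(f, fieldnames=data[0].keys())
            writer.writeheader()
            writer.writerows(data)
    else:
        raise ValueError(f'Unsupported format: {path}')

export([{'name': 'Product 1', 'price': 100000}], 'products.json')
```

The sections below show each format in full, without the dispatch layer.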
Export JSON
import json

data = [
    {'name': 'Product 1', 'price': 100000, 'url': 'https://...'},
    {'name': 'Product 2', 'price': 200000, 'url': 'https://...'},
]

# Pretty print
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Compact (smaller file)
with open('products_compact.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, separators=(',', ':'))
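For long scraping runs, one useful variant (not shown above) is JSON Lines: one JSON object per line, appended as each item arrives, so a crash mid-run never corrupts records already written. A minimal sketch:

```python
import json

def append_jsonl(path, item):
    # Append one record per line; safe to resume after a crash
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

def read_jsonl(path):
    # Read the file back into a list of dicts
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

append_jsonl('products.jsonl', {'name': 'Product 1', 'price': 100000})
append_jsonl('products.jsonl', {'name': 'Product 2', 'price': 200000})
```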
Export CSV
import csv

data = [
    {'name': 'Product 1', 'price': 100000, 'category': 'Electronics'},
    {'name': 'Product 2', 'price': 200000, 'category': 'Fashion'},
]

# Dict writer (recommended)
# utf-8-sig writes a BOM so Excel displays Vietnamese text correctly
with open('products.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)

# Read back
with open('products.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
Export Excel
import pandas as pd

data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]
df = pd.DataFrame(data)

# Simple export
df.to_excel('products.xlsx', index=False)

# With formatting
with pd.ExcelWriter('products_formatted.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Products', index=False)
    # Get the worksheet
    worksheet = writer.sheets['Products']
    # Auto-adjust column widths
    for column in worksheet.columns:
        max_length = max(len(str(cell.value)) for cell in column)
        worksheet.column_dimensions[column[0].column_letter].width = max_length + 2
Export SQLite
import sqlite3

data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]

conn = sqlite3.connect('products.db')
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price INTEGER,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Insert data
for item in data:
    cursor.execute(
        'INSERT INTO products (name, price) VALUES (?, ?)',
        (item['name'], item['price'])
    )

conn.commit()
conn.close()
Export PostgreSQL
import psycopg2

products = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]

conn = psycopg2.connect(
    host='localhost',
    database='scraping_db',
    user='user',
    password='password'
)
cursor = conn.cursor()

# Bulk insert
rows = [(item['name'], item['price']) for item in products]
cursor.executemany(
    'INSERT INTO products (name, price) VALUES (%s, %s)',
    rows
)

conn.commit()
conn.close()
Pandas for Every Format
import pandas as pd
import sqlite3

scraped_data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]
df = pd.DataFrame(scraped_data)

# Multiple exports from the same DataFrame
df.to_json('data.json', orient='records', force_ascii=False)
df.to_csv('data.csv', index=False)
df.to_excel('data.xlsx', index=False)

conn = sqlite3.connect('products.db')
df.to_sql('products', conn, if_exists='append', index=False)
Best Practices
- Use UTF-8 encoding so Vietnamese text survives the round trip
- Include a scraped-at timestamp with each record
- Validate data before exporting
- Back up exported data regularly
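The validation and timestamp bullets above can be combined into a small pre-export step. The field names and rules below are hypothetical examples; adjust them to your own schema:

```python
from datetime import datetime, timezone

def validate(item):
    # Hypothetical rules: require a name and a positive numeric price
    if not item.get('name'):
        return False
    price = item.get('price')
    if not isinstance(price, (int, float)) or price <= 0:
        return False
    return True

def prepare_for_export(items):
    # Keep only valid records and stamp each with the scrape time (UTC)
    now = datetime.now(timezone.utc).isoformat()
    return [{**item, 'scraped_at': now} for item in items if validate(item)]

raw = [
    {'name': 'Product 1', 'price': 100000},
    {'name': '', 'price': 200000},       # missing name -> dropped
    {'name': 'Product 3', 'price': -5},  # bad price -> dropped
]
clean = prepare_for_export(raw)
```

Running any of the exporters above on `clean` instead of `raw` then guarantees every exported row has a name, a sane price, and a timestamp.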
VinaProxy + Data Export
- Scrape → Clean → Export workflow
- Reliable data collection
- Pricing from just $0.5/GB
