Export Dữ Liệu Scraping: JSON, CSV, Excel, Database

After scraping, you need to export the data for downstream use. This article walks through exporting scraped data to the most common formats.

Popular Formats

  • JSON: Flexible, nested data
  • CSV: Simple, spreadsheet-friendly
  • Excel: Formatted tables, charts
  • SQLite: Local database
  • PostgreSQL/MySQL: Production database
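The choice usually follows the file extension, so a small dispatcher can route the same records to the right writer. This is a sketch, not from the article; the `export` helper and file names are illustrative:

```python
import csv
import json
import os

def export(records, path):
    """Write a list of dicts to JSON or CSV, chosen by file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.json':
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
    elif ext == '.csv':
        with open(path, 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f'Unsupported format: {ext}')

export([{'name': 'Product 1', 'price': 100000}], 'products.json')
```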

Export JSON

import json

data = [
    {'name': 'Product 1', 'price': 100000, 'url': 'https://...'},
    {'name': 'Product 2', 'price': 200000, 'url': 'https://...'},
]

# Pretty print
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Compact (smaller file)
with open('products_compact.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, separators=(',', ':'))
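For long scrapes it is often safer to append one record per line (JSON Lines) as you go, rather than rewriting the whole file. A sketch, reusing the same `data` list as above; the `.jsonl` file name is illustrative:

```python
import json

data = [
    {'name': 'Product 1', 'price': 100000, 'url': 'https://...'},
    {'name': 'Product 2', 'price': 200000, 'url': 'https://...'},
]

# Append one JSON object per line (JSON Lines / .jsonl)
with open('products.jsonl', 'a', encoding='utf-8') as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

# Read back line by line
with open('products.jsonl', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
```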

Export CSV

import csv

data = [
    {'name': 'Product 1', 'price': 100000, 'category': 'Electronics'},
    {'name': 'Product 2', 'price': 200000, 'category': 'Fashion'},
]

# Dict writer (recommended)
with open('products.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=data[0].keys())
    writer.writeheader()
    writer.writerows(data)

# Read back
with open('products.csv', 'r', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
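Scraped rows do not always share the same keys, and `DictWriter` raises an error on unknown fields by default. One sketch: pass the union of all field names and use `restval` to fill gaps (the sample rows and file name are illustrative):

```python
import csv

rows = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000, 'category': 'Fashion'},
]

# Collect the union of keys so no column is silently dropped
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

with open('products_mixed.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)
```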

Export Excel

import pandas as pd

data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]

df = pd.DataFrame(data)

# Simple export
df.to_excel('products.xlsx', index=False)

# With formatting
with pd.ExcelWriter('products_formatted.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Products', index=False)
    
    # Get workbook and worksheet
    workbook = writer.book
    worksheet = writer.sheets['Products']
    
    # Auto-adjust column widths
    for column in worksheet.columns:
        max_length = max(len(str(cell.value)) for cell in column)
        worksheet.column_dimensions[column[0].column_letter].width = max_length + 2

Export SQLite

import sqlite3

data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]

conn = sqlite3.connect('products.db')
cursor = conn.cursor()

# Create table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price INTEGER,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
''')

# Insert data
for item in data:
    cursor.execute(
        'INSERT INTO products (name, price) VALUES (?, ?)',
        (item['name'], item['price'])
    )

conn.commit()
conn.close()
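For larger batches, `executemany` avoids one round trip per row; with named placeholders it even accepts the scraped dicts directly. A sketch using an in-memory database for illustration:

```python
import sqlite3

data = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]

conn = sqlite3.connect(':memory:')  # in-memory DB for illustration
cursor = conn.cursor()
cursor.execute('CREATE TABLE products (name TEXT, price INTEGER)')

# Insert the whole batch in one call; named placeholders
# (:name, :price) are filled from each dict's keys
cursor.executemany(
    'INSERT INTO products (name, price) VALUES (:name, :price)',
    data
)
conn.commit()

count = cursor.execute('SELECT COUNT(*) FROM products').fetchone()[0]
```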

Export PostgreSQL

import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scraping_db',
    user='user',
    password='password'
)
cursor = conn.cursor()

# Bulk insert (products is the list of scraped dicts from earlier)
products = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2', 'price': 200000},
]
rows = [(item['name'], item['price']) for item in products]
cursor.executemany(
    'INSERT INTO products (name, price) VALUES (%s, %s)',
    rows
)

conn.commit()
conn.close()

Pandas For Every Format

import pandas as pd

# scraped_data: the list of dicts produced by your scraper
df = pd.DataFrame(scraped_data)

# Multiple exports
df.to_json('data.json', orient='records', force_ascii=False)
df.to_csv('data.csv', index=False)
df.to_excel('data.xlsx', index=False)
df.to_sql('products', conn, if_exists='append', index=False)  # conn: sqlite3 connection or SQLAlchemy engine

Best Practices

  • Use UTF-8 encoding for Vietnamese text
  • Include timestamps
  • Validate data before exporting
  • Back up regularly
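The "validate before exporting" point can be sketched as a small filter that drops records missing required fields or carrying the wrong type. The required-field names here are illustrative:

```python
# Required fields and their expected types (illustrative)
REQUIRED = {'name': str, 'price': int}

def is_valid(record):
    """Check that every required field exists and has the expected type."""
    return all(
        field in record and isinstance(record[field], expected)
        for field, expected in REQUIRED.items()
    )

scraped = [
    {'name': 'Product 1', 'price': 100000},
    {'name': 'Product 2'},                   # missing price
    {'name': 'Product 3', 'price': 'N/A'},   # wrong type
]

# Keep only records that pass validation
clean = [r for r in scraped if is_valid(r)]
```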

VinaProxy + Data Export

  • Scrape → Clean → Export workflow
  • Reliable data collection
  • Pricing from just $0.5/GB

Try It Now →