Lưu Trữ Dữ Liệu Web Scraping: JSON, CSV, Database

Trở lại Tin tức
Tin tức

Lưu Trữ Dữ Liệu Web Scraping: JSON, CSV, Database

Sau khi scrape xong, dữ liệu cần được lưu trữ hợp lý. Bài viết so sánh các phương pháp storage phổ biến.

So Sánh Các Phương Pháp

Format Best For Size Query
JSON Nested data Lớn Khó
CSV Tabular data Nhỏ OK
SQLite Local DB Trung bình Tốt
PostgreSQL Production Tối ưu Tốt nhất
MongoDB Flexible schema Lớn Tốt

1. Lưu JSON

import json

data = [
    {'name': 'Product 1', 'price': 100},
    {'name': 'Product 2', 'price': 200}
]

# Ghi file
with open('products.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

# Đọc file
with open('products.json', 'r', encoding='utf-8') as f:
    loaded = json.load(f)

2. Lưu CSV

import csv

data = [
    {'name': 'Product 1', 'price': 100},
    {'name': 'Product 2', 'price': 200}
]

# Ghi CSV
with open('products.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(data)

# Đọc CSV
with open('products.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

3. SQLite (Đơn Giản)

import sqlite3

conn = sqlite3.connect('products.db')
cursor = conn.cursor()

# Tạo table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY,
        name TEXT,
        price REAL,
        url TEXT
    )
''')

# Insert data
cursor.execute('INSERT INTO products (name, price, url) VALUES (?, ?, ?)',
               ('Product 1', 100, 'https://example.com/1'))

conn.commit()
conn.close()

4. PostgreSQL (Production)

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="scraping",
    user="user",
    password="pass"
)

cursor = conn.cursor()
cursor.execute('''
    INSERT INTO products (name, price, scraped_at)
    VALUES (%s, %s, NOW())
''', ('Product 1', 100))

conn.commit()
conn.close()

5. MongoDB (Flexible)

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['scraping']
collection = db['products']

# Insert
collection.insert_one({
    'name': 'Product 1',
    'price': 100,
    'variants': ['red', 'blue']
})

# Query
products = collection.find({'price': {'$gt': 50}})

Khi Nào Dùng Gì?

  • JSON: Quick export, nested data, APIs
  • CSV: Excel analysis, simple data
  • SQLite: Local projects, single user
  • PostgreSQL: Production, multi-user, scale
  • MongoDB: Varied schemas, documents

VinaProxy + Data Pipeline

  • Scrape → Clean → Store workflow
  • Proxy rotation cho large datasets
  • Giá chỉ $0.5/GB

Dùng Thử Ngay →