Data Cleaning After Scraping: How to Clean Scraped Data
Scraped data is usually “dirty”: it contains stray whitespace, duplicates, and inconsistent formats. This article shows how to clean it effectively.
Common Problems
- Extra whitespace và newlines
- Duplicate entries
- Inconsistent formats (date, price)
- Missing values
- HTML entities (&amp;, &lt;, etc.)
Basic Text Cleaning
# Strip leading and trailing whitespace
text = " Hello World \n"
clean = text.strip() # "Hello World"
# Collapse every run of whitespace (spaces, tabs, newlines) into one space
clean = " ".join(text.split()) # "Hello World"
# Replace newlines with spaces and drop carriage returns
clean = text.replace('\n', ' ').replace('\r', '')
# Decode HTML entities
import html
text = "Price: $100 &amp; more"
clean = html.unescape(text) # "Price: $100 & more"
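The individual steps above can be combined into one helper. The following is a minimal sketch; the name `clean_text` and the choice to return an empty string for `None` are assumptions made for illustration, not part of the original snippets.

```python
import html


def clean_text(raw):
    """Decode HTML entities, then collapse all whitespace runs into single spaces."""
    if raw is None:
        return ""
    decoded = html.unescape(raw)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space that "&nbsp;" decodes to.
    return " ".join(decoded.split())


print(clean_text("  Tom &amp; Jerry\n\t&nbsp;DVD  "))  # Tom & Jerry DVD
```

Decoding before splitting matters here: entities such as `&nbsp;` only become whitespace after `html.unescape` has run.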
Normalize Prices
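import re
def clean_price(price_str):
    # Keep only digits, dots, and commas (drops currency symbols and spaces)
    clean = re.sub(r'[^\d.,]', '', price_str)
    # A separator followed by exactly three digits is treated as a thousands
    # separator, so an ambiguous value like "1.234" is read as 1234
    clean = re.sub(r'[.,](?=\d{3}(?:\D|$))', '', clean)
    # Any comma that remains is a decimal comma
    clean = clean.replace(',', '.')
    try:
        return float(clean)
    except ValueError:
        return 0.0  # nothing parseable was found
# Examples
clean_price("$1,234.56") # 1234.56
clean_price("1.234.567 đ") # 1234567.0
clean_price("VND 99,000") # 99000.0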
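Returning 0.0 for an unparseable price makes a missing price look the same as a free item. One possible variant, sketched below, returns `None` instead and uses `Decimal` to avoid float rounding. The name `parse_price` and the three-digit rule for thousands separators are assumptions for illustration; real sites may need a per-site rule.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(price_str: Optional[str]) -> Optional[Decimal]:
    """Return the price as a Decimal, or None when no number can be read."""
    if not price_str:
        return None
    digits = re.sub(r"[^\d.,]", "", price_str)
    # Assumption: a separator followed by exactly three digits groups thousands.
    digits = re.sub(r"[.,](?=\d{3}(?:\D|$))", "", digits)
    digits = digits.replace(",", ".")  # any comma left is a decimal comma
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


print(parse_price("$1,234.56"))       # 1234.56
print(parse_price("1.234.567 đ"))     # 1234567
print(parse_price("Call for price"))  # None
```

With `None` as the failure value, a later `dropna` or "filter out incomplete" step can actually detect rows whose price could not be read.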
Normalize Dates
from datetime import datetime
def parse_date(date_str):
    # Formats to try, in order; the first match wins
    formats = [
        '%Y-%m-%d',
        '%d/%m/%Y',
        '%d-%m-%Y',
        '%B %d, %Y',  # e.g. "January 15, 2026" (month names depend on locale)
    ]
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None  # no known format matched
# Examples
parse_date("2026-01-15") # datetime(2026, 1, 15, 0, 0)
parse_date("15/01/2026") # datetime(2026, 1, 15, 0, 0)
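Once a date is parsed, it usually helps to store it in a single format. The sketch below normalizes to ISO 8601 (`YYYY-MM-DD`) strings. The name `to_iso_date` and the `DATE_FORMATS` tuple are assumptions for illustration; the month-name format is left out here because it depends on the system locale.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; extend the tuple to match the site being scraped.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def to_iso_date(date_str: str) -> Optional[str]:
    """Return the date as 'YYYY-MM-DD', or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(date_str.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(to_iso_date("15/01/2026"))    # 2026-01-15
print(to_iso_date(" 15-01-2026 "))  # 2026-01-15
print(to_iso_date("next week"))     # None
```

ISO strings sort chronologically as plain text, which keeps CSV exports easy to work with.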
Remove Duplicates
# List of dicts
data = [
    {'id': 1, 'name': 'A'},
    {'id': 1, 'name': 'A'}, # duplicate
    {'id': 2, 'name': 'B'},
]
# Remove duplicates, keeping the first item seen for each id
seen = set()
unique = []
for item in data:
    key = item['id']
    if key not in seen:
        seen.add(key)
        unique.append(item)
# Or with pandas
import pandas as pd
df = pd.DataFrame(data)
df = df.drop_duplicates(subset=['id'])
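Scraped records do not always carry a reliable `id`. In that case, one option is to deduplicate on a combination of fields after normalizing their text. The sketch below assumes hypothetical `name` and `shop` fields and a helper called `dedupe_by`; neither comes from the original data.

```python
def dedupe_by(items, key_fields):
    """Keep the first item for each normalized combination of key_fields."""
    seen = set()
    unique = []
    for item in items:
        # Normalize strings so " widget " and "Widget" count as the same value.
        key = tuple(
            value.strip().lower() if isinstance(value, str) else value
            for value in (item.get(field) for field in key_fields)
        )
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


products = [
    {"name": "Widget", "shop": "A", "price": 10},
    {"name": " widget ", "shop": "a", "price": 10},  # same product, messy text
    {"name": "Widget", "shop": "B", "price": 12},
]
print(len(dedupe_by(products, ["name", "shop"])))  # 2
```

The key fields must hold hashable values (strings, numbers, `None`); lists or dicts would need to be converted first.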
Handle Missing Values
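# Default values (note: `or` also replaces empty strings and 0)
name = item.get('name') or 'Unknown'
price = item.get('price') or 0
# Filter out incomplete records; compare price to None so a real price of 0 is kept
complete = [item for item in data if item.get('name') and item.get('price') is not None]
# With pandas, choose one approach:
dropped = df.dropna() # Remove rows with any NaN
filled = df.fillna({'price': 0, 'name': 'Unknown'}) # Or fill gaps with defaults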
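Before choosing between dropping and filling, it can help to see how many values are actually missing in each field. The following is a rough sketch in plain Python; `count_missing` is a hypothetical helper, and treating blank strings as missing is an assumption.

```python
def count_missing(records, fields):
    """Count, per field, how many records lack a usable value."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            # None and blank strings count as missing; 0 is a valid value.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return missing


records = [
    {"name": "A", "price": 0},
    {"name": "  ", "price": 5},
    {"name": "C"},
]
print(count_missing(records, ["name", "price"]))  # {'name': 1, 'price': 1}
```

A field that is missing in most records is often a sign of a broken selector in the scraper, not of bad source data.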
Pandas Pipeline
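import pandas as pd
# raw_data: the list of scraped records (dicts with id, name, price, date)
df = pd.DataFrame(raw_data)
# Clean pipeline
# Drop incomplete rows first: clean_price expects a string, not NaN
df = df.dropna(subset=['name', 'price'])
df['name'] = df['name'].str.strip()
df['price'] = df['price'].apply(clean_price)
# Unparseable dates become NaT instead of raising
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.drop_duplicates(subset=['id'])
# Export
df.to_csv('clean_data.csv', index=False)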
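For small scrapes where pandas is not available, roughly the same steps can be sketched with the standard library alone. This is a swapped-in alternative, not the pandas pipeline above; the function names and the sample records are made up for illustration.

```python
import csv
import io


def clean_records(raw_records):
    """Strip names, drop incomplete rows, and keep the first row per id."""
    seen_ids = set()
    cleaned = []
    for record in raw_records:
        name = (record.get("name") or "").strip()
        price = record.get("price")
        if not name or price is None:
            continue  # incomplete row
        if record.get("id") in seen_ids:
            continue  # duplicate id
        seen_ids.add(record.get("id"))
        cleaned.append({"id": record.get("id"), "name": name, "price": price})
    return cleaned


def to_csv_text(records):
    """Serialize cleaned records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "name", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


raw = [
    {"id": 1, "name": " Widget ", "price": 9.5},
    {"id": 1, "name": "Widget", "price": 9.5},  # duplicate id
    {"id": 2, "name": "", "price": 3.0},         # missing name
    {"id": 3, "name": "Gadget", "price": 0},
]
print(to_csv_text(clean_records(raw)))
```

To write a file instead of a string, the same `csv.DictWriter` can be pointed at a file opened with `newline=""`.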
VinaProxy + Clean Data Pipeline
- Scrape raw → Clean → Store
- Reliable data extraction
- Priced at just $0.5/GB
