Regex in Web Scraping: Extracting Data with Pattern Matching
Regular expressions (regex) are a powerful tool for extracting patterns from text. This article shows how to use regex for scraping.
When Should You Use Regex?
- Extract emails, phones, URLs
- Parse structured text (no HTML tags)
- Clean and normalize data
- Validate formats
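As a minimal sketch of the "validate formats" case, the snippet below checks whether a string has the shape of an ISO date. The function name `looks_like_iso_date` and the choice of pattern are assumptions made for illustration; the check covers only the shape of the string, so a value such as `2024-99-99` would still pass.

```python
import re

# Hypothetical validator: the name and the pattern are illustrative only.
DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def looks_like_iso_date(value: str) -> bool:
    """Return True if the whole string has the shape YYYY-MM-DD."""
    # fullmatch requires the entire string to match, unlike search or findall
    return DATE_RE.fullmatch(value) is not None

print(looks_like_iso_date("2024-05-01"))        # True
print(looks_like_iso_date("Date: 2024-05-01"))  # False
```

For validation, `re.fullmatch` is usually a safer choice than `re.search`, because extra text around the value causes a rejection instead of a silent match.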
Python re Module
import re
text = "Contact: email@example.com or call 0912-345-678"
# Find all emails
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
print(emails) # ['email@example.com']
# Find all phone numbers
phones = re.findall(r'\d{4}-\d{3}-\d{3}', text)
print(phones) # ['0912-345-678']
Common Patterns
# Email
r'[\w\.-]+@[\w\.-]+\.\w+'
# Phone (VN)
r'0\d{9,10}'
r'(\+84|0)\d{9,10}'
# URL
r'https?://[^\s<>"{}|\\^`\[\]]+'
# Price
r'\$[\d,]+\.?\d*'
r'[\d,.]+\s*(đ|VND|VNĐ)'
# Date
r'\d{1,2}/\d{1,2}/\d{4}'
r'\d{4}-\d{2}-\d{2}'
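One detail matters when these patterns are passed to `re.findall`: if a pattern contains a capturing group, `findall` returns only the group's text, not the whole match. The sketch below runs the phone and price patterns on a made-up `sample` string (an assumption for illustration) and switches to non-capturing groups `(?:...)` to get full matches back.

```python
import re

# Made-up sample text for illustration
sample = "Hotline 0912345678 or +84912345678. Sale: 1,200,000 VND (was 1.500.000 đ)."

# With a capturing group, findall returns only the captured prefix
prefixes = re.findall(r'(\+84|0)\d{9,10}', sample)
print(prefixes)  # ['0', '+84']

# With a non-capturing group, findall returns the whole match
phones = re.findall(r'(?:\+84|0)\d{9,10}', sample)
print(phones)    # ['0912345678', '+84912345678']

prices = re.findall(r'[\d,.]+\s*(?:đ|VND|VNĐ)', sample)
print(prices)    # ['1,200,000 VND', '1.500.000 đ']
```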
Extract Groups
text = "Product: iPhone 15 Pro - Price: $999"
# Named groups
pattern = r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)'
match = re.search(pattern, text)
if match:
    print(match.group('name'))   # iPhone 15 Pro
    print(match.group('price'))  # 999
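Named groups also work when a text contains several records. The following sketch, which assumes a two-line `listing` string invented for illustration, uses `finditer` to walk over every match and builds one dictionary per product.

```python
import re

# Made-up listing with one product per line
listing = """Product: iPhone 15 Pro - Price: $999
Product: Galaxy S24 - Price: $799"""

pattern = re.compile(r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)')

# . does not match newlines by default, so each match stays on one line
products = [
    {"name": m.group("name"), "price": int(m.group("price"))}
    for m in pattern.finditer(listing)
]
print(products)
# [{'name': 'iPhone 15 Pro', 'price': 999}, {'name': 'Galaxy S24', 'price': 799}]
```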
Replace and Clean
text = "Price: 1,234,567 VND"
# Remove commas
clean = re.sub(r',', '', text)
# "Price: 1234567 VND"
# Extract number only
price = re.search(r'[\d,]+', text).group()
price = int(price.replace(',', ''))
# 1234567
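Calling `.group()` directly on the result of `re.search` raises `AttributeError` when nothing matches, which is common on scraped pages. A possible wrapper is sketched below; `parse_vnd` is a hypothetical helper name, and it assumes that both `,` and `.` are thousands separators, so it is not suitable for amounts with decimals.

```python
import re
from typing import Optional

def parse_vnd(text: str) -> Optional[int]:
    """Return the first amount found in text as an int, or None if absent."""
    match = re.search(r'\d[\d,.]*', text)
    if match is None:
        return None
    # Assumption: "," and "." are thousands separators, never decimal points
    digits = re.sub(r'[,.]', '', match.group())
    return int(digits)

print(parse_vnd("Price: 1,234,567 VND"))  # 1234567
print(parse_vnd("Giá: 1.234.567đ"))       # 1234567
print(parse_vnd("Contact us"))            # None
```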
Multiline Matching
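# Sample markup: the tag and class names are illustrative
html = '''
<div class="product">
<h2>Product Name</h2>
<span class="price">$99</span>
</div>
'''
# The DOTALL flag lets . match newlines
pattern = r'<div class="product">(.*?)</div>'
match = re.search(pattern, html, re.DOTALL)
print(match.group(1))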
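`re.DOTALL` is easy to confuse with `re.MULTILINE`. The sketch below contrasts the two on a small made-up `block` string: `MULTILINE` changes where `^` and `$` match, while `DOTALL` changes what `.` matches.

```python
import re

# Made-up three-line text for illustration
block = "Name: Widget\nPrice: $99\nStock: 12"

# Without flags, ^ matches only at the very start of the string
first_only = re.findall(r'^\w+', block)
print(first_only)  # ['Name']

# MULTILINE makes ^ match at the start of every line
line_starts = re.findall(r'^\w+', block, re.MULTILINE)
print(line_starts)  # ['Name', 'Price', 'Stock']

# Without DOTALL, . cannot cross the newline, so there is no match
no_match = re.search(r'Name: (.*)Stock', block)
print(no_match)  # None

# DOTALL lets . cross newlines
spanning = re.search(r'Name: (.*)Stock', block, re.DOTALL).group(1)
print(repr(spanning))  # 'Widget\nPrice: $99\n'
```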
Regex vs BeautifulSoup
| Task | Use |
|---|---|
| Parse HTML structure | BeautifulSoup |
| Extract from text | Regex |
| Complex HTML | BeautifulSoup |
| Simple patterns | Regex |
Performance Tips
import re

# Compile the pattern once if it is used many times
email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')
for text in texts:  # texts: an iterable of strings to scan
    emails = email_pattern.findall(text)
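When several kinds of data are extracted from the same pages, one way to organize the compiled patterns is a dictionary keyed by field name. This is only a sketch: `PATTERNS`, `extract_all`, and the sample `texts` list are hypothetical names and data introduced for illustration.

```python
import re

# Hypothetical registry of patterns, each compiled once at import time
PATTERNS = {
    "email": re.compile(r'[\w.-]+@[\w.-]+\.\w+'),
    "date": re.compile(r'\d{4}-\d{2}-\d{2}'),
}

def extract_all(texts):
    """Collect every match of every pattern across all texts."""
    results = {name: [] for name in PATTERNS}
    for text in texts:
        for name, pattern in PATTERNS.items():
            results[name].extend(pattern.findall(text))
    return results

texts = ["Mail a@b.com on 2024-01-02", "No data here"]
found = extract_all(texts)
print(found)  # {'email': ['a@b.com'], 'date': ['2024-01-02']}
```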
VinaProxy + Data Extraction
- Scrape raw data, then extract with regex
- Process large volumes
- Priced at only $0.5/GB
