Regex in Web Scraping: Extracting Data with Pattern Matching

Regular expressions (regex) are a powerful tool for extracting patterns from text. This article shows how to use them for scraping.

When to Use Regex?

Extract emails, phone numbers, URLs
Parse structured text (without HTML tags)
Clean and normalize data
Validate formats
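
To illustrate the last item: validating a format usually means checking that the whole string matches, not just a part of it. A minimal sketch, assuming a hypothetical helper `is_iso_date` and a plain YYYY-MM-DD shape (it checks the shape only, not whether the date exists on the calendar):

```python
import re

# Hypothetical helper: shape check only, does not verify calendar validity.
ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def is_iso_date(value: str) -> bool:
    # fullmatch requires the entire string to match the pattern
    return ISO_DATE.fullmatch(value) is not None

print(is_iso_date("2024-01-31"))        # True
print(is_iso_date("Date: 2024-01-31"))  # False
```

`re.search` would accept the second string because it only needs a match somewhere inside it, which is why `fullmatch` is the safer choice for validation.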

Python re Module

import re

text = "Contact: email@example.com or call 0912-345-678"

# Find all emails
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
print(emails)  # ['email@example.com']

# Find all phone numbers
phones = re.findall(r'\d{4}-\d{3}-\d{3}', text)
print(phones)  # ['0912-345-678']

Common Patterns

# Email
r'[\w\.-]+@[\w\.-]+\.\w+'

# Phone (VN)
r'0\d{9,10}'
r'(?:\+84|0)\d{9,10}'  # non-capturing group, so findall returns the whole number

# URL
r'https?://[^\s<>"{}|\\^`\[\]]+'

# Price
r'\$[\d,]+\.?\d*'
r'[\d,.]+\s*(?:đ|VND|VNĐ)'  # non-capturing group, so findall returns the whole price

# Date
r'\d{1,2}/\d{1,2}/\d{4}'
r'\d{4}-\d{2}-\d{2}'
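
The patterns above can be tried against a small piece of text. This is only a sketch: the sample string is made up, and the phone pattern is written with a non-capturing group so that `re.findall` returns the full match instead of just the prefix.

```python
import re

# Patterns adapted from the list above; the sample text is made up.
PATTERNS = {
    "email": r'[\w\.-]+@[\w\.-]+\.\w+',
    "phone": r'(?:\+84|0)\d{9,10}',
    "url": r'https?://[^\s<>"{}|\\^`\[\]]+',
    "date": r'\d{4}-\d{2}-\d{2}',
}

sample = ("Updated 2024-05-01. Mail sales@example.com, "
          "call 0912345678, see https://example.com/deals")

# Run every pattern over the same text and collect the matches by name
found = {name: re.findall(p, sample) for name, p in PATTERNS.items()}
for name, matches in found.items():
    print(name, matches)
```

Patterns like these are deliberately loose. They are fine for pulling candidates out of scraped text, but anything that must be strictly valid (an email you will send to, for example) probably needs a separate check.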

Extract Groups

text = "Product: iPhone 15 Pro - Price: $999"

# Named groups
pattern = r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)'
match = re.search(pattern, text)

if match:
    print(match.group('name'))   # iPhone 15 Pro
    print(match.group('price'))  # 999
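
When a page lists several items, the same idea extends to `re.finditer`, and `groupdict()` turns each match into a dictionary. A minimal sketch with made-up listing text, using the `(?P<name>...)` named-group syntax:

```python
import re

# Sample listing text is made up for illustration.
listing = """Product: iPhone 15 Pro - Price: $999
Product: Galaxy S24 - Price: $799"""

pattern = re.compile(r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)')

# finditer yields one match object per product line;
# . does not cross newlines by default, so each match stays on its line
products = [m.groupdict() for m in pattern.finditer(listing)]
print(products)
```

Each dictionary has the keys `name` and `price`, with the price still a string, so convert it with `int()` before doing arithmetic.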

Replace and Clean

text = "Price: 1,234,567 VND"

# Remove commas
clean = re.sub(r',', '', text)
# "Price: 1234567 VND"

# Extract number only
price = re.search(r'[\d,]+', text).group()
price = int(price.replace(',', ''))
# 1234567
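
Scraped prices often come in several formats, and a search may find nothing at all. The sketch below wraps the cleaning step in a hypothetical helper, `parse_vnd`, under the assumption that amounts are whole numbers and that both "," and "." are thousands separators:

```python
import re
from typing import Optional

def parse_vnd(text: str) -> Optional[int]:
    # Hypothetical helper: assumes whole-number VND amounts, where both
    # "," and "." act as thousands separators.
    match = re.search(r'([\d.,]+)\s*(?:đ|VND|VNĐ)', text)
    if match is None:
        return None  # no price found; avoids calling .group() on None
    digits = re.sub(r'\D', '', match.group(1))  # drop every non-digit
    return int(digits) if digits else None

print(parse_vnd("Price: 1,234,567 VND"))  # 1234567
print(parse_vnd("Price: 1.234.567 đ"))    # 1234567
print(parse_vnd("Out of stock"))          # None
```

Returning `None` instead of raising keeps a scraping loop running when a page has no price. The separator assumption would be wrong for amounts with decimal places.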

Multiline Matching

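html = '''
<div class="product">
    <h2>Product Name</h2>
    <span class="price">$99</span>
</div>
'''

# The DOTALL flag lets . match newlines
# (tag and class names here are illustrative)
pattern = r'<div class="product">(.*?)</div>'
match = re.search(pattern, html, re.DOTALL)
print(match.group(1))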
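
`re.DOTALL` is easy to confuse with `re.MULTILINE`, which changes `^` and `$` rather than `.`. A short sketch on a made-up text block shows the difference:

```python
import re

# Made-up plain-text block for illustration.
block = """Name: Widget
Price: $99
Stock: 12"""

# MULTILINE: ^ and $ match at the start and end of every line
keys = re.findall(r'^(\w+):', block, re.MULTILINE)
print(keys)  # ['Name', 'Price', 'Stock']

# Without DOTALL, . stops at the newline, so this finds nothing
no_flag = re.search(r'Name:.*Stock', block)
print(no_flag)  # None

# With DOTALL, . also matches newlines and the match spans lines
spanning = re.search(r'Name:.*Stock', block, re.DOTALL)
print(spanning is not None)  # True
```

The two flags can be combined with `re.MULTILINE | re.DOTALL` when a pattern needs both behaviors.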

Regex vs BeautifulSoup

Task	Use
Parse HTML structure	BeautifulSoup
Extract from text	Regex
Complex HTML	BeautifulSoup
Simple patterns	Regex
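
In practice the two approaches are often combined: a parser deals with the HTML structure, then regex runs on the extracted text. The sketch below uses the standard library's `html.parser` as a stand-in so that it runs without extra packages. With BeautifulSoup, `get_text()` would play the same role. The `TextCollector` class and the markup are hypothetical.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only pieces
        if data.strip():
            self.chunks.append(data.strip())

# Made-up markup for illustration.
html = '<ul><li>Widget <b>$19.99</b></li><li>Gadget <b>$5</b></li></ul>'

# Step 1: the parser handles the HTML structure
parser = TextCollector()
parser.feed(html)
parser.close()
text = " ".join(parser.chunks)

# Step 2: regex handles the simple pattern inside the extracted text
prices = re.findall(r'\$[\d,]+\.?\d*', text)
print(prices)
```

This keeps the regex small, since it never has to deal with tags, attributes, or nesting.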

Performance Tips

# Compile the pattern if you use it many times
email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

texts = ["a@example.com", "no email here"]  # sample input
for text in texts:
    emails = email_pattern.findall(text)
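
A slightly fuller sketch of the same idea: the pattern is compiled once at module level and reused across many pages, while a hypothetical helper, `collect_emails`, removes duplicates. Python's `re` module also caches recently used patterns internally, so the speed gain from compiling is usually modest. The main benefit is often a named, reusable pattern object.

```python
import re

# Compiled once at module level, reused for every page
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+\.\w+')

def collect_emails(pages):
    # Hypothetical helper: returns unique emails in first-seen order
    seen = {}
    for page in pages:
        for m in EMAIL_RE.finditer(page):
            # Lowercase so differently cased duplicates collapse
            seen.setdefault(m.group().lower(), None)
    return list(seen)

# Made-up page texts for illustration.
pages = [
    "Sales: sales@example.com",
    "Support: help@example.com, SALES@example.com",
]
print(collect_emails(pages))
```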

VinaProxy + Data Extraction

Scrape raw data, then extract it with regex
Process large volumes
Priced at only $0.5/GB

