Regex in Web Scraping: Extracting Data with Pattern Matching

Regular expressions (regex) are a powerful tool for extracting patterns from text. This article shows how to use regex for scraping.

When to Use Regex?

  • Extract emails, phones, URLs
  • Parse structured text (without HTML tags)
  • Clean and normalize data
  • Validate formats
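
The last two items, cleaning and validating, can be sketched in a few lines. The helper names `is_iso_date` and `normalize_spaces` are hypothetical and chosen only for this illustration. The validator checks the shape of the string, not whether the date exists on the calendar.

```python
import re

# Hypothetical validator: YYYY-MM-DD shape only, not calendar validity.
ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def is_iso_date(value: str) -> bool:
    """Return True when the whole string has the YYYY-MM-DD shape."""
    return ISO_DATE.fullmatch(value) is not None

def normalize_spaces(value: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r'\s+', ' ', value).strip()

print(is_iso_date("2024-01-31"))                 # True
print(is_iso_date("due 2024-01-31"))             # False
print(normalize_spaces("  iPhone   15\n Pro "))  # iPhone 15 Pro
```

`fullmatch` is used instead of `search` so that extra text around a valid-looking date is rejected rather than silently accepted.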

Python re Module

import re

text = "Contact: email@example.com or call 0912-345-678"

# Find all emails
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
print(emails)  # ['email@example.com']

# Find all phone numbers
phones = re.findall(r'\d{4}-\d{3}-\d{3}', text)
print(phones)  # ['0912-345-678']

Common Patterns

# Email
r'[\w\.-]+@[\w\.-]+\.\w+'

# Phone (VN)
r'0\d{9,10}'
r'(?:\+84|0)\d{9,10}'  # non-capturing group, so findall returns the full number

# URL
r'https?://[^\s<>"{}|\\^`\[\]]+'

# Price
r'\$[\d,]+\.?\d*'
r'[\d,.]+\s*(?:đ|VND|VNĐ)'  # non-capturing group, so findall returns amount and unit

# Date
r'\d{1,2}/\d{1,2}/\d{4}'
r'\d{4}-\d{2}-\d{2}'
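
As a rough check of how these patterns behave together, the sketch below runs them over one sample string. The sample text and the dictionary keys are invented for illustration. The alternations are written as non-capturing groups (`(?:...)`) because `re.findall` returns only the group contents when a pattern contains a capturing group.

```python
import re

# Sample text invented for illustration.
sample = (
    "Order placed 2024-03-15 (delivery by 20/03/2024). "
    "Total: 1,250,000 VND, or about $49.99. "
    "Details: https://example.com/orders/42 "
    "Hotline: 0912345678"
)

patterns = {
    "date_iso": r'\d{4}-\d{2}-\d{2}',
    "date_dmy": r'\d{1,2}/\d{1,2}/\d{4}',
    # Non-capturing groups keep findall returning the whole match.
    "price_vnd": r'[\d,.]+\s*(?:đ|VND|VNĐ)',
    "price_usd": r'\$[\d,]+\.?\d*',
    "url": r'https?://[^\s<>"{}|\\^`\[\]]+',
    "phone_vn": r'(?:\+84|0)\d{9,10}',
}

# Map each pattern name to the list of matches found in the sample.
results = {name: re.findall(p, sample) for name, p in patterns.items()}
for name, found in results.items():
    print(name, found)
```

These patterns are deliberately loose. On real pages they are best treated as a first pass, followed by stricter validation of each candidate.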

Extract Groups

text = "Product: iPhone 15 Pro - Price: $999"

# Named groups
pattern = r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)'
match = re.search(pattern, text)

if match:
    print(match.group('name'))   # iPhone 15 Pro
    print(match.group('price'))  # 999
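
When a page lists several products, the same named groups can be combined with `finditer` to build one record per match. This is a minimal sketch; the listing text is invented for illustration.

```python
import re

# Listing text invented for illustration.
listing = """Product: iPhone 15 Pro - Price: $999
Product: Galaxy S24 - Price: $799
Product: Pixel 8 - Price: $699"""

pattern = re.compile(r'Product: (?P<name>.+) - Price: \$(?P<price>\d+)')

# One dict per match; '.' does not cross newlines, so each line matches separately.
products = [
    {"name": m.group("name"), "price": int(m.group("price"))}
    for m in pattern.finditer(listing)
]
print(products)
```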

Replace and Clean

text = "Price: 1,234,567 VND"

# Remove commas
clean = re.sub(r',', '', text)
# "Price: 1234567 VND"

# Extract number only
price = re.search(r'[\d,]+', text).group()
price = int(price.replace(',', ''))
# 1234567
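
The two steps above can be folded into one helper that also copes with dot-separated amounts such as "1.234.567đ". The function name `parse_vnd` is hypothetical. The sketch assumes both ',' and '.' are thousands separators, which holds for whole-dong prices but not for amounts with decimals.

```python
import re
from typing import Optional

def parse_vnd(text: str) -> Optional[int]:
    """Return the first VND amount in text as an int, or None if none is found.

    Assumption: ',' and '.' are both thousands separators.
    """
    match = re.search(r'(\d[\d,.]*)\s*(?:đ|VND|VNĐ)', text)
    if match is None:
        return None
    # Strip separators before converting to an integer.
    digits = re.sub(r'[,.]', '', match.group(1))
    return int(digits)

print(parse_vnd("Price: 1,234,567 VND"))  # 1234567
print(parse_vnd("Giá: 1.234.567đ"))       # 1234567
print(parse_vnd("Out of stock"))          # None
```

Returning `None` instead of calling `.group()` directly avoids an `AttributeError` when a page has no price at all.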

Multiline Matching

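html = '''
<div class="product">
  <h2>Product Name</h2>
  <span class="price">$99</span>
</div>
'''

# The DOTALL flag lets . match newlines.
# The tag and class names above are assumed for illustration.
pattern = r'<div class="product">(.*?)</div>'
match = re.search(pattern, html, re.DOTALL)
print(match.group(1))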
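
A related flag is `re.MULTILINE`, which makes `^` and `$` anchor at every line instead of only at the start and end of the whole string. The sketch below uses it on a plain-text block invented for illustration.

```python
import re

# Plain-text block invented for illustration.
block = """Name: Widget A
Price: $10
Name: Widget B
Price: $25"""

# With re.MULTILINE, ^ and $ match at the start and end of each line.
names = re.findall(r'^Name: (.+)$', block, re.MULTILINE)
prices = re.findall(r'^Price: \$(\d+)$', block, re.MULTILINE)

print(names)   # ['Widget A', 'Widget B']
print(prices)  # ['10', '25']
```

The two flags are independent and can be combined as `re.DOTALL | re.MULTILINE` when a pattern needs both behaviors.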

Regex vs BeautifulSoup

Task                    Use
Parse HTML structure    BeautifulSoup
Extract from text       Regex
Complex HTML            BeautifulSoup
Simple patterns         Regex
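
In practice the two are often combined: a parser handles the HTML structure, and regex runs on the extracted text. The sketch below illustrates that split. To stay free of third-party dependencies it uses the standard library's `html.parser.HTMLParser` as a stand-in for BeautifulSoup. The `TextCollector` class and the sample markup are invented for illustration.

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text nodes of a document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text nodes.
        if data.strip():
            self.chunks.append(data.strip())

# Sample markup invented for illustration.
page = (
    '<div class="item"><h2>Widget A</h2>'
    '<p>Contact: sales@example.com</p><p>$19.99</p></div>'
)

# Step 1: the parser deals with the HTML structure.
collector = TextCollector()
collector.feed(page)
collector.close()
text = " ".join(collector.chunks)

# Step 2: regex extracts simple patterns from plain text.
emails = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)
prices = re.findall(r'\$[\d,]+\.?\d*', text)

print(text)
print(emails, prices)
```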

Performance Tips

# Compile the pattern once if it is used many times
email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')

for text in texts:  # texts: an iterable of strings to scan
    emails = email_pattern.findall(text)
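
To see whether compiling matters for a given workload, both approaches can be timed with `timeit`. This is a rough sketch with synthetic input; the function names are hypothetical. The `re` module caches recently used patterns internally, so the difference may be small and will vary by machine.

```python
import re
import timeit

# Synthetic input invented for illustration.
texts = ["Contact: user%d@example.com" % i for i in range(1000)]
raw = r'[\w.-]+@[\w.-]+\.\w+'
compiled = re.compile(raw)

def with_module_function():
    # Passes the pattern string on every call; re looks it up in its cache.
    return [re.findall(raw, t) for t in texts]

def with_compiled_pattern():
    # Reuses the compiled pattern object directly.
    return [compiled.findall(t) for t in texts]

# Both approaches must produce identical results.
same_output = with_module_function() == with_compiled_pattern()

t_module = timeit.timeit(with_module_function, number=20)
t_compiled = timeit.timeit(with_compiled_pattern, number=20)
print(f"same output: {same_output}")
print(f"module-level: {t_module:.4f}s  compiled: {t_compiled:.4f}s")
```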

VinaProxy + Data Extraction

  • Scrape raw data, then extract it with regex
  • Process large volumes
  • Only $0.5/GB
