BeautifulSoup for Web Scraping: A Guide from A to Z

BeautifulSoup is one of the most widely used Python libraries for parsing HTML. This article walks through it from the basics to more advanced usage.

Installation

pip install beautifulsoup4 requests lxml

Basic Structure

import requests
from bs4 import BeautifulSoup

# Fetch HTML
url = "https://example.com"
response = requests.get(url)

# Parse HTML
soup = BeautifulSoup(response.text, 'lxml')

# Get the title
print(soup.title.text)
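
The snippet above assumes the request succeeds and that the page actually has a <title>. A slightly more defensive sketch could look like the one below. The function names fetch_html and page_title and the 10-second timeout are illustrative choices, not part of BeautifulSoup or requests. The built-in 'html.parser' is used here so the example runs without lxml installed; swap in 'lxml' if you have it.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


def page_title(html):
    """Return the stripped <title> text, or None if the page has no title."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)
```

Typical usage would be page_title(fetch_html("https://example.com")), which keeps the network call separate from the parsing step so the parsing can be tested on saved HTML.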

Finding Elements

find() – Find one element

# By tag
soup.find('h1')

# By class
soup.find('div', class_='product')

# By id
soup.find('div', id='main')

# By attribute
soup.find('a', href='/about')
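
To see these lookups on a concrete document, here is a small self-contained sketch. The HTML string is made up for illustration. It also shows two things the snippets above do not: attributes such as data-id, which are not valid Python keyword names, can be matched through the attrs dictionary, and find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="main">
  <h1>Catalog</h1>
  <div class="product" data-id="42">Keyboard</div>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1")
product = soup.find("div", class_="product")

# data-id cannot be written as a keyword argument, so it goes in attrs.
by_data_id = soup.find("div", attrs={"data-id": "42"})

# There is no <table> in the document, so find() returns None.
missing = soup.find("table")

print(heading.get_text())
print(by_data_id.get_text())
print(missing)
```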

find_all() – Find multiple elements

# All links
links = soup.find_all('a')
for link in links:
    print(link.get('href'))

# Elements with any of several classes
soup.find_all('div', class_=['item', 'product'])
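
A short sketch of how these find_all() filters behave on a made-up document. Passing a list to class_ matches elements that carry any of the listed classes, not all of them. The href=True filter and the limit argument are also shown, since they come up often when collecting links.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div class="item">A</div>
<div class="product featured">B</div>
<div class="banner">C</div>
<a href="/one">one</a>
<a>no href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A list for class_ matches elements carrying ANY of the listed classes.
either = soup.find_all("div", class_=["item", "product"])
either_text = [div.get_text() for div in either]

# href=True keeps only <a> tags that actually have an href attribute,
# so indexing with ["href"] is safe here.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# limit stops the search after the given number of matches.
first_div = soup.find_all("div", limit=1)

print(either_text)
print(hrefs)
print(len(first_div))
```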

CSS Selectors

# select_one: the first matching element
soup.select_one('div.product > h2')

# select: all matching elements
products = soup.select('ul.products li.item')

# Nested selectors
soup.select('div#main article.post h2.title')
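
The difference between the child combinator (>) and the descendant combinator (a space) is easy to miss, so here is a minimal sketch on a made-up document. It also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = """
<div id="main">
  <div class="product"><h2>Direct child</h2></div>
  <div class="product"><section><h2>Nested deeper</h2></section></div>
  <ul class="products">
    <li class="item">First</li>
    <li class="item">Second</li>
    <li class="sold-out">Third</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" matches direct children only, so the h2 inside <section> is skipped.
direct = [h2.get_text() for h2 in soup.select("div.product > h2")]

# A space matches descendants at any depth.
any_depth = [h2.get_text() for h2 in soup.select("div.product h2")]

# Only <li> elements with class "item" inside ul.products.
items = [li.get_text() for li in soup.select("ul.products li.item")]

# select_one returns None when nothing matches; select returns an empty list.
no_element = soup.select_one("table.prices")
no_elements = soup.select("table.prices")
```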

Extracting Data

# Text content
element.text
element.get_text(strip=True)

# Attributes
link.get('href')
img.get('src')
div.get('data-id')

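# Outer HTML (the element including its own tag)
str(element)

# Inner HTML (the element's children only)
element.decode_contents()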
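
The names element, link, img and div above are placeholders for tags you have already found. The sketch below runs the same calls on a made-up snippet so the results can be compared side by side. One detail worth noticing: get_text(strip=True) strips each text fragment and joins them with no separator, so passing a separator is often what you want when the element contains nested tags.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = '<div class="card" data-id="7"><p> Hello <b>world</b> </p><a>no link</a></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")
p = soup.find("p")
a = soup.find("a")

raw_text = p.text                          # whitespace kept exactly as in the source
clean_text = p.get_text(strip=True)        # fragments stripped, joined with no separator
spaced_text = p.get_text(" ", strip=True)  # fragments stripped, joined with a space

data_id = div.get("data-id")   # attribute value as a string
missing = a.get("href")        # None, because this <a> has no href
fallback = a.get("href", "#")  # a default can be supplied instead
classes = div.get("class")     # class is multi-valued, so this is a list

outer_html = str(p)               # includes the <p> tag itself
inner_html = p.decode_contents()  # only what is inside <p>
```

Using a["href"] instead of a.get("href") would raise KeyError for the missing attribute, which is why get() is the safer default when scraping pages you do not control.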

Practical Example: Scraping Products

import requests
from bs4 import BeautifulSoup

url = "https://shop.example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

products = []
for item in soup.select('.product-card'):
    product = {
        'name': item.select_one('.title').text.strip(),
        'price': item.select_one('.price').text.strip(),
        'url': item.select_one('a').get('href'),
        'image': item.select_one('img').get('src')
    }
    products.append(product)

print(f"Found {len(products)} products")
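
The loop above raises AttributeError as soon as one card lacks a .title, .price, link or image, because select_one() returns None in that case. One possible way to make it tolerant is sketched below. The helpers text_or_none, attr_or_none and parse_products are hypothetical names introduced for this example, and the selectors are the same placeholder ones used above. Relative links are resolved with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def attr_or_none(parent, selector, attr):
    """Return an attribute of the first match, or None if tag or attribute is missing."""
    node = parent.select_one(selector)
    return node.get(attr) if node is not None else None


def parse_products(html, base_url):
    """Extract product dictionaries from a listing page, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".product-card"):
        href = attr_or_none(item, "a", "href")
        src = attr_or_none(item, "img", "src")
        products.append({
            "name": text_or_none(item, ".title"),
            "price": text_or_none(item, ".price"),
            # Relative links are resolved against the page URL.
            "url": urljoin(base_url, href) if href else None,
            "image": urljoin(base_url, src) if src else None,
        })
    return products
```

Calling parse_products(response.text, url) with the response from the example above would give the same list of dictionaries, with None in place of any field a card does not have.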

Parser Comparison

Parser	Speed	Leniency with malformed HTML
lxml	Fastest	Yes
html.parser	Medium	Yes
html5lib	Slowest	Best
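
Of the three, only html.parser ships with Python; lxml and html5lib are separate packages. If your code may run where lxml is not installed, one option is to fall back to the built-in parser. The sketch below is one way to do that; make_soup is a hypothetical helper name. BeautifulSoup raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred=("lxml", "html.parser")):
    """Parse html with the first parser from `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Works with or without lxml installed; the unclosed tags are repaired either way.
soup = make_soup("<p>Unclosed <b>tag")
print(soup.find("b").get_text())
```

Keep in mind that different parsers can build slightly different trees from the same broken HTML, so it is worth testing your selectors against the parser you actually deploy with.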

Tips

Use lxml for speed
CSS selectors are often easier to read than chained find() calls
Always handle None (find() and select_one() return None when no element matches)
Use get_text(strip=True) for clean text

VinaProxy + BeautifulSoup

Combine BeautifulSoup with a proxy to scrape at larger scale
Avoid IP bans when crawling many pages
Pricing: only $0.5/GB
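
BeautifulSoup itself never touches the network, so the proxy is configured on the HTTP client. A minimal sketch with requests follows. The helper name make_proxy_session is made up for this example, and the proxy URL shown in the usage note is a placeholder, not a real VinaProxy endpoint; take the actual host, port and credentials from your provider's dashboard.

```python
import requests


def make_proxy_session(proxy_url):
    """Return a requests.Session that sends HTTP and HTTPS traffic through proxy_url."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session
```

With a session created as make_proxy_session("http://user:pass@proxy.example.com:8080"), a call such as session.get(url, timeout=10) goes through the proxy, and response.text can be handed to BeautifulSoup exactly as before.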


admin