AI in Web Scraping: The Future of Data Extraction
AI is changing how we scrape. This article introduces AI-powered scraping and its practical applications.
What Problems Does AI Solve?
- Dynamic selectors: AI locates elements automatically
- Unstructured data: extraction from free text
- Layout changes: automatic adaptation when a site changes
- CAPTCHA: AI-based solvers
- Data cleaning: NLP for text processing
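Most of the techniques below send raw HTML to a model, so stripping scripts, styles, and markup first saves a lot of tokens. A minimal pre-cleaning sketch using only the standard library (the `visible_text` helper is this article's own, not from any scraping framework):

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    return ' '.join(parser.parts)
```

Passing `visible_text(html)` instead of raw HTML to a model keeps the prompt focused on content rather than markup.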
1. LLM-Based Extraction

```python
import requests
from openai import OpenAI

client = OpenAI()

def extract_with_gpt(html_content, prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data extraction expert."},
            {"role": "user", "content": f"""
Extract the following from this HTML:
{prompt}

HTML:
{html_content[:5000]}

Return as JSON.
"""}
        ]
    )
    return response.choices[0].message.content

# Usage
html = requests.get('https://shop.com/product').text
data = extract_with_gpt(html, "product name, price, description")
print(data)
```
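Models often wrap the JSON answer in a Markdown fence or surround it with commentary, so the reply should not go straight into `json.loads`. A small defensive parser (the `parse_llm_json` helper and its fallback behavior are this article's sketch, not part of the OpenAI SDK):

```python
import json
import re

def parse_llm_json(reply):
    """Extract the first JSON object from an LLM reply, tolerating code fences."""
    # Strip Markdown code fences such as ```json ... ```
    cleaned = re.sub(r'```(?:json)?', '', reply).strip()
    # Fall back to the first {...} block if extra prose surrounds the JSON
    match = re.search(r'\{.*\}', cleaned, re.DOTALL)
    if not match:
        raise ValueError("No JSON object found in model reply")
    return json.loads(match.group(0))
```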
2. Vision Models for Screenshots

```python
import base64
from openai import OpenAI

client = OpenAI()

def extract_from_screenshot(image_path):
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all product info from this screenshot. Return as JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
            ]
        }]
    )
    return response.choices[0].message.content

# Take a screenshot, then extract
page.screenshot(path='product.png')
data = extract_from_screenshot('product.png')
```
3. NER for Entity Extraction

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    entities = {
        'people': [],
        'organizations': [],
        'locations': [],
        'dates': [],
        'money': []
    }
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            entities['people'].append(ent.text)
        elif ent.label_ == 'ORG':
            entities['organizations'].append(ent.text)
        elif ent.label_ in ['GPE', 'LOC']:
            entities['locations'].append(ent.text)
        elif ent.label_ == 'DATE':
            entities['dates'].append(ent.text)
        elif ent.label_ == 'MONEY':
            entities['money'].append(ent.text)
    return entities

# Extract from article text
text = soup.select_one('.article-content').text
entities = extract_entities(text)
```
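Scraped articles tend to repeat the same entities many times, so the raw lists are more useful once deduplicated and counted. A small post-processing sketch (the `summarize_entities` helper is illustrative, not part of spaCy):

```python
from collections import Counter

def summarize_entities(entities):
    """Collapse each entity list into (value, count) pairs, most frequent first."""
    return {
        label: Counter(values).most_common()
        for label, values in entities.items()
    }
```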
4. Sentiment Analysis

```python
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def analyze_reviews(reviews):
    results = []
    for review in reviews:
        result = sentiment(review['text'][:512])[0]
        results.append({
            'text': review['text'],
            'sentiment': result['label'],
            'score': result['score']
        })
    return results

# Analyze scraped reviews
reviews = scrape_reviews('https://shop.com/product/reviews')
analyzed = analyze_reviews(reviews)
```
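Per-review labels become more actionable once aggregated across a product. The helper below (a sketch of this article's own, not part of transformers) computes the share of positive reviews and the mean model confidence:

```python
def summarize_sentiment(analyzed):
    """Aggregate per-review sentiment into overall statistics."""
    if not analyzed:
        return {'positive_ratio': 0.0, 'avg_score': 0.0}
    positive = sum(1 for r in analyzed if r['sentiment'] == 'POSITIVE')
    return {
        'positive_ratio': positive / len(analyzed),
        'avg_score': sum(r['score'] for r in analyzed) / len(analyzed),
    }
```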
5. Auto-Selector Generation

```python
# Tools like Diffbot and Import.io use AI to auto-generate selectors.
# Example concept (call_gpt is a placeholder for any LLM call):
def ai_find_products(html):
    prompt = f"""
    Analyze this HTML and identify:
    1. CSS selector for product container
    2. CSS selector for product name
    3. CSS selector for price
    4. CSS selector for image

    HTML sample:
    {html[:3000]}
    """
    selectors = call_gpt(prompt)
    return selectors

# AI returns:
# {
#     "container": ".product-card",
#     "name": ".product-card h2",
#     "price": ".product-card .price",
#     "image": ".product-card img"
# }
```
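AI-generated selectors should be validated before a scraper trusts them: at minimum, confirm the reply contains every expected key and that each value is a non-empty string. A defensive sketch (the key names follow the example above; the helper is this article's own):

```python
REQUIRED_KEYS = {'container', 'name', 'price', 'image'}

def validate_selectors(selectors):
    """Return True if every required selector is present and non-empty."""
    if not isinstance(selectors, dict):
        return False
    return all(
        isinstance(selectors.get(key), str) and selectors[key].strip()
        for key in REQUIRED_KEYS
    )
```

In production you would also run each selector against a sample page and check it matches at least one element before committing to it.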
Challenges
- Cost: API calls get expensive at scale
- Speed: LLM inference is much slower than regex or CSS selectors
- Accuracy: outputs are not 100% reliable and need validation
- Rate limits: API quotas cap throughput
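The cost point is easy to quantify: at roughly 4 characters per token, a 5,000-character HTML snippet is about 1,250 input tokens per page. A back-of-the-envelope estimator plus a simple response cache that avoids paying twice for identical pages (the price per 1K tokens is a placeholder, check your provider's current pricing):

```python
import hashlib

def estimate_cost(pages, chars_per_page=5000, usd_per_1k_tokens=0.01):
    """Rough input-token cost estimate: ~4 characters per token."""
    tokens_per_page = chars_per_page / 4
    return pages * tokens_per_page / 1000 * usd_per_1k_tokens

# Cache by content hash so a re-crawl of an unchanged page costs nothing
_cache = {}

def cached_extract(html, extract_fn):
    key = hashlib.sha256(html.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(html)
    return _cache[key]
```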
VinaProxy + AI Scraping
- Reliable data collection for AI processing
- Scales data gathering
- Only $0.5/GB
