AI in Web Scraping: The Future of Data Extraction

AI is changing how we scrape. This article introduces AI-powered scraping and its practical applications.

What Problems Does AI Solve?

  • Dynamic selectors: AI locates elements automatically
  • Unstructured data: extraction from free text
  • Layout changes: automatic adaptation when a site changes
  • CAPTCHA: AI-based solvers
  • Data cleaning: NLP for text processing

1. LLM-Based Extraction

import requests
from openai import OpenAI

client = OpenAI()

def extract_with_gpt(html_content, prompt):
    # The legacy openai.ChatCompletion API was removed in openai>=1.0;
    # the client interface below is the current equivalent.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data extraction expert."},
            {"role": "user", "content": f"""
                Extract the following from this HTML:
                {prompt}

                HTML:
                {html_content[:5000]}

                Return as JSON.
            """}
        ]
    )
    return response.choices[0].message.content

# Usage
html = requests.get('https://shop.com/product').text
data = extract_with_gpt(html, "product name, price, description")
print(data)
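A practical caveat: even when the prompt asks for JSON, models often wrap the reply in a markdown code fence, so it pays to parse defensively before using the result. A minimal sketch (the `parse_llm_json` helper is an illustration, not part of any library):

```python
import json

def parse_llm_json(raw):
    """Parse JSON from an LLM reply, stripping a markdown code fence if present."""
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # drop the opening fence line and, if present, the closing fence line
        if lines and lines[-1].strip() == "```":
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    return json.loads(text)
```

This lets you work with `product = parse_llm_json(data)` as a dict instead of trusting the raw string.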

2. Vision Models for Screenshots

import base64
from openai import OpenAI

client = OpenAI()

def extract_from_screenshot(image_path):
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode()
    
    response = client.chat.completions.create(
        model="gpt-4o",  # replaces the deprecated gpt-4-vision-preview
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all product info from this screenshot. Return as JSON."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_data}"}}
            ]
        }]
    )
    return response.choices[0].message.content

# Take a screenshot with a browser automation tool (here a Playwright-style page object), then extract
page.screenshot(path='product.png')
data = extract_from_screenshot('product.png')

3. NER for Entity Extraction

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    
    entities = {
        'people': [],
        'organizations': [],
        'locations': [],
        'dates': [],
        'money': []
    }
    
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            entities['people'].append(ent.text)
        elif ent.label_ == 'ORG':
            entities['organizations'].append(ent.text)
        elif ent.label_ in ['GPE', 'LOC']:
            entities['locations'].append(ent.text)
        elif ent.label_ == 'DATE':
            entities['dates'].append(ent.text)
        elif ent.label_ == 'MONEY':
            entities['money'].append(ent.text)
    
    return entities

# Extract from scraped article text (soup is a BeautifulSoup parse of the page)
text = soup.select_one('.article-content').text
entities = extract_entities(text)
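spaCy emits an entity for every mention, so a scraped article usually yields many duplicates per label; a de-duplication pass is a common follow-up. A small sketch (`dedupe_entities` is a hypothetical helper, not part of spaCy):

```python
def dedupe_entities(entities):
    """Drop repeated mentions per label while preserving first-seen order."""
    return {label: list(dict.fromkeys(values)) for label, values in entities.items()}
```

For example, `dedupe_entities({'people': ['Alice', 'Alice', 'Bob']})` keeps a single 'Alice'.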

4. Sentiment Analysis

from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def analyze_reviews(reviews):
    results = []
    for review in reviews:
        result = sentiment(review['text'][:512])[0]
        results.append({
            'text': review['text'],
            'sentiment': result['label'],
            'score': result['score']
        })
    return results

# Analyze scraped reviews (scrape_reviews is a placeholder for your own scraper)
reviews = scrape_reviews('https://shop.com/product/reviews')
analyzed = analyze_reviews(reviews)
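Per-review labels are rarely the end goal; an aggregate view is usually what gets reported. A hedged sketch of summarizing the analyzed list (assumes the POSITIVE/NEGATIVE labels the default pipeline model emits; `summarize_sentiment` is an illustration):

```python
from collections import Counter

def summarize_sentiment(analyzed):
    """Aggregate per-review sentiment labels into overall percentages."""
    counts = Counter(r['sentiment'] for r in analyzed)
    total = len(analyzed)
    if total == 0:
        return {'total': 0, 'positive_pct': 0.0, 'negative_pct': 0.0}
    return {
        'total': total,
        'positive_pct': round(100 * counts['POSITIVE'] / total, 1),
        'negative_pct': round(100 * counts['NEGATIVE'] / total, 1),
    }
```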

5. Auto-Selector Generation

# Tools like Diffbot and Import.io use AI to auto-generate selectors
# Example concept:

def ai_find_products(html):
    prompt = f"""
    Analyze this HTML and identify:
    1. CSS selector for product container
    2. CSS selector for product name
    3. CSS selector for price
    4. CSS selector for image
    
    HTML sample:
    {html[:3000]}
    """
    
    selectors = call_gpt(prompt)  # call_gpt: placeholder for any LLM API call
    return selectors

# The AI might return:
# {
#   "container": ".product-card",
#   "name": ".product-card h2",
#   "price": ".product-card .price",
#   "image": ".product-card img"
# }
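Output like this should not be trusted blindly: a quick structural check before using AI-generated selectors catches most malformed replies. A minimal sketch (`validate_selectors` and the required-keys set are assumptions for this example):

```python
REQUIRED_KEYS = {"container", "name", "price", "image"}

def validate_selectors(selectors):
    """Return True only if every required selector is present and non-empty."""
    if not isinstance(selectors, dict):
        return False
    if not REQUIRED_KEYS.issubset(selectors):
        return False
    return all(isinstance(v, str) and v.strip() for v in selectors.values())
```

A stricter version could also run each selector against the sampled HTML and require at least one match.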

Challenges

  • Cost: API calls are expensive at scale
  • Speed: AI inference is slower than regex or CSS selectors
  • Accuracy: results are not 100% reliable
  • Rate limits: API quotas cap throughput
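The cost and speed problems are commonly mitigated by calling the AI once per site and caching what it returns, then scraping with the cheap cached selectors afterwards. A sketch of that pattern (the cache dict and `ai_generate` callback are assumptions, not a specific library API):

```python
def get_selectors(cache, domain, html, ai_generate):
    """Return cached selectors for a domain; pay for an AI call only on a cache miss."""
    if domain not in cache:
        cache[domain] = ai_generate(html)  # one expensive call per domain, not per page
    return cache[domain]
```

Subsequent pages from the same domain reuse the cached selectors, so AI cost stays roughly constant as page volume grows.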

VinaProxy + AI Scraping

  • Reliable data collection for AI processing
  • Scale up your data gathering
  • From just $0.5/GB

Try It Now →