Lead Generation with Web Scraping: Collecting Leads Automatically


Web scraping is a powerful tool for collecting sales and marketing leads. This article walks through building a lead scraper.

Common Lead Sources

  • Business directories: Yellow Pages, Yelp
  • LinkedIn: Professional contacts
  • Google Maps: Local businesses
  • Industry websites: Company listings
  • Job boards: Hiring companies

Data Points to Collect

  • Company name
  • Website URL
  • Email addresses
  • Phone numbers
  • Address/Location
  • Industry/Category
  • Company size
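These fields can be modeled as one small record type so every scraped lead has the same shape. A minimal sketch; the Lead class and its field names are illustrative, not part of the scraper code:

```python
from dataclasses import dataclass, field

@dataclass
class Lead:
    # One record per scraped business; only the name is required
    name: str
    website: str = ''
    emails: list = field(default_factory=list)
    phones: list = field(default_factory=list)
    address: str = ''
    category: str = ''
    company_size: str = ''

# Example record (hypothetical company)
lead = Lead(name='Acme Co', website='https://acme.example',
            emails=['info@acme.example'])
print(lead.name)      # Acme Co
print(lead.phones)    # []
```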

Scraper Code

import csv
import re
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': '...'}  # fill in a real User-Agent string

def extract_emails(text):
    # Basic email pattern; catches most addresses in raw page text
    pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
    return list(set(re.findall(pattern, text)))

def extract_phones(text):
    # Vietnamese local numbers: leading 0 plus 9-10 more digits
    pattern = r'0\d{9,10}'
    return list(set(re.findall(pattern, text)))

def text_or_empty(soup, selector):
    # select_one returns None when the element is missing
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ''

def scrape_business(url):
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    website_link = soup.select_one('a.website')
    return {
        'name': text_or_empty(soup, '.business-name'),
        'website': website_link['href'] if website_link else '',
        'emails': extract_emails(response.text),
        'phones': extract_phones(response.text),
        'address': text_or_empty(soup, '.address'),
        'category': text_or_empty(soup, '.category')
    }

# Scrape multiple listing pages
leads = []
for page in range(1, 11):
    url = f'https://directory.com/category?page={page}'
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')

    for listing in soup.select('.listing'):
        link = listing.select_one('a')
        if not link or not link.get('href'):
            continue
        detail_url = urljoin(url, link['href'])  # resolve relative links
        lead = scrape_business(detail_url)
        leads.append(lead)
        print(f"Scraped: {lead['name']}")
        time.sleep(1)  # throttle requests to stay polite

# Export to CSV
if leads:
    with open('leads.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=leads[0].keys())
        writer.writeheader()
        writer.writerows(leads)
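Directory listings frequently repeat the same company across pages, so it is worth deduplicating the leads list before export. A minimal sketch keyed on the website domain, falling back to the lowercased company name (dedupe_leads is an illustrative helper, not part of the scraper above):

```python
from urllib.parse import urlparse

def dedupe_leads(leads):
    """Keep the first occurrence of each lead, keyed by website domain or name."""
    seen = set()
    unique = []
    for lead in leads:
        domain = urlparse(lead.get('website', '')).netloc
        key = domain or lead.get('name', '').lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(lead)
    return unique

# Hypothetical sample data
leads = [
    {'name': 'Acme', 'website': 'https://acme.example/home'},
    {'name': 'ACME', 'website': 'https://acme.example/about'},  # same domain
    {'name': 'Beta', 'website': ''},
]
print(len(dedupe_leads(leads)))  # 2
```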

Email Discovery

def find_company_emails(domain):
    # Try the company's contact pages first
    contact_urls = [
        f'https://{domain}/contact',
        f'https://{domain}/contact-us',
        f'https://{domain}/lien-he'
    ]

    for url in contact_urls:
        try:
            response = requests.get(url, headers=HEADERS, timeout=10)
            emails = extract_emails(response.text)
            if emails:
                return emails
        except requests.RequestException:
            continue

    # Fall back to common role-based patterns
    # (these are unverified guesses; validate before using)
    return [
        f'info@{domain}',
        f'contact@{domain}',
        f'sales@{domain}',
        f'hello@{domain}'
    ]

Data Validation

import dns.resolver    # pip install dnspython
import dns.exception

def validate_email_domain(email):
    # A domain with an MX record can receive mail
    domain = email.split('@')[1]
    try:
        dns.resolver.resolve(domain, 'MX')
        return True
    except dns.exception.DNSException:
        return False

# Filter valid emails
valid_leads = [
    lead for lead in leads 
    if lead['emails'] and validate_email_domain(lead['emails'][0])
]
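The phone pattern used earlier matches Vietnamese local numbers with a leading 0; for CRM import it often helps to normalize them to E.164 (+84…). A minimal sketch assuming Vietnamese numbers only; normalize_vn_phone is an illustrative helper:

```python
import re

def normalize_vn_phone(raw):
    """Convert a Vietnamese local number (0xxxxxxxxx) to E.164 (+84xxxxxxxxx)."""
    digits = re.sub(r'\D', '', raw)        # strip spaces, dots, dashes
    if digits.startswith('84'):
        digits = '0' + digits[2:]          # country code already present
    if re.fullmatch(r'0\d{9,10}', digits):
        return '+84' + digits[1:]
    return None                            # not a recognizable VN number

print(normalize_vn_phone('090 123 4567'))  # +84901234567
print(normalize_vn_phone('84901234567'))   # +84901234567
```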

Ethical Considerations

  • Respect robots.txt and ToS
  • Comply with GDPR/privacy laws
  • Don’t spam collected emails
  • Provide opt-out options
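The first point can be checked programmatically with the standard library's urllib.robotparser. A small offline sketch that parses a sample robots.txt body directly; directory.com and the rules shown are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (no network call) and query it
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch('LeadScraperBot', 'https://directory.com/category'))    # True
print(rp.can_fetch('LeadScraperBot', 'https://directory.com/private/x'))   # False
```

In production, point `rp.set_url()` at the site's real `/robots.txt` and call `rp.read()` before crawling.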

VinaProxy + Lead Generation

  • Scrape directories without getting blocked
  • Geo-targeted IPs for local leads
  • Pricing from just $0.5/GB

Try It Now →