Lead Generation with Web Scraping: Collecting Leads Automatically
Web scraping is a powerful way to collect leads for sales and marketing. This article walks through building a lead scraper.
Common Lead Sources
- Business directories: Yellow Pages, Yelp
- LinkedIn: Professional contacts
- Google Maps: Local businesses
- Industry websites: Company listings
- Job boards: Hiring companies
Data Points to Collect
- Company name
- Website URL
- Email addresses
- Phone numbers
- Address/Location
- Industry/Category
- Company size
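One way to keep these fields consistent across sources is a small record type. The sketch below is only an illustration: the `Lead` class, its field names, and the `to_row` helper are hypothetical and not part of the original scraper.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Lead:
    """One scraped company. Every field except the name is optional."""
    company_name: str
    website: Optional[str] = None
    emails: List[str] = field(default_factory=list)
    phones: List[str] = field(default_factory=list)
    address: Optional[str] = None
    category: Optional[str] = None
    company_size: Optional[str] = None

    def to_row(self) -> dict:
        """Flatten list fields into strings so the record fits one CSV row."""
        row = asdict(self)
        row['emails'] = '; '.join(self.emails)
        row['phones'] = '; '.join(self.phones)
        return row
```

A record like `Lead(company_name='Acme', emails=['a@acme.test'])` can then be passed to `csv.DictWriter` through `to_row()`.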
Scraper Code
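import csv
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': '...'}

def extract_emails(text):
    """Return the unique email-like strings found in text."""
    pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
    return list(set(re.findall(pattern, text)))

def extract_phones(text):
    """Return unique phone numbers: a leading 0 followed by 9-10 more digits."""
    pattern = r'(?<!\d)0\d{9,10}(?!\d)'  # lookarounds avoid matching inside longer digit runs
    return list(set(re.findall(pattern, text)))

def text_or_empty(soup, selector):
    """Return the stripped text of the first match, or '' if the element is missing."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else ''

def scrape_business(url):
    """Scrape one business detail page into a dict."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    website = soup.select_one('a.website')
    return {
        'name': text_or_empty(soup, '.business-name'),
        'website': website.get('href', '') if website else '',
        'emails': extract_emails(response.text),
        'phones': extract_phones(response.text),
        'address': text_or_empty(soup, '.address'),
        'category': text_or_empty(soup, '.category'),
    }

# Scrape multiple listings
leads = []
for page in range(1, 11):
    url = f'https://directory.com/category?page={page}'
    response = requests.get(url, headers=HEADERS, timeout=10)
    soup = BeautifulSoup(response.text, 'lxml')
    for listing in soup.select('.listing'):
        link = listing.select_one('a')
        if link is None or not link.get('href'):
            continue
        # Listing links are often relative, so resolve them against the page URL
        detail_url = urljoin(url, link['href'])
        try:
            lead = scrape_business(detail_url)
        except requests.RequestException as exc:
            print(f"Skipped {detail_url}: {exc}")
            continue
        leads.append(lead)
        print(f"Scraped: {lead['name']}")

# Export to CSV
if leads:
    with open('leads.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=leads[0].keys())
        writer.writeheader()
        for lead in leads:
            row = dict(lead)
            # Join list fields so each CSV cell holds plain text
            row['emails'] = '; '.join(lead['emails'])
            row['phones'] = '; '.join(lead['phones'])
            writer.writerow(row)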
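The `extract_phones` helper above only matches unbroken runs of digits starting with 0, so numbers written as `090 123 4567` or `+84 90 123 4567` are missed. A minimal sketch of a looser extractor follows, assuming Vietnamese-style numbers with country code +84. The names `normalize_phone` and `extract_phones_loose` are hypothetical, and the accepted separators (space, dot, hyphen) are an assumption.

```python
import re
from typing import List

# Optional +84/84/0 prefix, then 9-10 digits, each optionally preceded by one separator
CANDIDATE = re.compile(r'(?<![\d+])(?:\+?84|0)(?:[\s.\-]?\d){9,10}(?!\d)')


def normalize_phone(raw: str) -> str:
    """Strip separators and rewrite a leading +84/84 as the domestic 0 prefix."""
    digits = re.sub(r'[^\d+]', '', raw)
    if digits.startswith('+84'):
        digits = '0' + digits[3:]
    elif digits.startswith('84') and len(digits) >= 11:
        digits = '0' + digits[2:]
    return digits


def extract_phones_loose(text: str) -> List[str]:
    """Return unique normalized numbers in order of first appearance."""
    seen: List[str] = []
    for match in CANDIDATE.finditer(text):
        phone = normalize_phone(match.group())
        # Keep only results that fit the original 0 + 9-10 digit shape
        if re.fullmatch(r'0\d{9,10}', phone) and phone not in seen:
            seen.append(phone)
    return seen
```

Normalizing before deduplicating means the same number written in two formats is stored once.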
Email Discovery
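def find_company_emails(domain):
    """Look for published emails on a domain's common contact pages.

    Relies on requests and extract_emails() from the scraper code above.
    """
    # Scrape contact page
    contact_urls = [
        f'https://{domain}/contact',
        f'https://{domain}/contact-us',
        f'https://{domain}/lien-he',
    ]
    for url in contact_urls:
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # page unreachable, try the next one
        if not response.ok:
            continue
        emails = extract_emails(response.text)
        if emails:
            return emails

    # Nothing published: fall back to common email patterns.
    # These are unverified guesses and should be validated before use.
    return [
        f'info@{domain}',
        f'contact@{domain}',
        f'sales@{domain}',
        f'hello@{domain}',
    ]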
Data Validation
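# Requires the dnspython package (version 2.0 or later for resolve())
import dns.exception
import dns.resolver

def validate_email_domain(email):
    """Return True if the email's domain publishes at least one MX record."""
    domain = email.rsplit('@', 1)[-1]
    try:
        dns.resolver.resolve(domain, 'MX')
        return True
    except dns.exception.DNSException:
        # Covers a nonexistent domain, no MX answer, and timeouts
        return False

# Filter valid emails
valid_leads = [
    lead for lead in leads
    if lead['emails'] and validate_email_domain(lead['emails'][0])
]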
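Many leads share the same email domain, so repeating the MX lookup for every address wastes time. The sketch below wraps any domain check in a per-domain cache. `make_cached_validator` is a hypothetical helper, and the check function is passed in as a parameter so the wrapper can be tried without network access.

```python
from typing import Callable, Dict


def make_cached_validator(has_mx: Callable[[str], bool]) -> Callable[[str], bool]:
    """Wrap a domain check so each domain is looked up only once."""
    cache: Dict[str, bool] = {}

    def validate(email: str) -> bool:
        # Split on the last '@'; reject strings with no domain part
        _, sep, domain = email.rpartition('@')
        domain = domain.strip().lower()
        if not sep or not domain:
            return False
        if domain not in cache:
            cache[domain] = has_mx(domain)
        return cache[domain]

    return validate
```

In practice the wrapped function could be a DNS-based check such as `lambda d: validate_email_domain('x@' + d)`, built on the validator shown earlier.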
Ethical Considerations
- Respect robots.txt and each site's Terms of Service (ToS)
- Comply with GDPR and other applicable privacy laws
- Don’t spam collected emails
- Provide opt-out options
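The robots.txt point can be enforced in code with Python's standard `urllib.robotparser`. This is a minimal sketch: `build_robots_checker` is a hypothetical helper, and the user agent string is a placeholder. It parses robots.txt text that has already been downloaded, so it does no network access itself.

```python
from typing import Callable
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str) -> Callable[[str], bool]:
    """Return a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules: everything is allowed except paths under /private/
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
allowed = build_robots_checker(EXAMPLE_ROBOTS, 'lead-scraper')
```

Calling `allowed(url)` before each request lets the scraper skip disallowed pages. To load the rules from a live site instead, `RobotFileParser` also provides `set_url()` and `read()`.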
VinaProxy + Lead Generation
- Scrape directories without getting blocked
- Geo-targeted IPs for local leads
- Pricing at only $0.5/GB
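Routing the scraper through a proxy usually means passing a `proxies` mapping to `requests`. The sketch below only builds that mapping. The host, port, and credentials are placeholders that must come from your proxy provider's dashboard, and `build_proxies` is a hypothetical helper, not part of any VinaProxy SDK.

```python
from urllib.parse import quote


def build_proxies(host: str, port: int, username: str, password: str) -> dict:
    """Build the proxies mapping that requests accepts for HTTP and HTTPS traffic."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URL
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder values for illustration only
proxies = build_proxies('proxy.example.com', 8080, 'user', 'p@ss:word')
```

The result can be passed as `requests.get(url, proxies=proxies, timeout=10)` in the scraper's request calls.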
