Social Media Scraping: Thu Thập Dữ Liệu Từ Facebook, Instagram, TikTok
Social media chứa insights quý giá. Bài viết hướng dẫn scrape social media một cách an toàn và hiệu quả.
Use Cases
- Brand monitoring: Theo dõi mentions
- Competitor analysis: Content strategy đối thủ
- Influencer research: Tìm KOLs phù hợp
- Sentiment analysis: Phản hồi khách hàng
- Trend tracking: Hashtags, topics hot
Cảnh Báo Quan Trọng
⚠️ Social platforms có ToS strict. Chỉ scrape public data và respect rate limits!
Phương Pháp Tiếp Cận
- Official APIs: Ưu tiên khi có thể
- Third-party APIs: RapidAPI, Apify
- Direct scraping: Rủi ro cao, cần proxy tốt
Instagram Hashtag Scraping
import requests
import json
def scrape_instagram_hashtag(hashtag):
# Instagram có API internal, cần login cookies
headers = {
'User-Agent': 'Mozilla/5.0...',
'Cookie': 'your_instagram_cookies'
}
url = f'https://www.instagram.com/explore/tags/{hashtag}/?__a=1&__d=dis'
response = requests.get(url, headers=headers,
proxies={'https': 'http://proxy.vinaproxy.com:8080'})
data = response.json()
posts = []
for edge in data['data']['hashtag']['edge_hashtag_to_media']['edges']:
node = edge['node']
posts.append({
'id': node['id'],
'likes': node['edge_liked_by']['count'],
'comments': node['edge_media_to_comment']['count'],
'caption': node.get('edge_media_to_caption', {}).get('edges', [{}])[0].get('node', {}).get('text', ''),
'url': f"https://instagram.com/p/{node['shortcode']}"
})
return posts
Facebook Public Pages
from playwright.sync_api import sync_playwright
def scrape_fb_page(page_url):
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(page_url)
# Scroll để load posts
for _ in range(5):
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
page.wait_for_timeout(2000)
posts = []
for post in page.query_selector_all('[data-ad-preview="message"]'):
posts.append({
'text': post.inner_text(),
# Extract more data as needed
})
browser.close()
return posts
TikTok Trending
# Dùng unofficial API hoặc libraries
# pip install TikTokApi
from TikTokApi import TikTokApi
api = TikTokApi()
# Get trending videos
trending = api.trending.videos(count=30)
for video in trending:
print({
'author': video.author.username,
'desc': video.desc,
'likes': video.stats.diggCount,
'shares': video.stats.shareCount,
'views': video.stats.playCount
})
Twitter/X Scraping
# snscrape - no API key needed
# pip install snscrape
import snscrape.modules.twitter as sntwitter
tweets = []
for tweet in sntwitter.TwitterSearchScraper('#python').get_items():
if len(tweets) >= 100:
break
tweets.append({
'date': tweet.date,
'content': tweet.content,
'user': tweet.user.username,
'likes': tweet.likeCount
})
Best Practices
- Respect ToS và rate limits
- Chỉ scrape public data
- Dùng residential/mobile proxy
- Rotate accounts nếu cần login
- Store data securely
VinaProxy + Social Scraping
- Mobile proxies cho Instagram/TikTok
- Residential IPs trusted
- Giá chỉ $0.5/GB
