Mobile App Scraping: Collecting Data from Mobile Applications
A lot of data is available only inside mobile apps. This article walks through several ways to scrape data from them.
Challenges
- No HTML to parse
- Encrypted communications
- Certificate pinning
- Device fingerprinting
- API authentication
Approach 1: Intercept API Traffic
# Use mitmproxy to capture API calls
# 1. Install mitmproxy
pip install mitmproxy
# 2. Start proxy
mitmproxy -p 8080
# 3. Configure phone to use proxy
# - Set proxy: your-ip:8080
# - Install mitmproxy certificate
# 4. Use the app - see API calls in mitmproxy
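# 5. Export the API endpoints and replicate the calls with Python
import requests
# API endpoint captured with mitmproxy
api_url = 'https://app-api.example.com/v1/products'
headers = {
    'Authorization': 'Bearer captured_token',  # captured tokens usually expire; re-capture when needed
    'X-Device-Id': 'captured_device_id',
    'User-Agent': 'ExampleApp/1.0 Android/12'
}
response = requests.get(api_url, headers=headers, timeout=10)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
data = response.json()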
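After a capture session there are usually many near-duplicate requests (different pages, different item IDs). The sketch below is one possible way to collapse them into a short list of distinct endpoints before replicating them. The `CAPTURED` list, the `endpoint_key` and `summarize` helpers, and the rule of replacing numeric path segments with `{id}` are all illustrative assumptions, not something mitmproxy produces by itself.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical (method, URL) pairs noted down from a capture session.
CAPTURED = [
    ("GET", "https://app-api.example.com/v1/products?page=1"),
    ("GET", "https://app-api.example.com/v1/products?page=2"),
    ("GET", "https://app-api.example.com/v1/products/12345"),
    ("POST", "https://app-api.example.com/v1/cart"),
]


def endpoint_key(method, url):
    """Collapse a request into 'METHOD host/path', dropping the query string
    and replacing purely numeric path segments with '{id}'."""
    parts = urlsplit(url)
    path = re.sub(r"/\d+(?=/|$)", "/{id}", parts.path)
    return f"{method.upper()} {parts.netloc}{path}"


def summarize(captured):
    """Count how many captured requests hit each distinct endpoint."""
    return Counter(endpoint_key(method, url) for method, url in captured)


if __name__ == "__main__":
    for endpoint, hits in summarize(CAPTURED).most_common():
        print(f"{hits:>3}  {endpoint}")
```

The most frequently hit endpoints are usually the listing or search calls, which tend to be the ones worth replicating first.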
Approach 2: Frida (Advanced)
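# Frida - dynamic instrumentation toolkit
# Bypass SSL pinning and hook functions
# Install (the device or emulator typically also needs a matching frida-server running)
pip install frida-tools
# Disable SSL pinning script
# ssl_bypass.js
Java.perform(function () {
    var X509TrustManager = Java.use('javax.net.ssl.X509TrustManager');
    var SSLContext = Java.use('javax.net.ssl.SSLContext');
    // X509TrustManager is an interface with no method bodies to replace, so register a trust-all implementation and inject it through SSLContext.init
    var TrustAll = Java.registerClass({
        name: 'com.example.TrustAll',
        implements: [X509TrustManager],
        methods: {
            checkClientTrusted: function (chain, authType) {},
            checkServerTrusted: function (chain, authType) {},
            getAcceptedIssuers: function () { return []; }
        }
    });
    var init = SSLContext.init.overload('[Ljavax.net.ssl.KeyManager;', '[Ljavax.net.ssl.TrustManager;', 'java.security.SecureRandom');
    init.implementation = function (keyManagers, trustManagers, secureRandom) {
        init.call(this, keyManagers, [TrustAll.$new()], secureRandom);
    };
});
# Run
frida -U -f com.example.app -l ssl_bypass.js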
Approach 3: Emulator + Automation
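# Android emulator with UI automation
# Written for Appium-Python-Client 2.x or later, where the find_element_by_* helpers no longer exist
from appium import webdriver
from appium.options.android import UiAutomator2Options
from appium.webdriver.common.appiumby import AppiumBy
desired_caps = {
    'platformName': 'Android',
    'deviceName': 'emulator-5554',
    'appPackage': 'com.example.shop',
    'appActivity': '.MainActivity'
}
options = UiAutomator2Options().load_capabilities(desired_caps)
# Appium 1 servers listen under /wd/hub; Appium 2 servers default to http://localhost:4723
driver = webdriver.Remote('http://localhost:4723/wd/hub', options=options)
try:
    # Navigate and extract
    search_box = driver.find_element(AppiumBy.ID, 'search_input')
    search_box.send_keys('product')
    driver.find_element(AppiumBy.ID, 'search_button').click()
    # Get results
    results = driver.find_elements(AppiumBy.CLASS_NAME, 'product_item')
    for item in results:
        name = item.find_element(AppiumBy.ID, 'name').text
        price = item.find_element(AppiumBy.ID, 'price').text
        print(f"{name}: {price}")
finally:
    driver.quit()  # always release the session, even if a lookup fails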
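The `price` values read from the UI are display strings, so they usually need normalizing before they can be compared or stored. The sketch below assumes prices are shown in Vietnamese dong with dots as thousands separators and no decimal part (for example `₫125.000`); that format and the `parse_price` helper are assumptions for illustration, and the function would give wrong results for currencies with decimals.

```python
import re


def parse_price(text):
    """Return the price as an integer amount, or None if the text has no digits.

    Assumes separators are thousands separators only (no decimal part),
    as is common for VND prices such as '₫125.000'.
    """
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


if __name__ == "__main__":
    for raw in ["₫125.000", "1.250.000 đ", "Liên hệ"]:
        print(repr(raw), "->", parse_price(raw))
```

Returning `None` for text without digits keeps placeholder labels (such as a "contact us" price) from being silently stored as zero.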
Approach 4: Reverse Engineer APK
# Decompile the APK to find API endpoints
# 1. Install apktool
# 2. Decompile
apktool d app.apk -o output/
# 3. Search for API URLs
grep -r "api" output/
grep -r "https://" output/
# 4. Analyze the Java code with jadx
jadx app.apk -d jadx_output/
# Look for:
# - API base URLs
# - Authentication methods
# - Request signatures
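The `grep` searches above tend to produce a lot of noise. As a rough alternative, the sketch below walks a decompiled directory and collects each distinct URL together with the files that mention it. The `find_urls` helper and the URL regular expression are illustrative assumptions; the pattern is deliberately simple and will miss URLs that the app assembles at runtime from separate strings.

```python
import re
from pathlib import Path

# Simple pattern: scheme followed by anything up to whitespace, a quote,
# an angle bracket, or a backslash.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>\\]+")


def find_urls(root):
    """Map each distinct URL found under root to the sorted list of
    files (relative to root) that mention it."""
    found = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        # Decompiled output mixes text and binary files, so undecodable
        # bytes are ignored rather than treated as errors.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for url in URL_PATTERN.findall(text):
            found.setdefault(url, set()).add(str(path.relative_to(root)))
    return {url: sorted(files) for url, files in sorted(found.items())}
```

Calling `find_urls("output")` on the directory produced by `apktool d` returns a dictionary that can be filtered by host, which makes it easier to separate the app's own API base URLs from third-party SDK endpoints.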
App Store Scraping
# Google Play Store
from google_play_scraper import app, search
# Get app details
result = app('com.shopee.vn')
print(result['title'])
print(result['score'])
print(result['reviews'])
# Search apps
results = search('shopping', lang='vi', country='vn')
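# App Store (iOS)
# Use the iTunes Lookup API
import requests
response = requests.get(
    'https://itunes.apple.com/lookup',
    params={'id': '288419987', 'country': 'vn'},
    timeout=10
)
response.raise_for_status()
app_info = response.json()['results'][0]  # raises IndexError if the lookup returns no results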
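Indexing `['results'][0]` directly fails when the lookup returns nothing, for instance when the app is not available in the requested country. The sketch below separates building the lookup URL from reading the response so the empty case can be handled. The helper names and the `SAMPLE` payload are made up for illustration; `trackName`, `averageUserRating`, `userRatingCount`, and `version` are fields the lookup response commonly includes for apps, but they are read with `.get()` in case any are absent.

```python
from urllib.parse import urlencode

LOOKUP_URL = "https://itunes.apple.com/lookup"


def build_lookup_url(app_id, country="vn"):
    """Build the lookup URL for one app ID and storefront country."""
    return f"{LOOKUP_URL}?{urlencode({'id': app_id, 'country': country})}"


def summarize_lookup(payload):
    """Return selected fields of the first result, or None if the lookup is empty."""
    results = payload.get("results") or []
    if not results:
        return None
    first = results[0]
    return {
        "name": first.get("trackName"),
        "rating": first.get("averageUserRating"),
        "rating_count": first.get("userRatingCount"),
        "version": first.get("version"),
    }


# Made-up payload in the general shape of a lookup response (not real data).
SAMPLE = {
    "resultCount": 1,
    "results": [{"trackName": "Example App", "averageUserRating": 4.5,
                 "userRatingCount": 1200, "version": "1.0"}],
}

if __name__ == "__main__":
    print(build_lookup_url("288419987"))
    print(summarize_lookup(SAMPLE))
    print(summarize_lookup({"resultCount": 0, "results": []}))
```

In practice the `payload` argument would be the parsed JSON from the `requests.get` call above.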
Ethical Considerations
- Respect the app's Terms of Service
- Don't bypass security for malicious purposes
- Be careful with personal data
- Consider the legal implications
VinaProxy + Mobile Scraping
- Mobile proxy IPs for app APIs
- Bypass device fingerprinting
- Only $0.5/GB
