Mobile App Scraping: Collecting Data from Mobile Applications

A lot of data is available only inside mobile apps. This article walks through several ways to scrape data from mobile apps.

Challenges

  • No HTML to parse
  • Encrypted communications
  • Certificate pinning
  • Device fingerprinting
  • API authentication

Approach 1: Intercept API Traffic

# Use mitmproxy to capture API calls

# 1. Install mitmproxy
pip install mitmproxy

# 2. Start proxy
mitmproxy -p 8080

# 3. Configure phone to use proxy
# - Set proxy: your-ip:8080
# - Install mitmproxy certificate

# 4. Use the app - see API calls in mitmproxy

# 5. Export the API endpoints and replicate the calls in Python
import requests

# Captured API from mitmproxy
api_url = 'https://app-api.example.com/v1/products'
headers = {
    'Authorization': 'Bearer captured_token',
    'X-Device-Id': 'captured_device_id',
    'User-Agent': 'ExampleApp/1.0 Android/12'
}

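response = requests.get(api_url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 401/403, e.g. when the captured token has expired
data = response.json()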
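
Before replicating calls one by one, it can help to get an overview of which endpoints the app actually hit. The sketch below assumes the captured traffic has been exported to a HAR file (a JSON capture format); `summarize_har` and the sample capture are hypothetical names and data used only for illustration.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_har(har: dict) -> dict:
    """Map each 'host/path' seen in a HAR capture to the sorted HTTP methods used on it."""
    endpoints = defaultdict(set)
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        parts = urlsplit(request.get("url", ""))
        if not parts.netloc:
            continue  # skip malformed entries without a host
        # Drop the query string so ?page=1 and ?page=2 collapse into one endpoint.
        endpoints[f"{parts.netloc}{parts.path}"].add(request.get("method", "GET"))
    return {endpoint: sorted(methods) for endpoint, methods in endpoints.items()}


def summarize_har_file(path: str) -> dict:
    """Load a HAR file from disk and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_har(json.load(handle))


# Hypothetical capture: the product list requested twice, plus one login call.
sample_har = {
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://app-api.example.com/v1/products?page=1"}},
            {"request": {"method": "GET", "url": "https://app-api.example.com/v1/products?page=2"}},
            {"request": {"method": "POST", "url": "https://app-api.example.com/v1/login"}},
        ]
    }
}

summary = summarize_har(sample_har)
for endpoint, methods in sorted(summary.items()):
    print(endpoint, methods)
```

Grouping by path rather than full URL keeps paginated calls from drowning out the handful of distinct endpoints worth replicating.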

Approach 2: Frida (Advanced)

# Frida - dynamic instrumentation toolkit
# Bypass SSL pinning and hook functions

# Install
pip install frida-tools

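# SSL pinning bypass script: save the JavaScript below as ssl_bypass.js
Java.perform(function () {
    // X509TrustManager is an interface, so its methods cannot be hooked directly.
    // Register a permissive implementation and inject it into SSLContext.init instead.
    var X509TrustManager = Java.use('javax.net.ssl.X509TrustManager');
    var SSLContext = Java.use('javax.net.ssl.SSLContext');
    var TrustAll = Java.registerClass({
        name: 'com.example.TrustAll',
        implements: [X509TrustManager],
        methods: {
            checkClientTrusted: function (chain, authType) {},
            checkServerTrusted: function (chain, authType) {},
            getAcceptedIssuers: function () { return []; }
        }
    });
    var init = SSLContext.init.overload(
        '[Ljavax.net.ssl.KeyManager;', '[Ljavax.net.ssl.TrustManager;', 'java.security.SecureRandom');
    init.implementation = function (keyManagers, trustManagers, secureRandom) {
        init.call(this, keyManagers, [TrustAll.$new()], secureRandom);
    };
    // Apps that pin through other mechanisms (e.g. OkHttp's CertificatePinner) need additional hooks.
});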

# Run
frida -U -f com.example.app -l ssl_bypass.js

Approach 3: Emulator + Automation

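# Android emulator with UI automation
# Written against the options-based API of recent Appium-Python-Client releases,
# where the find_element_by_* helpers and the desired_capabilities argument are gone.

from appium import webdriver
from appium.options.android import UiAutomator2Options
from appium.webdriver.common.appiumby import AppiumBy

options = UiAutomator2Options().load_capabilities({
    'platformName': 'Android',
    'deviceName': 'emulator-5554',
    'appPackage': 'com.example.shop',
    'appActivity': '.MainActivity'
})

# Appium 1.x serves under /wd/hub; an Appium 2 server defaults to http://localhost:4723
driver = webdriver.Remote('http://localhost:4723/wd/hub', options=options)

try:
    # Navigate and extract
    search_box = driver.find_element(AppiumBy.ID, 'search_input')
    search_box.send_keys('product')
    driver.find_element(AppiumBy.ID, 'search_button').click()

    # Get results
    results = driver.find_elements(AppiumBy.CLASS_NAME, 'product_item')
    for item in results:
        name = item.find_element(AppiumBy.ID, 'name').text
        price = item.find_element(AppiumBy.ID, 'price').text
        print(f"{name}: {price}")
finally:
    # Always release the session, even if an element lookup fails
    driver.quit()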
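
Text read from UI elements arrives as display strings, so prices usually need normalizing before they can be stored or compared. A minimal sketch follows, assuming whole-number prices that use `.` or `,` only as thousands separators (as is common for VND labels); `parse_price` is a hypothetical helper, not part of Appium.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the first integer amount in a label such as '₫1.200.000', or None if there is none.

    Assumes separators are thousands separators only; decimal prices would need different handling.
    """
    match = re.search(r"\d[\d.,\s]*", text)
    if match is None:
        return None
    # Keep only the digits of the matched run, dropping separators and spaces.
    return int(re.sub(r"\D", "", match.group()))


for label in ["₫1.200.000", "1,200,000 đ", "Liên hệ"]:
    print(repr(label), "->", parse_price(label))
```

Labels without any digits (for example a "contact us" placeholder) come back as `None`, which makes them easy to filter out downstream.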

Approach 4: Reverse Engineer APK

# Decompile the APK to find API endpoints

# 1. Install apktool
# 2. Decompile
apktool d app.apk -o output/

# 3. Search for API URLs
grep -r "api" output/
grep -r "https://" output/

# 4. Analyze the Java code with jadx
jadx app.apk -d jadx_output/

# Look for:
# - API base URLs
# - Authentication methods
# - Request signatures
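
The grep step tends to produce a lot of noise on a real decompiled tree. As a rough alternative, the sketch below walks the output directory, collects unique URLs, and records which files mention them. The file suffixes, the ignored Android XML namespace prefix, and the `find_urls` helper are assumptions chosen for illustration; the demo runs on a temporary directory standing in for apktool's `output/` folder.

```python
import re
import tempfile
from pathlib import Path

URL_PATTERN = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&*+,;=%-]+")
TEXT_SUFFIXES = {".smali", ".xml", ".java", ".json", ".txt", ".properties"}
IGNORED_PREFIXES = ("http://schemas.android.com/",)  # XML namespace URIs, not API endpoints


def find_urls(root: str) -> dict:
    """Map each URL found under `root` to the sorted list of files that mention it."""
    hits = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue  # skip directories and binary resources such as images
        text = path.read_text(encoding="utf-8", errors="ignore")
        for raw in URL_PATTERN.findall(text):
            url = raw.rstrip(".,;")  # trim trailing punctuation picked up from prose or code
            if url.startswith(IGNORED_PREFIXES):
                continue
            hits.setdefault(url, set()).add(path.relative_to(root).as_posix())
    return {url: sorted(files) for url, files in sorted(hits.items())}


# Demo on a throwaway directory with two hypothetical decompiled files.
with tempfile.TemporaryDirectory() as tmp:
    smali_dir = Path(tmp) / "smali"
    smali_dir.mkdir()
    (smali_dir / "Api.smali").write_text(
        'const-string v0, "https://app-api.example.com/v1/products"\n', encoding="utf-8"
    )
    (Path(tmp) / "AndroidManifest.xml").write_text(
        '<manifest xmlns:android="http://schemas.android.com/apk/res/android"/>\n',
        encoding="utf-8",
    )
    demo_result = find_urls(tmp)

for url, files in demo_result.items():
    print(url, files)
```

On a real APK, `find_urls("output/")` would be the equivalent call; knowing which file references a URL gives a starting point for locating the surrounding authentication and signing code in jadx.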

App Store Scraping

# Google Play Store
from google_play_scraper import app, search

# Get app details
result = app('com.shopee.vn')
print(result['title'])
print(result['score'])
print(result['reviews'])

# Search apps
results = search('shopping', lang='vi', country='vn')

# App Store (iOS)
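# Use the iTunes Lookup API
import requests

response = requests.get(
    'https://itunes.apple.com/lookup',
    params={'id': '288419987', 'country': 'vn'},
    timeout=10
)
response.raise_for_status()
results = response.json()['results']
app_info = results[0] if results else None  # the list is empty when the ID is not found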
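
The lookup response carries many fields, and an unknown ID returns an empty result list rather than an error. The sketch below uses only the standard library and separates URL building, parsing, and the network call so the parsing can be checked offline. The helper names are hypothetical; the field names (`trackName`, `bundleId`, `averageUserRating`, `userRatingCount`) are ones the lookup endpoint commonly returns, though not every app is guaranteed to have all of them.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

LOOKUP_URL = "https://itunes.apple.com/lookup"


def build_lookup_url(app_id: str, country: str = "vn") -> str:
    """Build the lookup URL for one app ID and storefront country."""
    return f"{LOOKUP_URL}?{urlencode({'id': app_id, 'country': country})}"


def first_result(payload: dict) -> Optional[dict]:
    """Return selected fields of the first lookup result, or None when nothing matched."""
    results = payload.get("results") or []
    if not results:
        return None
    app = results[0]
    return {
        "name": app.get("trackName"),
        "bundle_id": app.get("bundleId"),
        "rating": app.get("averageUserRating"),
        "rating_count": app.get("userRatingCount"),
    }


def lookup_app(app_id: str, country: str = "vn", timeout: float = 10.0) -> Optional[dict]:
    """Fetch and parse one app's details (performs a network request)."""
    with urlopen(build_lookup_url(app_id, country), timeout=timeout) as response:
        return first_result(json.load(response))


# Offline check with a hypothetical payload shaped like a lookup response.
sample_payload = {
    "resultCount": 1,
    "results": [{"trackName": "Demo", "bundleId": "com.example.demo",
                 "averageUserRating": 4.5, "userRatingCount": 10}],
}
print(first_result(sample_payload))
```

Calling `lookup_app('288419987')` would perform the same request as the snippet above and return either the reduced dictionary or `None`.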

Ethical Considerations

  • Respect the app's Terms of Service
  • Don't bypass security controls for malicious purposes
  • Be careful with personal data
  • Consider the legal implications

VinaProxy + Mobile Scraping

  • Mobile proxy IPs for app APIs
  • Bypass device fingerprinting
  • Only $0.5/GB
