Docker for Web Scraping: Containerizing Scrapers
Docker packages a scraper together with all of its dependencies, so it runs consistently in every environment.
Why Use Docker?
- Reproducibility: the same environment everywhere
- Isolation: no dependency conflicts
- Scaling: easy to scale out with multiple containers
- Deployment: easy to deploy to the cloud
A Basic Dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper.py .
CMD ["python", "scraper.py"]
requirements.txt
requests==2.31.0
beautifulsoup4==4.12.2
lxml==5.1.0
Build and Run
# Build image
docker build -t my-scraper .
# Run container
docker run my-scraper
# Run with environment variables
docker run -e PROXY_URL=http://proxy.vinaproxy.com:8080 my-scraper
# Mount output directory
docker run -v "$(pwd)/output:/app/output" my-scraper
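The commands above pass PROXY_URL into the container and mount /app/output, but the article does not show scraper.py itself. Here is a minimal sketch of how it might pick these up, assuming the proxy applies to both HTTP and HTTPS. The TARGET_URL and OUTPUT_DIR variables, the results.json file name, and all function names are hypothetical.

```python
import json
import os
from pathlib import Path


def proxy_settings(env=os.environ):
    """Build a requests-style proxies mapping from PROXY_URL, or None if unset."""
    proxy_url = env.get("PROXY_URL")
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def save_results(rows, output_dir):
    """Write scraped rows as JSON into the (mounted) output directory."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "results.json"  # hypothetical file name
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    return path


def scrape(url, proxies=None):
    """Fetch one page and return its URL and <title> text."""
    # Third-party imports live here so the helpers above work without them.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, proxies=proxies, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    title = soup.title.get_text(strip=True) if soup.title else ""
    return {"url": url, "title": title}


def main():
    # TARGET_URL and OUTPUT_DIR are hypothetical variables, not from the article.
    target = os.environ.get("TARGET_URL", "https://example.com")
    row = scrape(target, proxies=proxy_settings())
    print(save_results([row], os.environ.get("OUTPUT_DIR", "/app/output")))
```

Calling main() at the bottom of scraper.py would make this the container's entry point. Because the output path sits inside the mounted volume, results.json survives after the container exits.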
Dockerfile with Selenium
FROM python:3.11-slim
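# Install Chrome (apt-key is deprecated, so the signing key goes into its own keyring)
RUN apt-get update && apt-get install -y \
        wget \
        gnupg \
    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub \
        | gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg \
    && echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] https://dl.google.com/linux/chrome/deb/ stable main" \
        > /etc/apt/sources.list.d/google-chrome.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*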
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "scraper.py"]
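Chrome usually needs a few extra flags to start inside a container. The sketch below shows one way scraper.py might configure it. It assumes the selenium package has been added to requirements.txt (the article's list does not include it) and that PROXY_URL carries no username or password, since Chrome's --proxy-server flag does not accept credentials. The function names are hypothetical.

```python
import os

# Flags commonly needed to run Chrome inside a container.
CONTAINER_CHROME_FLAGS = [
    "--headless=new",           # there is no display server in the container
    "--no-sandbox",             # Chrome's sandbox refuses to start as root
    "--disable-dev-shm-usage",  # Docker's default /dev/shm is only 64 MB
]


def chrome_flags(env=os.environ):
    """Return the Chrome flags, adding the proxy from PROXY_URL if it is set."""
    flags = list(CONTAINER_CHROME_FLAGS)
    proxy_url = env.get("PROXY_URL")
    if proxy_url:
        flags.append(f"--proxy-server={proxy_url}")
    return flags


def make_driver():
    """Create a Chrome WebDriver configured for container use."""
    # Imported lazily: selenium is an assumption, not part of the
    # article's requirements.txt.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in chrome_flags():
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

Recent Selenium 4 releases can locate or download a matching chromedriver on their own, so the Dockerfile above should not need a separate driver install. Treat that as something to verify against the pinned Selenium version.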
Docker Compose
# docker-compose.yml
version: '3.8'
services:
  scraper:
    build: .
    environment:
      - PROXY_URL=http://proxy.vinaproxy.com:8080
    volumes:
      - ./output:/app/output
  redis:
    image: redis:alpine
  scheduler:
    build: .
    command: python scheduler.py
    depends_on:
      - redis
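The Compose file starts scheduler.py next to a Redis service, but the article does not show how the two talk. Inside a Compose network, containers reach each other by service name, so Redis is available at host redis on its default port 6379. The following is a rough sketch of a URL queue shared between scheduler and scraper. It assumes the redis-py package is added to requirements.txt; the queue name, the REDIS_URL variable, and the function names are hypothetical.

```python
import os

QUEUE_KEY = "scrape:urls"  # hypothetical queue name


def redis_url(env=os.environ):
    """Return the Redis connection URL, defaulting to the Compose service."""
    # "redis" is the Compose service name, resolved by Docker's internal DNS.
    return env.get("REDIS_URL", "redis://redis:6379/0")


def enqueue(client, urls):
    """Push URLs onto the shared queue; returns how many were pushed."""
    urls = list(urls)
    if urls:
        client.rpush(QUEUE_KEY, *urls)
    return len(urls)


def next_url(client, timeout=5):
    """Wait up to `timeout` seconds for a URL; None if the queue stays empty."""
    item = client.blpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return None
    _key, value = item
    # redis-py returns bytes unless decode_responses=True was set.
    return value.decode() if isinstance(value, bytes) else value


def connect():
    """Open a Redis connection (requires the redis-py package)."""
    import redis  # assumed dependency, not in the article's requirements.txt

    return redis.Redis.from_url(redis_url())
```

With this split, scheduler.py would call enqueue(connect(), urls) and each scraper container would loop on next_url. Several scraper containers can then drain the same queue without fetching the same URL twice, because BLPOP hands each item to exactly one client.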
Multi-Stage Build
# Smaller final image
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY scraper.py .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "scraper.py"]
Tips
- Use slim/alpine images to reduce image size
- Use multi-stage builds for production
- Don't hardcode secrets; use env vars
- Mount volumes so output persists
VinaProxy + Docker
- Pass the proxy through environment variables
- Scale containers with rotating IPs
- Priced at just $0.5/GB
