Docker for Web Scraping: Containerizing Scrapers

Docker packages a scraper together with all of its dependencies, so it runs consistently in every environment.

Why Use Docker?

  • Reproducibility: the same environment everywhere
  • Isolation: no dependency conflicts with the host or with other projects
  • Scaling: easy to scale out by running multiple containers
  • Deployment: easy to deploy to the cloud

A Basic Dockerfile

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY scraper.py .

CMD ["python", "scraper.py"]

requirements.txt

requests==2.31.0
beautifulsoup4==4.12.2
lxml==5.1.0

Build and Run

# Build image
docker build -t my-scraper .

# Run container
docker run my-scraper

# Run with environment variables
docker run -e PROXY_URL=http://proxy.vinaproxy.com:8080 my-scraper

# Mount the output directory (quoted so paths containing spaces still work)
docker run -v "$(pwd)/output:/app/output" my-scraper
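
How scraper.py consumes the PROXY_URL variable and the mounted output directory is not shown here. The following is a minimal sketch under assumptions: the helper names proxies_from_env and save_results, the results.json filename, and the OUTPUT_DIR variable are hypothetical; only PROXY_URL and /app/output come from the commands above.

```python
import json
import os
from pathlib import Path


def proxies_from_env():
    """Build a proxies mapping from PROXY_URL, or None when it is unset."""
    proxy_url = os.environ.get("PROXY_URL")
    if not proxy_url:
        return None
    # Same proxy for both schemes; this is the dict shape `requests` accepts via proxies=.
    return {"http": proxy_url, "https": proxy_url}


def save_results(items, output_dir=None):
    """Write scraped items as JSON into the (mounted) output directory."""
    # OUTPUT_DIR is a hypothetical override; /app/output matches the -v mount above.
    target = Path(output_dir or os.environ.get("OUTPUT_DIR", "/app/output"))
    target.mkdir(parents=True, exist_ok=True)
    path = target / "results.json"
    path.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Inside scraper.py the mapping would typically be passed along as requests.get(url, proxies=proxies_from_env(), timeout=10). Because the file is written under /app/output, it stays on the host after the container exits.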

Dockerfile with Selenium

FROM python:3.11-slim

# Install Chrome (apt-key is deprecated, so the signing key goes into a dedicated keyring)
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && wget -q -O - https://dl.google.com/linux/linux_signing_key.pub \
        | gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg \
    && echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "scraper.py"]
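
Chrome inside a container usually needs a few extra flags before Selenium can start it. The sketch below assumes selenium 4.6 or newer has been added to requirements.txt (it is not in the list shown earlier), so that Selenium Manager can locate a matching driver. The names CONTAINER_CHROME_ARGS, build_chrome_args, and make_driver are hypothetical.

```python
# Flags commonly needed for Chrome inside a Docker container.
CONTAINER_CHROME_ARGS = [
    "--headless=new",           # there is no display server in the container
    "--no-sandbox",             # Chrome's sandbox will not start as root, the image's default user
    "--disable-dev-shm-usage",  # Docker's default /dev/shm is only 64 MB; use /tmp instead
]


def build_chrome_args(proxy_url=None):
    """Return Chrome flags, adding --proxy-server when a proxy URL is given."""
    args = list(CONTAINER_CHROME_ARGS)
    if proxy_url:
        args.append(f"--proxy-server={proxy_url}")
    return args


def make_driver(proxy_url=None):
    """Create a headless Chrome driver (assumes the `selenium` package is installed)."""
    # Imported here so the flag helper above stays usable without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(proxy_url):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

A scraper could then call make_driver(os.environ.get("PROXY_URL")), load pages with driver.get(...), and finish with driver.quit() so the Chrome process does not outlive the job.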

Docker Compose

# docker-compose.yml
version: '3.8'

services:
  scraper:
    build: .
    environment:
      - PROXY_URL=http://proxy.vinaproxy.com:8080
    volumes:
      - ./output:/app/output
    
  redis:
    image: redis:alpine
    
  scheduler:
    build: .
    command: python scheduler.py
    depends_on:
      - redis
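
The Compose file refers to a scheduler.py without showing it. One way a scheduler and a scraper could share work through the redis service is a simple URL queue. This is a sketch under assumptions, not the article's implementation: it assumes the redis Python package is added to requirements.txt, and the queue name scrape:urls, the REDIS_HOST variable, and the function names are hypothetical.

```python
import os

QUEUE_KEY = "scrape:urls"  # hypothetical queue name


def connect():
    """Connect to the Compose `redis` service (assumes the `redis` package is installed)."""
    import redis  # imported lazily so the queue helpers work with any compatible client

    # Inside the Compose network the service name doubles as the hostname.
    host = os.environ.get("REDIS_HOST", "redis")
    return redis.Redis(host=host, port=6379, decode_responses=True)


def enqueue_urls(client, urls):
    """Scheduler side: append URLs to the tail of the queue."""
    for url in urls:
        client.rpush(QUEUE_KEY, url)


def next_url(client):
    """Scraper side: pop the oldest URL, or None when the queue is empty."""
    return client.lpop(QUEUE_KEY)
```

With this layout scheduler.py would call enqueue_urls(connect(), urls) on its schedule, while each scraper container loops on next_url(connect()). Running several scraper containers against the same queue spreads the URLs across them.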

Multi-Stage Build

# Smaller final image
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY scraper.py .
ENV PATH=/root/.local/bin:$PATH
CMD ["python", "scraper.py"]

Tips

  • Use slim or alpine base images to reduce image size
  • Use multi-stage builds for production
  • Don't hardcode secrets; pass them in as environment variables
  • Mount volumes so output persists after the container exits

VinaProxy + Docker

  • Pass the proxy in through environment variables
  • Scale containers with rotating IPs
  • Priced at only $0.5/GB
