Scraping Jobs With Python: Collecting Job Listings


Job boards hold valuable data for recruiters and job seekers alike. This article walks through scraping job listings from recruitment sites.

Use Cases

  • Job aggregation: combine listings from multiple sources
  • Salary research: analyze pay ranges
  • Market demand: see which skills are in demand
  • Competitive intel: who is hiring for what

Popular Vietnamese Job Boards

  • vietnamworks.com
  • topcv.vn
  • careerbuilder.vn
  • itviec.com (IT jobs)
  • linkedin.com/jobs

Data Points

  • Job title
  • Company name
  • Location
  • Salary range
  • Requirements
  • Posted date
  • Job type (full-time, remote)
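These fields can be pinned down in a small schema before writing any scraping code; here is a sketch using a `TypedDict` (the field and type choices are illustrative, mirroring the list above):

```python
from typing import Optional, TypedDict

class Salary(TypedDict):
    min: int  # VND per month
    max: int  # VND per month

class JobListing(TypedDict):
    title: str
    company: str
    location: str
    salary: Optional[Salary]  # None when the posting says "negotiable"
    requirements: str
    posted: str
    job_type: str             # e.g. 'full-time', 'remote'
    url: str

# Example record conforming to the schema
job: JobListing = {
    'title': 'Python Developer', 'company': 'ACME', 'location': 'Hà Nội',
    'salary': {'min': 15_000_000, 'max': 25_000_000},
    'requirements': '2+ years Python', 'posted': '2024-01-01',
    'job_type': 'full-time', 'url': 'https://example.com/job/1',
}
```

A fixed schema makes it obvious when a site's markup changes and a field silently comes back empty.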

Scraper Code

import re

import requests
from bs4 import BeautifulSoup

def scrape_job(url):
    response = requests.get(url, headers={'User-Agent': '...'}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')

    def text_of(selector):
        # Selectors vary by site; return '' instead of crashing on a miss
        el = soup.select_one(selector)
        return el.get_text(strip=True) if el else ''

    return {
        'title': text_of('h1.job-title'),
        'company': text_of('.company-name'),
        'location': text_of('.location'),
        'salary': extract_salary(text_of('.salary')),
        'description': text_of('.job-description'),
        'requirements': text_of('.requirements'),
        'posted': text_of('.posted-date'),
        'url': url
    }

def extract_salary(text):
    # "15-25 triệu" (millions of VND) -> {'min': 15000000, 'max': 25000000}
    # "Thỏa thuận" means "negotiable": no structured salary to extract
    if not text or 'thỏa thuận' in text.lower():
        return None

    numbers = re.findall(r'\d+', text)
    if len(numbers) >= 2:
        return {
            'min': int(numbers[0]) * 1_000_000,
            'max': int(numbers[1]) * 1_000_000
        }
    return None

Scrape Job Listings

import time
from urllib.parse import urljoin

def scrape_job_board(keyword, pages=5):
    jobs = []

    for page in range(1, pages + 1):
        url = f'https://topcv.vn/tim-viec-lam-{keyword}?page={page}'
        response = requests.get(url, headers={'User-Agent': '...'})
        soup = BeautifulSoup(response.text, 'lxml')

        items = soup.select('.job-item')
        for item in items:
            # Listing links are often relative; resolve against the page URL
            job_url = urljoin(url, item.select_one('a')['href'])
            jobs.append(scrape_job(job_url))
            time.sleep(1)  # throttle: one detail page per second

        print(f"Page {page}: found {len(items)} jobs")

    return jobs

# Search for Python jobs
python_jobs = scrape_job_board('python', pages=10)

Skills Extraction

common_skills = [
    'python', 'javascript', 'react', 'node.js', 
    'sql', 'aws', 'docker', 'kubernetes'
]

def extract_skills(description):
    description = description.lower()
    # Match on word boundaries so 'sql' does not also hit 'nosql', etc.
    return [skill for skill in common_skills
            if re.search(r'\b' + re.escape(skill) + r'\b', description)]

# Analyze skill demand
from collections import Counter

all_skills = []
for job in python_jobs:
    skills = extract_skills(job['description'])
    all_skills.extend(skills)

skill_counts = Counter(all_skills)
print(skill_counts.most_common(10))
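Raw counts are easier to compare as a share of all scraped postings. A minimal sketch of that step (the counts and total here are made-up):

```python
from collections import Counter

total_jobs = 40  # hypothetical number of scraped postings
skill_counts = Counter({'python': 40, 'sql': 22, 'docker': 15, 'aws': 9})

# Print each skill as a percentage of postings that mention it
for skill, count in skill_counts.most_common():
    print(f'{skill}: {count / total_jobs:.0%} of postings')
```

Percentages stay comparable across scraping runs even when the number of pages you crawl changes.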

Salary Analysis

import pandas as pd

df = pd.DataFrame(python_jobs)
df['salary_avg'] = df['salary'].apply(
    lambda x: (x['min'] + x['max']) / 2 if x else None
)

# Average salary by location
print(df.groupby('location')['salary_avg'].mean())
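Once the DataFrame is assembled it is worth persisting, so later analysis does not require re-scraping. A self-contained sketch with toy rows standing in for scraped listings (the filename is arbitrary):

```python
import pandas as pd

# Toy rows standing in for scraped listings; salary may be None ("negotiable")
jobs = [
    {'title': 'Python Dev', 'location': 'Hà Nội',
     'salary': {'min': 15_000_000, 'max': 25_000_000}},
    {'title': 'Data Engineer', 'location': 'Hồ Chí Minh', 'salary': None},
]

df = pd.DataFrame(jobs)
df['salary_avg'] = df['salary'].apply(
    lambda x: (x['min'] + x['max']) / 2 if x else None
)
df.to_csv('jobs.csv', index=False)  # persist for later analysis
print(df[['title', 'location', 'salary_avg']])
```

Note that rows with a negotiable salary come out as NaN in `salary_avg`, and pandas skips them automatically in aggregations like `mean()`.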

VinaProxy + Job Scraping

  • Scrape job boards without getting blocked
  • Geo-targeting for local listings
  • Pricing from just $0.5/GB

Try It Now →