Scraping Jobs With Python: Collecting Job Listings
Job boards hold valuable data for recruiters and job seekers. This article shows how to scrape job listings from recruitment sites.
Use Cases
- Job aggregation: combine listings from multiple sources
- Salary research: analyze salary levels
- Market demand: see which skills are in demand
- Competitive intel: see who is hiring for what
Job Sites in Vietnam
- vietnamworks.com
- topcv.vn
- careerbuilder.vn
- itviec.com (IT jobs)
- linkedin.com/jobs
Data Points
- Job title
- Company name
- Location
- Salary range
- Requirements
- Posted date
- Job type (full-time, remote)
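The data points above can be pinned down as a small record type before any scraping code is written. The sketch below is one possible shape, not something taken from any job site: the class name `JobListing`, the field names, and the choice to store salary as two integers in VND are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobListing:
    """One scraped job posting (hypothetical schema for this article)."""
    title: str
    company: str
    location: str
    url: str
    salary_min: Optional[int] = None  # assumed to be VND; None if not disclosed
    salary_max: Optional[int] = None
    requirements: str = ""
    posted: str = ""                  # raw date text as shown on the page
    job_type: str = ""                # e.g. "full-time", "remote"


# Example values are made up.
example = JobListing(
    title="Python Developer",
    company="Example Co",
    location="Ho Chi Minh City",
    url="https://example.com/jobs/1",
    salary_min=15_000_000,
    salary_max=25_000_000,
)
record = asdict(example)  # plain dict, convenient for CSV or DataFrame export
```

Keeping optional fields explicit makes it obvious later which listings lack a salary or a posted date.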
Scraper Code
import re

import requests
from bs4 import BeautifulSoup


def get_text(soup, selector):
    # Return the text of the first match, or None if the selector finds nothing.
    element = soup.select_one(selector)
    return element.get_text(' ', strip=True) if element else None


def scrape_job(url):
    # The CSS selectors are illustrative; check them against the target site's markup.
    response = requests.get(url, headers={'User-Agent': '...'}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    return {
        'title': get_text(soup, 'h1.job-title'),
        'company': get_text(soup, '.company-name'),
        'location': get_text(soup, '.location'),
        'salary': extract_salary(get_text(soup, '.salary')),
        'description': get_text(soup, '.job-description'),
        'requirements': get_text(soup, '.requirements'),
        'posted': get_text(soup, '.posted-date'),
        'url': url
    }


def extract_salary(text):
    # "15-25 triệu" (15-25 million VND) -> {'min': 15000000, 'max': 25000000}
    # "thỏa thuận" means "negotiable", so there is no range to parse.
    if not text or 'thỏa thuận' in text.lower():
        return None
    numbers = re.findall(r'\d+', text)
    if len(numbers) >= 2:
        return {
            'min': int(numbers[0]) * 1_000_000,
            'max': int(numbers[1]) * 1_000_000
        }
    return None
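Salary text on real listings is rarely limited to a clean "15-25 triệu" range. The sketch below extends the same idea to a few other shapes: open-ended amounts ("Trên 20 triệu", meaning "above 20 million"; "Tới 30 triệu", meaning "up to 30 million") and USD ranges with thousands separators. The function name `parse_salary`, the keyword lists, and the rule that any amount without a USD marker is in millions of VND are assumptions for illustration, not behavior confirmed for any particular site.

```python
import re
from typing import Optional


def parse_salary(text: Optional[str]) -> Optional[dict]:
    """Parse salary text into {'min', 'max', 'currency'}; None if undisclosed."""
    if not text:
        return None
    lowered = text.lower()
    if "thỏa thuận" in lowered or "negotiable" in lowered:
        return None

    # Drop thousands separators so "1,000" is read as 1000.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", lowered)]
    if not numbers:
        return None

    if "usd" in lowered or "$" in lowered:
        currency, scale = "USD", 1
    else:
        # Assumption: amounts without a USD marker are millions of VND.
        currency, scale = "VND", 1_000_000

    if len(numbers) >= 2:
        low, high = numbers[0], numbers[1]
    elif "trên" in lowered or "from" in lowered:
        low, high = numbers[0], None   # lower bound only
    elif "tới" in lowered or "up to" in lowered:
        low, high = None, numbers[0]   # upper bound only
    else:
        low = high = numbers[0]        # single fixed figure

    return {
        "min": low * scale if low is not None else None,
        "max": high * scale if high is not None else None,
        "currency": currency,
    }
```

Decimal amounts such as "15.5 triệu" are not handled here and would be misread as two numbers, so real data should be spot-checked before trusting the output.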
Scrape Job Listings
import time
from urllib.parse import urljoin


def scrape_job_board(keyword, pages=5):
    jobs = []
    for page in range(1, pages + 1):
        url = f'https://topcv.vn/tim-viec-lam-{keyword}?page={page}'
        response = requests.get(url, headers={'User-Agent': '...'}, timeout=10)
        soup = BeautifulSoup(response.text, 'lxml')
        items = soup.select('.job-item')
        for item in items:
            link = item.select_one('a[href]')
            if link is None:
                continue
            # Resolve relative links against the listing page URL.
            jobs.append(scrape_job(urljoin(url, link['href'])))
            time.sleep(1)  # pause between requests to limit load on the site
        print(f"Page {page}: found {len(items)} jobs")
    return jobs


# Search for Python jobs
python_jobs = scrape_job_board('python', pages=10)
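When listings are collected across many pages or sources, the same posting often appears more than once. A minimal sketch of deduplicating by URL and exporting to CSV with only the standard library follows. The helper names `dedupe_jobs` and `write_jobs_csv` are hypothetical, and stripping the query string assumes the job ID lives in the URL path, which is not true of every site.

```python
import csv
import io


def dedupe_jobs(jobs):
    """Keep the first occurrence of each job URL, preserving order."""
    seen = set()
    unique = []
    for job in jobs:
        # Assumption: query parameters are tracking noise, not part of the job ID.
        key = job["url"].split("?")[0].rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


def write_jobs_csv(jobs, fileobj, fields=("title", "company", "location", "url")):
    """Write the selected fields of each job dict as CSV rows."""
    writer = csv.DictWriter(fileobj, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(jobs)


# Made-up sample data standing in for scraped results.
sample = [
    {"title": "Python Dev", "company": "A", "location": "Hanoi",
     "url": "https://example.com/job/1?ref=search"},
    {"title": "Python Dev", "company": "A", "location": "Hanoi",
     "url": "https://example.com/job/1"},
    {"title": "Data Engineer", "company": "B", "location": "Da Nang",
     "url": "https://example.com/job/2"},
]
unique_jobs = dedupe_jobs(sample)
buffer = io.StringIO()
write_jobs_csv(unique_jobs, buffer)
```

To write to disk instead of memory, pass a file opened with `open('jobs.csv', 'w', newline='', encoding='utf-8')` in place of the `StringIO` buffer.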
Skills Extraction
from collections import Counter

common_skills = [
    'python', 'javascript', 'react', 'node.js',
    'sql', 'aws', 'docker', 'kubernetes'
]


def extract_skills(description):
    # Plain substring match: simple, but 'sql' also matches inside 'mysql'.
    description = (description or '').lower()
    return [skill for skill in common_skills if skill in description]


# Analyze skill demand
all_skills = []
for job in python_jobs:
    all_skills.extend(extract_skills(job['description']))

skill_counts = Counter(all_skills)
print(skill_counts.most_common(10))
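Substring matching over-counts: "sql" is found inside "MySQL" and "react" inside "reactive". One way to tighten this, sketched below, is to require that a skill is not directly preceded or followed by a word character. The names `SKILL_PATTERNS` and `extract_skills_strict` are hypothetical, and whether "MySQL" should count toward "sql" is a judgment call this sketch answers with "no".

```python
import re

SKILLS = [
    "python", "javascript", "react", "node.js",
    "sql", "aws", "docker", "kubernetes",
]

# Lookarounds reject matches embedded in a longer word. re.escape keeps the
# dot in "node.js" literal instead of letting it match any character.
SKILL_PATTERNS = {
    skill: re.compile(r"(?<!\w)" + re.escape(skill) + r"(?!\w)", re.IGNORECASE)
    for skill in SKILLS
}


def extract_skills_strict(description):
    """Return the skills that appear as whole tokens, in SKILLS order."""
    return [skill for skill, pattern in SKILL_PATTERNS.items()
            if pattern.search(description)]


strict = extract_skills_strict(
    "Experience with MySQL and Node.js; reactive programming a plus"
)
```

The trade-off is that variants such as "python3" or "PostgreSQL" no longer match, so aliases would need to be listed explicitly if they matter for the analysis.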
Salary Analysis
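import pandas as pd

df = pd.DataFrame(python_jobs)
# Midpoint of each salary range; jobs without a parsed salary get a missing value.
df['salary_avg'] = df['salary'].apply(
    lambda x: (x['min'] + x['max']) / 2 if x else None
)

# Average salary by location (missing values are skipped by mean())
print(df.groupby('location')['salary_avg'].mean())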
VinaProxy + Job Scraping
- Scrape job boards without getting blocked
- Geo-targeting for local jobs
- Priced at only $0.5/GB
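Routing requests through a proxy with the `requests` library comes down to passing a `proxies` mapping. The sketch below only builds that mapping. The host, port, and credentials are placeholders and not real VinaProxy settings, and the helper name `build_proxies` is hypothetical; the actual endpoint format should come from the provider's documentation.

```python
from urllib.parse import quote


def build_proxies(host, port, username, password):
    """Return a proxies mapping in the shape requests expects."""
    # Percent-encode credentials so characters like "@" or ":" do not break the URL.
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values; substitute the ones issued by your proxy provider.
proxies = build_proxies("proxy.example.com", 8080, "user", "p@ss:word")
```

The result can then be passed along with the existing arguments, for example `requests.get(url, headers={'User-Agent': '...'}, proxies=proxies, timeout=10)`.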
