You'll need the requests and bs4 packages. Start with the imports:
import requests
from bs4 import BeautifulSoup
Next Page: The start= parameter gets added and incremented by the value of 10 for each additional page, because each results page displays 10 job results. E.g.: https://www.indeed.com/jobs?q=python&l=new+york&start=20
Different Location/Job Title: The values for the query parameters q (for job title) and l (for location) change accordingly.
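The URL pattern described above can be assembled programmatically. Here's a small sketch using the standard library's urllib.parse.urlencode, which also takes care of replacing spaces with + for you (the build_search_url helper is just for illustration, not part of the tutorial code):

```python
from urllib.parse import urlencode

def build_search_url(query, location, page=0):
    """Build an Indeed search URL for the given query, location, and page.

    Each results page shows 10 jobs, so page N starts at result N * 10.
    """
    params = {'q': query, 'l': location, 'start': page * 10}
    return f'https://www.indeed.com/jobs?{urlencode(params)}'

build_search_url('python', 'new york', page=2)
# 'https://www.indeed.com/jobs?q=python&l=new+york&start=20'
```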
page = requests.get('https://www.indeed.com/jobs?q=python&l=new+york')
HTML Elements: A single job posting lives inside of a div element with the class name result. Inside there are other elements. You can find the specific info you're looking for here:
Link URL: the href attribute of the <a> element that is a child of the title <h2> element
Title: the title <h2> element, which also contains the link URL mentioned above
Location: a <span> element with the telling class name location
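To make the structure concrete, here's a minimal sketch that parses a made-up snippet mirroring the markup described above (the snippet itself is invented for illustration; the live page's markup may differ):

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet mirroring the structure described above
html = '''
<div class="result">
  <h2 class="title"><a href="/rc/clk?jk=123">Python Developer</a></h2>
  <span class="location">New York, NY</span>
</div>
'''

job = BeautifulSoup(html, 'html.parser').find('div', class_='result')
title = job.find('h2').find('a').text.strip()   # 'Python Developer'
link = job.find('h2').find('a')['href']         # '/rc/clk?jk=123'
location = job.find(class_='location').text     # 'New York, NY'
```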
page_2 = requests.get('https://www.indeed.com/jobs?q=python&l=new+york&start=20')
Every 10 results means you're on a new page. Let's make that an argument to a function:
def get_jobs(page=1):
    """Fetches the HTML from a search for Python jobs in New York on Indeed.com from a specified page."""
    base_url_indeed = 'https://www.indeed.com/jobs?q=python&l=new+york&start='
    results_start_num = page * 10  # each results page shows 10 jobs
    url = f'{base_url_indeed}{results_start_num}'
    response = requests.get(url)
    return response
get_jobs(3)
get_jobs(4)
Great! Let's customize this function some more to allow for different search queries and search locations:
def get_jobs(title, location, page=1):
    """Fetches the HTML from a search for the given job title and location on Indeed.com from a specified page."""
    query = title.replace(' ', '+')  # for multi-word job titles
    loc = location.replace(' ', '+')  # for multi-part locations
    base_url_indeed = f'https://www.indeed.com/jobs?q={query}&l={loc}&start='
    results_start_num = page * 10
    url = f'{base_url_indeed}{results_start_num}'
    response = requests.get(url)
    return response
get_jobs('python', 'new york', 3)
With a generalized way of scraping the page done, you can move on to picking out the information you need by parsing the HTML.
Let's start by getting access to all interesting search results for one page:
site = get_jobs('python', 'new york')
soup = BeautifulSoup(site.content, 'html.parser')
results = soup.find(id='resultsCol')
jobs = results.find_all('div', class_='result')
Job Titles can be found like this:
job_titles = [job.find('h2').find('a').text.strip() for job in jobs]
job_titles
Link URLs need to be assembled, and can be found like this:
base_url = 'https://www.indeed.com'
job_links = [base_url + job.find('h2').find('a')['href'] for job in jobs]
job_links
Locations can be picked out of the soup by their class name:
job_locations = [job.find(class_='location').text for job in jobs]
job_locations
Let's assemble all this info into a function, so you can pick out the pieces and save them to a useful data structure:
def parse_info(soup):
    """
    Parses HTML containing job postings and picks out job title, location, and link.

    args:
        soup (BeautifulSoup object): A parsed bs4.BeautifulSoup object of a search results page on indeed.com

    returns:
        job_list (list): A list of dictionaries containing the title, link, and location of each job posting
    """
    results = soup.find(id='resultsCol')
    jobs = results.find_all('div', class_='result')
    base_url = 'https://www.indeed.com'
    job_list = list()
    for job in jobs:
        title = job.find('h2').find('a').text.strip()
        link = base_url + job.find('h2').find('a')['href']
        location = job.find(class_='location').text
        job_list.append({'title': title, 'link': link, 'location': location})
    return job_list
Let's give it a try:
page = get_jobs('python', 'new york')
soup = BeautifulSoup(page.content, 'html.parser')
results = parse_info(soup)
results
And let's add a final step of generalization:
def get_job_listings(title, location, amount=100):
    """Scrapes and parses result pages until roughly `amount` job listings have been collected."""
    results = list()
    for page in range(amount // 10):  # 10 results per page
        site = get_jobs(title, location, page=page)
        soup = BeautifulSoup(site.content, 'html.parser')
        page_results = parse_info(soup)
        results += page_results
    return results
r = get_job_listings('python', 'new york', 100)
len(r)
r[42]
Currently you are only fetching the title, link and location of the job. Change that to also fetch the company name. Maybe you also want to know the beginning of the text blurb describing what the job is about? You could also build this script out to follow the links you gathered and fetch the individual job listing details pages for even more information.
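As a starting point for the company-name extension, here's a hedged sketch. It assumes the company name lives in an element with the class name company, which is an assumption you should verify against the live markup (the sample snippet below is invented purely for illustration):

```python
from bs4 import BeautifulSoup

def parse_job(job):
    """Parse one result <div> into a dict, including the company name.

    The 'company' class name is an assumption; check the live page's markup.
    """
    return {
        'title': job.find('h2').find('a').text.strip(),
        'location': job.find(class_='location').text,
        # hypothetical class name -- verify against the actual HTML
        'company': job.find(class_='company').text.strip(),
    }

# Invented sample snippet for testing the parser locally
html = '''<div class="result">
  <h2><a href="/rc/clk?jk=1">Python Developer</a></h2>
  <span class="company">ACME Corp</span>
  <span class="location">New York, NY</span>
</div>'''
job = BeautifulSoup(html, 'html.parser').find('div', class_='result')
parse_job(job)
```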
The sky is the limit, and the more you train, the better you will get at this. :)