Web scraping is a technique for automatically extracting content from a website. If you’ve ever copied and pasted content from a website into an Excel spreadsheet, that is essentially web scraping, just on a very small scale. To automate it, all you need is a small Python script and some familiarity with a library called BeautifulSoup. Then you just define which elements of a website you would like to extract and hit run!
So let’s start with the basics: installing all the necessary libraries.
pip install requests beautifulsoup4
Requests sends an HTTP request to a website and returns the response, including the page's full HTML code. BeautifulSoup then takes that HTML and filters out the target elements you define.
We will scrape the content of a job portal, using the search URL assigned to the variable URL in the script below, and then extract content from each job description on the first page.
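import requests

# Search results page for Python jobs on the job portal
URL = "https://de.indeed.com/jobs?q=Python&l&from=searchOnHP&vjk=26417a569ba44519"
page = requests.get(URL)  # send an HTTP GET request
print(page.content)  # raw HTML of the response, as bytes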
This will output the whole HTML content for the given URL.
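The request step can be made a little more defensive. The following is only a minimal sketch, not part of the original script: the helper names build_search_url and fetch_html, the constant BASE_URL, and the 10-second timeout are assumptions chosen for illustration, and the URL is reduced to its q parameter. It shows how Requests can build the query string for you and how to fail early on an HTTP error instead of parsing an error page.

```python
import requests

# Assumption: only the base of the search URL from the script above.
BASE_URL = "https://de.indeed.com/jobs"


def build_search_url(query: str) -> str:
    """Return the search URL for `query` without sending a request."""
    request = requests.Request("GET", BASE_URL, params={"q": query})
    return request.prepare().url


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download `url` and return the raw HTML bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content


if __name__ == "__main__":
    # Only builds and prints the URL; no network access happens here.
    print(build_search_url("Python"))
```

Calling fetch_html(build_search_url("Python")) would then perform the actual download. Keep in mind that a site may reject automated requests, in which case raise_for_status() raises an exception rather than returning HTML.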
Now let’s initialize BeautifulSoup to do some more advanced content filtering.
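from bs4 import BeautifulSoup

# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(page.content, "html.parser")
# Select the element whose id attribute is "resultsBody"
results = soup.find(id="resultsBody")
print(results.prettify())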
Here we pass the whole HTML code to BeautifulSoup and then select the HTML element with the id resultsBody, which represents the body content including all job descriptions. To find the id, open the URL in a browser of your choice and press F12 (or Ctrl + Shift + I in most browsers) to open the developer tools. There we can see the whole HTML. Right-clicking an element on the page and choosing Inspect selects the corresponding part of the HTML.
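The id lookup can also be tried offline on a small hand-written snippet. This is a sketch under stated assumptions: the HTML below is a hypothetical, heavily simplified stand-in for the real page, and the id doesNotExist is made up to show what happens when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML standing in for the real page.
html = """
<html><body>
  <div id="resultsBody"><p>Job list goes here</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element ...
results = soup.find(id="resultsBody")
# ... or None when no element matches.
missing = soup.find(id="doesNotExist")

print(results.name)
print(results.get_text(strip=True))
print(missing)
```

Because find() returns None for a missing element, calling results.prettify() on a page whose layout has changed would raise an AttributeError, so checking for None first is a sensible habit.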
To narrow down the content, we can select the individual job descriptions by looking for the right element, which in this case is every div element with the class name slider_item.
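# Collect every <div> with the class "slider_item" (one per job posting)
job_elements = results.find_all("div", class_="slider_item")
for job_element in job_elements:
    print(job_element, end="\n" * 2)  # blank line between postings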
The function find_all collects every matching HTML element and returns them in a list. This means we can loop through the list to do further extractions like printing out the job title or company. So let’s do that!
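To see the list behaviour of find_all in isolation, here is a small sketch on hypothetical HTML. The snippet and its texts are invented for illustration; only the class name slider_item is taken from the tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two job cards and one unrelated <div>.
html = """
<div id="resultsBody">
  <div class="slider_item">First job</div>
  <div class="slider_item">Second job</div>
  <div class="other">Not a job</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like result with every match, in document order.
job_elements = soup.find_all("div", class_="slider_item")

# The result supports len() and iteration like a normal list.
texts = [element.get_text(strip=True) for element in job_elements]

print(len(job_elements))
print(texts)
```

If nothing matches, find_all returns an empty list, so a loop over the result simply does nothing.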
for job_element in job_elements:
    title_element = job_element.find("h2", class_="jobTitle")
    company_element = job_element.find("span", class_="companyName")
    location_element = job_element.find("div", class_="companyLocation")
    print(title_element)
    print(company_element)
    print(location_element)
    print(title_element.text)  # plain text only, without the HTML tags
The function find() takes two arguments here: the element type and a class name that identifies the target element. In our case we have three different element types (h2, span and div). Printing the filtered element shows the target HTML content. Most of the time we are only interested in the plain text, so we can append .text to the variable to extract just the text.
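The extraction loop can be extended to collect the plain text into dictionaries. This is a minimal sketch, assuming that some job cards may lack one of the elements; the sample HTML, the company and job names, and the helper text_or_none are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical job cards; the second one has no company element.
html = """
<div class="slider_item">
  <h2 class="jobTitle">Python Developer</h2>
  <span class="companyName">Example GmbH</span>
  <div class="companyLocation">Berlin</div>
</div>
<div class="slider_item">
  <h2 class="jobTitle">Data Engineer</h2>
  <div class="companyLocation">Hamburg</div>
</div>
"""


def text_or_none(element):
    """Return the stripped text of `element`, or None if it was not found."""
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(html, "html.parser")

jobs = []
for job_element in soup.find_all("div", class_="slider_item"):
    jobs.append(
        {
            "title": text_or_none(job_element.find("h2", class_="jobTitle")),
            "company": text_or_none(job_element.find("span", class_="companyName")),
            "location": text_or_none(job_element.find("div", class_="companyLocation")),
        }
    )

for job in jobs:
    print(job)
```

Guarding against None matters because find() returns None when an element is missing, and accessing .text on None raises an AttributeError.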