Create a web crawler for extracting and processing websites’ data.
Steps Involved in Web Crawling
To follow this tutorial step by step with me, you’ll need Python 3 already configured on your local development machine. You can set up everything you need beforehand and then come back to continue.
Creating a Basic Web Scraper
Web Scraping is a two-step process:
- You send an HTTP request and get the source code of a web page.
- You take that source code and extract information from it.
Both these steps can be implemented in numerous ways in various languages. But we will be using Python’s built-in urllib.request module and the bs4 (BeautifulSoup4) package to perform them. Only bs4 needs to be installed:
- pip install beautifulsoup4
If you want to install BeautifulSoup4 without using pip or you face any issues during installation you can always refer to the official documentation.
- Create a new folder 📂 : With bs4 ready to be utilized, let’s create a new folder for our lab inside any code editor you want (I will be using Microsoft Visual Studio Code). You can do this like shown:
- Navigate inside the folder 🧭 : Now, go into the new directory you just created in the last step.
- Create .py file 🐍 : Then create a new Python file named crawler.py. We’ll write all the code for scraping in this file for this lab. You can create this file like this:
We’ll begin by creating a very basic crawler that is based upon bs4. To do that, we need to:
- Import both the urllib.request and bs4 packages
- Provide a URL that you want to crawl data from
Open your crawler.py file in the text editor and write this code to get started with a basic crawler:
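Here is a minimal sketch of such a starter file, pieced together from the individual lines discussed below. Wrapping the steps in a function (the name fetch_soup is hypothetical, not part of the original lab) and naming the "html.parser" parser explicitly are assumptions made here so the snippet can be reused and tested:

```python
import urllib.request as req

import bs4

# Search results page used throughout this tutorial.
URL = "https://www.indeed.co.in/python-jobs"


def fetch_soup(url):
    """Download a page and return it parsed by bs4 (hypothetical helper)."""
    # Step 1: send the HTTP request and receive a response object.
    with req.urlopen(url) as response:
        # Step 2: hand the response to bs4, naming the parser explicitly.
        return bs4.BeautifulSoup(response, "html.parser")
```

Calling fetch_soup(URL) at the bottom of crawler.py performs both steps in one go. Keep in mind that a live site may reject or throttle automated requests, so the call can fail even when the code is correct.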
Let’s talk about what the above code actually does:
Import request package
First, we import the request module from urllib (a standard-library package containing several modules for working with URLs and HTTP requests). It provides a function we can use to make an HTTP request to the website we are trying to scrape and get the complete source code of its webpage.
import urllib.request as req
Import BeautifulSoup4 package
Next, we bring in the bs4 package that we installed using pip. Think of bs4 as a specialized package for reading HTML or XML data. bs4 has methods and behaviours that allow us to extract data from the webpage source code we provide to it, but it doesn’t know what data to look for or where to look for it.
We will help it to gather information from the webpage and return that info back to us.
Provide the URL for webpage
Finally, we provide the crawler with the URL of the webpage from which we want to start gathering data: https://www.indeed.co.in/python-jobs.
If you paste this URL in your browser, you will reach Indeed’s search results page, showing the most relevant jobs out of 11K jobs listing Python as a required skill.
Next, we will send an HTTP request to this URL.
URL = "https://www.indeed.co.in/python-jobs"
Making an HTTP request
Now let’s make a request to Indeed for the search results page over HTTPS. You typically make this request with urlopen() from Python’s urllib.request module. However, the HTTP response we get back is just an object wrapping the raw bytes of the page, and we cannot do anything useful with it directly. So, we will hand this object over to bs4 to parse the source code. Send a request to a particular website like this:
response = req.urlopen(URL)
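To see what this response object actually holds, here is a small sketch. It assumes a data: URL as a stand-in for a real page so that it runs without a network connection; DEMO_URL and the other variable names are illustrative only. With an http(s) URL the same calls would go over the network:

```python
import urllib.request as req

# A data: URL stands in for a real web page so the example runs offline.
DEMO_URL = "data:text/html,<h1>Jobs</h1>"

with req.urlopen(DEMO_URL) as response:
    # The headers describe the body, for example its content type.
    content_type = response.headers.get_content_type()
    # read() returns the body as raw bytes, not as a string.
    raw_bytes = response.read()

# Decode the bytes to get the source code as text.
html_text = raw_bytes.decode("utf-8")
print(content_type)
print(html_text)
```

This is the manual route. In the lab we skip the read-and-decode step and let bs4 do it for us.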
Extracting the source code
Now let’s extract the source code from the response object. You will generally do this by feeding the response object to the BeautifulSoup class inside the bs4 package. It is also worth naming the parser explicitly (here Python’s built-in html.parser); otherwise bs4 picks one for you and prints a warning. The resulting source code is very large and tedious to read through, so we will filter the information out of it later on. Hand over the response object to BeautifulSoup by writing the following line:
htmlSourceCode = bs4.BeautifulSoup(response, "html.parser")
Testing the crawler
Now let’s test out the code. You can run a Python file with a command like python <filename> in the integrated terminal of VS Code. VS Code also has a graphical play button that runs whichever file is currently open in the editor. For this lab, run the file from the terminal with python crawler.py.
You’ll see some HTML code like this (if you print out the HTML source code extracted by bs4):
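As a rough illustration of printing the parsed source, the sketch below uses a tiny stand-in page instead of the real download (the markup in sample_html is made up, not Indeed’s). It relies on prettify( ), which returns the parsed document as an indented string:

```python
import bs4

# Tiny stand-in for the downloaded page (not Indeed's real markup).
sample_html = (
    "<html><head><title>Python Jobs</title></head>"
    "<body><p>Hello</p></body></html>"
)

htmlSourceCode = bs4.BeautifulSoup(sample_html, "html.parser")

# prettify() returns the whole document as an indented string.
pretty = htmlSourceCode.prettify()
print(pretty)
```

On the real page the same print call produces a very long listing, which is why we filter it down in the next section.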
Extracting data from HTML
We’ve written a very basic program that pulls down a page, but it doesn’t do anything interesting yet. Let’s extract some juice out of it 🍹.
If you visited the page we are trying to crawl in your browser, you will have noticed that it has a structure like this:
- There’s a navbar that’s present on almost every page of the website.
- There’s some search toolbar by which you can search for jobs having some particular keywords or for a specific location, then there’s a little area having sort options and the count of jobs relevant to our search, and a small sidebar for the site.
- Then there are the jobs themselves, displayed in an ordered list, and, if you would notice, each job posting has a similar format.
When writing code for crawling a page, it’s a good idea to look at the source code and get familiar with the structure to make it easy for yourself to build the logic of how to extract the required information. So, here it is, with a little enhanced readability:
Scraping the actual data
Scraping data from the webpage’s source code is a two-step process:
- First, we will grab each job posting by looking for a part on the page that can give us a single job posting.
- Then, for each unique job posting, we will pull the data we want from it or from the HTML tags surrounding it.
bs4 lets us locate the data we want in two ways. Its find( ) and find_all( ) methods search by tag name and attributes such as class, while select( ) and select_one( ) accept CSS selectors, which are patterns that match one or more elements on a page. Note that bs4 does not support XPath selectors; those require a different library such as lxml.
We’ll search by tag name and class for now, as that is the easiest fit for our current purpose of finding all the jobs on the page. If you were vigilant while looking at the HTML code shown above, you would have noticed that each job posting is wrapped in a <div> tag whose class attribute includes jobsearch-SerpJobCard. All we need to do is pass the name of the tag enclosing our job posting and the value of its class attribute into the find_all( ) method of the parsed htmlSourceCode object (not the raw response object), like this:
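A minimal sketch of that call, assuming a simplified page: the markup in sample_html is hypothetical and only mirrors the structure described above (a navbar plus job cards carrying the jobsearch-SerpJobCard class), so the example runs without downloading anything:

```python
import bs4

# Simplified, hypothetical markup modelled on the structure described above.
sample_html = """
<div class="navbar">Find jobs</div>
<div class="jobsearch-SerpJobCard row result"><h2 class="title">Python Developer</h2></div>
<div class="jobsearch-SerpJobCard row result"><h2 class="title">Data Engineer</h2></div>
"""

htmlSourceCode = bs4.BeautifulSoup(sample_html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
# It matches any element whose class list contains the given name.
jobPostings = htmlSourceCode.find_all("div", class_="jobsearch-SerpJobCard")

for jobPosting in jobPostings:
    print(jobPosting)  # prints the HTML block of one job card
```

The navbar <div> is skipped because it does not carry the job card class, which is exactly the filtering we want.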
This code extracts all the divs on the page containing job postings and iterates over them to print HTML blocks of each individual job. Now, let’s extract the information from these job postings so we can display only the details, and not the HTML code.
Another look at the HTML code of the page we’re parsing tells us that the title of each job is stored within an <h2> tag having class attribute value as “title”:
The jobPosting object that we get out of the jobPostings list has its own find( ) method, so we can pass in a tag name and class just like before to locate its child elements. Extend your code as follows to locate the title of each job and display it:
Note: jobTitle.text inside print( ) is necessary because jobTitle, after all, is an <h2> tag (an HTML element), so printing it directly would give us lots of HTML markup along with the title of the job. So, I have used the .text property of this element to print out only the actual title of the job without any HTML garnishing.
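The difference is easy to see on a single job card. The sketch below again assumes made-up markup (sample_html, including the link inside the heading, is hypothetical) and prints the element both ways:

```python
import bs4

# Hypothetical job card; the real page's markup is more involved.
sample_html = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a href="/job/1">Python Developer</a></h2>
</div>
"""

htmlSourceCode = bs4.BeautifulSoup(sample_html, "html.parser")
jobPosting = htmlSourceCode.find("div", class_="jobsearch-SerpJobCard")

# find() returns the first matching descendant, or None if there is none.
jobTitle = jobPosting.find("h2", class_="title")

print(jobTitle)       # the whole <h2> element, markup included
print(jobTitle.text)  # only the text inside it
```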
If you run your code again, then this time you’ll see the titles of the jobs listed in the output:
Take another look at the HTML code of a job so we can extract more details, using new selectors for the company name, job location and salary:
Checking out jobs details
We can work out a few things by examining the HTML snippet shown above:
- The name of the company that posted the job is stored inside a <span> tag with a class="company" attribute. Getting the <span class="company"> may seem a little trickier, but it doesn’t matter that it is nested in two other <div> tags. We can very easily extract just the <span> tag by passing its tag name and class to find( ), just like we did when we grabbed the title of each job.
- We’ll use another CSS selector for <div> containing the job location. To grab this value, we can just use “location” class instead of using everything written inside the class attribute of <div> tag because it’s too lengthy and doesn’t offer any extra benefit.
- Getting the salary range of a job is similar to getting the company name offering the job. There’s a <span> tag that contains the monthly or annual salary range, and has got a class attribute also.
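The first two findings can be sketched as follows. The markup is hypothetical: the wrapper class "details" and the extra "extra-styling" class on the location <div> are invented for illustration, and only the "company" and "location" class names come from the discussion above:

```python
import bs4

# Hypothetical job card with the company nested inside two <div> tags.
sample_html = """
<div class="jobsearch-SerpJobCard">
  <div class="details">
    <div>
      <span class="company"> Acme Corp </span>
    </div>
    <div class="location extra-styling">Bengaluru, Karnataka</div>
  </div>
</div>
"""

htmlSourceCode = bs4.BeautifulSoup(sample_html, "html.parser")
jobPosting = htmlSourceCode.find("div", class_="jobsearch-SerpJobCard")

# find() searches all descendants, so the nesting depth does not matter.
company = jobPosting.find("span", class_="company")

# Passing just "location" is enough even though the tag has other classes.
location = jobPosting.find("div", class_="location")

print(company.text.strip())
print(location.text.strip())
```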
Expanding crawler abilities
Now comes the hard part: putting the above findings into practice. If you try to crawl the data, you will run into two hiccups. We’ll discuss these hiccups later on, but first let’s see the code that carries out the points discussed above.
So, let’s take on this part of getting more details from the job postings:
Now you’ll see all this tough code resulting in some new data getting printed along with the job title:
Next step is to turn this crawler into a spider that follows links.
But before that, we need to discuss how the code works and how it handles both the issues I was talking about earlier. The first hiccup is fairly obvious: not every company discloses the salary when posting a job. So, if you try to extract the salary for each job, find( ) will sometimes return None, and reading .text from it raises an error. Hence, I used a single-line if-else to handle this issue. If salaryRange comes out to be None, the salary has not been disclosed, so I print a message saying so; otherwise the salary text is extracted and printed out on the console.
Second, if you go through the HTML code of multiple jobs, you will recognize that the location is sometimes stored in a <span> and other times in a <div>, so I had to use an if-else to handle this as well. And, as a text cleaner, I have also used strip( ) to remove leading and trailing spaces from the data we get from the crawler, to make it more presentable.
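Both workarounds can be sketched in one place. Everything specific here is an assumption for illustration: the markup is made up, the helper name extract_details is hypothetical, and so is the salary class name "salaryText", since the actual class name is not given above:

```python
import bs4

# Two hypothetical job cards: the second has no salary and keeps its
# location in a <span> instead of a <div>.
sample_html = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title"> Python Developer </h2>
  <span class="company">Acme Corp</span>
  <div class="location">Bengaluru</div>
  <span class="salaryText"> 50,000 a month </span>
</div>
<div class="jobsearch-SerpJobCard">
  <h2 class="title"> Data Engineer </h2>
  <span class="company">Globex</span>
  <span class="location">Pune</span>
</div>
"""


def extract_details(jobPosting):
    """Return the details of one job card as a dict of cleaned strings."""
    jobTitle = jobPosting.find("h2", class_="title")
    company = jobPosting.find("span", class_="company")

    # Hiccup 1: find() returns None when the salary is not disclosed.
    # "salaryText" is an assumed class name.
    salaryRange = jobPosting.find("span", class_="salaryText")
    salary = salaryRange.text.strip() if salaryRange is not None else "Salary not disclosed"

    # Hiccup 2: the location may sit in a <div> or in a <span>.
    location = jobPosting.find("div", class_="location")
    if location is None:
        location = jobPosting.find("span", class_="location")

    return {
        "title": jobTitle.text.strip(),
        "company": company.text.strip(),
        "location": location.text.strip(),
        "salary": salary,
    }


htmlSourceCode = bs4.BeautifulSoup(sample_html, "html.parser")
jobs = [
    extract_details(jobPosting)
    for jobPosting in htmlSourceCode.find_all("div", class_="jobsearch-SerpJobCard")
]
for job in jobs:
    print(job)
```

Returning a dict per job, rather than printing inside the loop, is a design choice of this sketch; it keeps the extraction separate from the display and makes the results easy to check.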