Scraping Indeed Job Data Using Python
In this article, we will see how to scrape Indeed job data using Python. We will use Beautiful Soup and the requests module to scrape the data.
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the command below in the terminal.
pip install bs4
- requests: Requests lets you send HTTP/1.1 requests extremely easily. This module also does not come built in with Python. To install it, type the command below in the terminal.
pip install requests
- Import all the required modules.
- Pass the URL to the user-defined getdata() function, which sends a request to that URL and returns the response. We use the GET method to retrieve information from the server at the given URL.
- Extract the HTML text from that response.
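The request step above can be sketched as follows. This is a minimal sketch, not the definitive implementation: it folds the request and the conversion to HTML text into one function, and the session and timeout parameters are assumptions added here for illustration.

```python
import requests


def getdata(url, session=None, timeout=10):
    """Send a GET request to `url` and return the response body as HTML text."""
    # `session` is a hypothetical hook: any object with a requests-style
    # .get() method can be passed in. By default the requests module is used.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of returning an error page.
    response.raise_for_status()
    # .text is the response body decoded to a str, i.e. the HTML of the page.
    return response.text
```

Calling getdata() with a search URL returns the page's HTML as a string, which is what the parsing step expects.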
The given image shows the search link: when we search for a job and its location, the URL becomes something like https://in.indeed.com/jobs?q="+job+"&l="+Location. Hence we will build our URL string in this format.
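One way to build this URL is sketched below. The function name build_search_url and the BASE_URL constant are hypothetical names chosen for this example. Instead of plain string concatenation, the sketch uses urlencode from the standard library so that spaces and special characters in the job title or location are escaped correctly.

```python
from urllib.parse import urlencode

# Base search endpoint taken from the URL pattern shown above.
BASE_URL = "https://in.indeed.com/jobs"


def build_search_url(job, location):
    """Return the search URL for a job title and a location."""
    # urlencode escapes each value, e.g. a space becomes "+".
    query = urlencode({"q": job, "l": location})
    return f"{BASE_URL}?{query}"


print(build_search_url("data scientist", "New Delhi"))
# https://in.indeed.com/jobs?q=data+scientist&l=New+Delhi
```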
- Now parse the HTML content using bs4.
Syntax: soup = BeautifulSoup(r.content, 'html.parser')
- r.content : It is the raw HTML content.
- html.parser : Specifying the HTML parser we want to use.
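To see what the parsing step produces, here is a small sketch that parses a hard-coded HTML string instead of a live response. The HTML snippet is made up for this example. The parser used is html.parser, which ships with Python.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for the HTML returned by the request.
html = """
<html>
  <head><title>Job Search</title></head>
  <body><h1>Python Developer jobs</h1></body>
</html>
"""

# Build the parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = soup.title.get_text()
heading = soup.h1.get_text()

print(page_title)  # Job Search
print(heading)     # Python Developer jobs
```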
- Now filter the required data using the soup.find_all() function.
- Now find all the a tags where class_ = "jobtitle turnstileLink". You can open the webpage in the browser and inspect the relevant element by right-clicking it, as shown in the figure.
- Find the company name and address using the same method.
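The filtering steps above can be sketched on a small sample. The HTML below is invented for this example, and the class names company and location are placeholders (only jobtitle turnstileLink comes from the text above), so inspect the live page to confirm the actual tags and class names before relying on them.

```python
from bs4 import BeautifulSoup

# Invented sample markup; real search results will differ.
SAMPLE_HTML = """
<div class="result">
  <a class="jobtitle turnstileLink" href="/job/1">Python Developer</a>
  <span class="company">Acme Corp</span>
  <span class="location">Pune, Maharashtra</span>
</div>
<div class="result">
  <a class="jobtitle turnstileLink" href="/job/2">Data Analyst</a>
  <span class="company">Globex</span>
  <span class="location">Bengaluru, Karnataka</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag; class_ is used because "class"
# is a reserved word in Python. A value with a space matches the exact
# class attribute string.
titles = [a.get_text(strip=True)
          for a in soup.find_all("a", class_="jobtitle turnstileLink")]

# "company" and "location" are placeholder class names (assumptions).
companies = [s.get_text(strip=True)
             for s in soup.find_all("span", class_="company")]
locations = [s.get_text(strip=True)
             for s in soup.find_all("span", class_="location")]

print(titles)     # ['Python Developer', 'Data Analyst']
print(companies)  # ['Acme Corp', 'Globex']
print(locations)  # ['Pune, Maharashtra', 'Bengaluru, Karnataka']
```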
The code for this implementation is divided into user-defined functions to make it more readable and easier to use.
- geturl(): gets the URL from which the data is to be scraped
- html_code(): gets the HTML code of the given URL
- job_data(): filters out the job data
- Company_data(): filters out the company data
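A possible layout for these four functions is sketched below. The function names follow the list above, but their parameters, the scrape() helper, the fetch hook, and the company class name are assumptions made for this sketch, not the original implementation.

```python
import requests
from urllib.parse import urlencode
from bs4 import BeautifulSoup


def geturl(job, location):
    """Build the URL to scrape (taking job and location is an assumption)."""
    return "https://in.indeed.com/jobs?" + urlencode({"q": job, "l": location})


def html_code(url, fetch=None):
    """Get the HTML of `url` and return it as a BeautifulSoup object."""
    # `fetch` is a hypothetical hook: a callable that takes a URL and
    # returns HTML text. By default the page is downloaded with requests.
    if fetch is None:
        def fetch(u):
            return requests.get(u, timeout=10).text
    return BeautifulSoup(fetch(url), "html.parser")


def job_data(soup):
    """Filter out the job titles."""
    return [a.get_text(strip=True)
            for a in soup.find_all("a", class_="jobtitle turnstileLink")]


def Company_data(soup):
    """Filter out the company names ("company" is a placeholder class name)."""
    return [s.get_text(strip=True)
            for s in soup.find_all("span", class_="company")]


def scrape(job, location, fetch=None):
    """Run the steps in order and return (job titles, company names)."""
    soup = html_code(geturl(job, location), fetch=fetch)
    return job_data(soup), Company_data(soup)
```

For example, scrape("python developer", "Pune") would request the search page and return two lists. Keeping each step in its own function means a selector can be updated in one place if the page layout changes.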