Skip to content
Related Articles

Related Articles

Improve Article

Scrape LinkedIn Using Selenium And Beautiful Soup in Python

  • Difficulty Level : Expert
  • Last Updated : 14 Sep, 2021
Geek Week

In this article, we are going to scrape Linkedln using Selenium and Beautiful Soup libraries in Python.

First of all, we need to install some libraries. Execute the following commands in the terminal.

pip install selenium 
pip install beautifulsoup4

In order to use selenium, we also need a web driver. You can download the web driver of either Internet Explorer, Firefox, or Chrome. In this article, we will be using the Chrome web driver.

Note: While following along with this article, if you get an error, there are most likely 2 possible reasons for that.

  1. The webpage took too long to load (probably because of a slow internet connection). In this case, use time.sleep() function to provide extra time for the webpage to load. Specify the number of seconds to sleep as per your need.
  2. The HTML of the webpage has changed from the one when this article was written. If so, you will have to manually select the required webpage elements, instead of copying the element names written below. How to find the element names is explained below. Additionally, don’t decrease the window height and width from the default height and width. It also changes the HTML of the webpage.

Logging in to LinkedIn

Here we will write code for login into Linkedin, First, we need to initiate the web driver using selenium and send a get request to the URL and Identify the HTML document and find the input tags and button tags that accept username/email, password, and sign-in button.



LinkedIn Login Page

Code:

Python3




from selenium import webdriver
from bs4 import BeautifulSoup
import time
  
# Creating a webdriver instance
driver = webdriver.Chrome("Enter-Location-Of-Your-Web-Driver")
# This instance will be used to log into LinkedIn
  
# Opening linkedIn's login page
  
# waiting for the page to load
time.sleep(5)
  
# entering username
username = driver.find_element_by_id("username")
  
# In case of an error, try changing the element
# tag used here.
  
# Enter Your Email Address
username.send_keys("User_email")  
  
# entering password
pword = driver.find_element_by_id("password")
# In case of an error, try changing the element 
# tag used here.
  
# Enter Your Password
pword.send_keys("User_pass")        
  
# Clicking on the log in button
# Format (syntax) of writing XPath --> 
# //tagname[@attribute='value']
driver.find_element_by_xpath("//button[@type='submit']").click()
# In case of an error, try changing the
# XPath used here.

After executing the above command, you will be logged into your LinkedIn profile. Here is what it would look like.

Part 1 Code Execution

Extracting Data From a LinkedIn Profile

Here is the video of the execution of the complete code.

Part 2 Code Execution

2.A) Opening a Profile and Scrolling to the Bottom

Let us say that you want to extract data from Kunal Shah’s LinkedIn profile. First of all, we need to open his profile using the URL of his profile. Then we have to scroll to the bottom of the web page so that the complete data gets loaded.

Python3




from selenium import webdriver
from bs4 import BeautifulSoup
import time
  
# Creating an instance
driver = webdriver.Chrome("Enter-Location-Of-Your-Web-Driver")
  
# Logging into LinkedIn
time.sleep(5)
  
username = driver.find_element_by_id("username")
username.send_keys("")  # Enter Your Email Address
  
pword = driver.find_element_by_id("password")
pword.send_keys("")        # Enter Your Password
  
driver.find_element_by_xpath("//button[@type='submit']").click()
  
# Opening Kunal's Profile
# paste the URL of Kunal's profile here
  
driver.get(profile_url)        # this will open the link

Output:

Kunal Shah – LinkedIn Profile

Now, we need to scroll to the bottom. Here is the code to do that:



Python3




start = time.time()
  
# will be used in the while loop
initialScroll = 0
finalScroll = 1000
  
while True:
    driver.execute_script(f"window.scrollTo({initialScroll},
                                            {finalScroll})
                          ")
    # this command scrolls the window starting from
    # the pixel value stored in the initialScroll 
    # variable to the pixel value stored at the
    # finalScroll variable
    initialScroll = finalScroll
    finalScroll += 1000
  
    # we will stop the script for 3 seconds so that 
    # the data can load
    time.sleep(3)
    # You can change it as per your needs and internet speed
  
    end = time.time()
  
    # We will scroll for 20 seconds.
    # You can change it as per your needs and internet speed
    if round(end - start) > 20:
        break

The page is now scrolled to the bottom. As the page is completely loaded, we will scrape the data we want.

Extracting Data from the Profile

To extract data, firstly, store the source code of the web page in a variable. Then, use this source code to create a Beautiful Soup object.

Python3




src = driver.page_source
  
# Now using beautiful soup
soup = BeautifulSoup(src, 'lxml')

Extracting Profile Introduction:

To extract the profile introduction, i.e., the name, the company name, and the location, we need to find the source code of each element. First, we will find the source code of the div tag that contains the profile introduction.

Chrome – Inspect Elements

Now, we will use Beautiful Soup to import this div tag into python.

Python3




# Extracting the HTML of the complete introduction box
# that contains the name, company name, and the location
intro = soup.find('div', {'class': 'pv-text-details__left-panel'})
  
print(intro)

Output:

(Scribbled) Introduction HTML 

We now have the required HTML to extract the name, company name, and location. Let’s extract the information now:

Python3






# In case of an error, try changing the tags used here.
  
name_loc = intro.find("h1")
  
# Extracting the Name
name = name_loc.get_text().strip()
# strip() is used to remove any extra blank spaces
  
works_at_loc = intro.find("div", {'class': 'text-body-medium'})
  
# this gives us the HTML of the tag in which the Company Name is present
# Extracting the Company Name
works_at = works_at_loc.get_text().strip()
  
  
location_loc = intro.find_all("span", {'class': 'text-body-small'})
  
# Ectracting the Location
# The 2nd element in the location_loc variable has the location
location = location_loc[1].get_text().strip()
  
print("Name -->", name,
      "\nWorks At -->", works_at,
      "\nLocation -->", location)

Output:

Name --> Kunal Shah 
Works At --> Founder : CRED 
Location --> Bengaluru, Karnataka, India

Extracting Data from the Experience Section

Next, we will extract the Experience from the profile.

HTML of Experience Section

Python3




# Getting the HTML of the Experience section in the profile
experience = soup.find("section", {"id": "experience-section"}).find('ul')
  
print(experience)

Output:

Experience HTML Output

We have to go inside the HTML tags until we find our desired information. In the above image, we can see the HTML to extract the current job title and the name of the company. We now need to go inside each tag to extract the data

Scrape Job Title, company name and experience:

Python3




# In case of an error, try changing the tags used here.
  
li_tags = experience.find('div')
a_tags = li_tags.find("a")
job_title = a_tags.find("h3").get_text().strip()
  
print(job_title)
  
company_name = a_tags.find_all("p")[1].get_text().strip()
print(company_name)
  
joining_date = a_tags.find_all("h4")[0].find_all("span")[1].get_text().strip()
  
employment_duration = a_tags.find_all("h4")[1].find_all(
    "span")[1].get_text().strip()
  
print(joining_date + ", " + employment_duration)

Output:

'Founder'
'CRED'
Apr 2018 – Present, 3 yrs 6 mos

Extracting Job Search Data

We will use selenium to open the jobs page.

Python3




jobs = driver.find_element_by_xpath("//a[@data-link-to='jobs']/span")
# In case of an error, try changing the XPath.
  
jobs.click()

Now that the jobs page is open, we will create a BeautifulSoup object to scrape the data.



Python3




job_src = driver.page_source
  
soup = BeautifulSoup(job_src, 'lxml')   

Scrape Job Title:

First of all, we will scrape the Job Titles.

HTML of Job Title

On skimming through the HTML of this page, we will find that each Job Title has the class name “job-card-list__title”. We will use this class name to extract the job titles.

Python3




jobs_html = soup.find_all('a', {'class': 'job-card-list__title'})
# In case of an error, try changing the XPath.
  
job_titles = []
  
for title in jobs_html:
    job_titles.append(title.text.strip())
  
print(job_titles)

 Output:

Job Titles List

Scrape Company Name:

Next, we will extract the Company Name.

HTML of Company Name

We will use the class name to extract the names of the companies:

Python3




company_name_html = soup.find_all(
  'div', {'class': 'job-card-container__company-name'})
company_names = []
  
for name in company_name_html:
    company_names.append(name.text.strip())
  
print(company_names)

Output:

Company Names List

Scrape Job Location:

Finally, we will extract the Job Location.

HTML of Job Location

Once again, we will use the class name to extract the location.

Python3




import re   # for removing the extra blank spaces
  
location_html = soup.find_all(
    'ul', {'class': 'job-card-container__metadata-wrapper'})
  
location_list = []
  
for loc in location_html:
    res = re.sub('\n\n +', ' ', loc.text.strip())
  
    location_list.append(res)
  
print(location_list)

Output:

Job Locations List

 Attention geek! Strengthen your foundations with the Python Programming Foundation Course and learn the basics.  

To begin with, your interview preparations Enhance your Data Structures concepts with the Python DS Course. And to begin with your Machine Learning Journey, join the Machine Learning – Basic Level Course




My Personal Notes arrow_drop_up
Recommended Articles
Page :