Pagination – xpath for a crawler in Python


In this article, we are going to learn about pagination using XPath for a crawler in Python.

This article covers how to extract information from websites where the data is spread across multiple pages. To visit every page in turn, we use the concept of pagination (or paging), which lets the crawler navigate through them.

We’ll use Selenium and BeautifulSoup for this. Selenium is a free, open-source framework used for automated testing; its main components are Selenium WebDriver (which we will use) and Selenium Grid, and scripts can be written in many languages such as Java, Python, and C#. BeautifulSoup is a Python library for extracting data from HTML and XML files; it works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. Execute the commands below to install Selenium and BeautifulSoup in your Python environment.

pip install selenium
pip install beautifulsoup4

Prerequisites:

  1. Selenium
  2. BeautifulSoup
  3. Basics of Python
  4. Chrome driver (download the Chrome driver for Selenium)

What is pagination?

Pagination is the process of dividing a document into separate pages, i.e., a series of pages that each hold a portion of the data. These pages usually have their own URLs, differing from one another only slightly, so we have to visit the URLs one by one and scrape the data they contain until the last page. There are generally two situations where pagination is needed: first, where pages have a next-page button, and second, where there is no next button but an infinite scroll that keeps loading new content.
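As a quick illustration of the first pattern, here is a minimal sketch that builds the numbered-page URLs used later in this article (the range limit is arbitrary):

Python

# A minimal sketch of the numbered-page pattern: only the page
# number in the URL changes, so we can generate each URL in turn.
base = "https://www.geeksforgeeks.org/category/programming-language/python/"
for page in range(1, 4):
    url = "{}page/{}".format(base, page)
    print(url)  # each URL would be fetched and scraped in turn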

So now let’s see with the help of an example, how to implement pagination by XPath in Python.

Implementation of pagination by XPath in Python and selenium:

We will scrape article links and titles from the GeeksforGeeks website, applying pagination as we go. As a result, we'll have a set of links and titles of articles.

Step 1: Firstly we will import all the required modules used for pagination.

Python




# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup


Let us understand the usage of imported files and how they are useful in our problem statement.

webdriver: Selenium WebDriver is a tool used to automate web-based application testing and verify that an application behaves as expected. It also enables the execution of cross-browser tests.
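For example, here is a minimal sketch of starting a Chrome session and loading a page. Note that recent Selenium 4 releases pass the driver path through a Service object rather than the plain path string used elsewhere in this article; the chromedriver path below is a placeholder:

Python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path: point this at your downloaded chromedriver binary.
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://www.geeksforgeeks.org/")
print(driver.title)  # confirm the page loaded
driver.quit()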

By: Using the By class, we can locate elements within a document. It offers different strategies for finding an element in the parsed page: CLASS_NAME, CSS_SELECTOR, ID, LINK_TEXT, NAME, PARTIAL_LINK_TEXT, TAG_NAME, and XPATH.
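For instance, assuming a driver instance like the one created above, a page can be queried with several of these strategies (the selectors below are illustrative placeholders and must match the actual page markup):

Python

from selenium.webdriver.common.by import By

# Each call uses a different locating strategy; the selectors are
# illustrative only, not guaranteed to exist on every page.
driver.find_element(By.ID, "searchBox")              # by id attribute
driver.find_element(By.CLASS_NAME, "articles-list")  # by class attribute
driver.find_element(By.TAG_NAME, "a")                # first <a> tag
driver.find_elements(By.XPATH, "//div/a")            # all matching nodes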

Keys: Using the Keys class imported from selenium.webdriver.common.keys, we can send special keys, similar to how we press keys on our keyboards.

sleep: Using the Python time sleep function, we can pause the execution of a program for a specified time in seconds.
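A small sketch combining Keys and sleep, assuming a driver instance from above and a hypothetical search-box id:

Python

from selenium.webdriver.common.keys import Keys
from time import sleep

# "searchBox" is a hypothetical id used only for illustration.
box = driver.find_element(By.ID, "searchBox")
box.send_keys("pagination")  # type text as a user would
box.send_keys(Keys.ENTER)    # press the Enter key
sleep(2)                     # pause so the results page can load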

Step 2: Initialise the driver and access the website where you need to apply pagination:

Python




# Download the Chrome driver and enter its path inside the string.
PATH = ""
driver = webdriver.Chrome(PATH)
driver.get("https://www.geeksforgeeks.org/"
           "category/programming-language/python/")


 

Step 3: XPath stands for XML Path Language. It makes it possible to select nodes or node sets in XML documents with path expressions; an expression looks much like a file-system path in Windows. To copy the XPath, inspect the webpage, select the link, then right-click on the tag and choose Copy XPath, as shown in the image below.

Press copy XPath

The copied XPath may be anchored on an id; we can replace that with a class instead and enter the relevant class that is available.
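For example, DevTools often copies an id-anchored XPath; if the id is not stable across pages, we can anchor on a class instead (the id-based path below is hypothetical, while the class-based one is the selector used later in this article):

Python

# Hypothetical XPath as copied from DevTools, anchored on an id:
#   //*[@id="posts"]/div[2]/div/a
# The same elements located via a stable class attribute instead:
links = driver.find_elements(By.XPATH,
    '//*[@class="articles-list"]/div/div/div/a')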

Step 4: Finding the correct XPath is an important task. If we have to iterate over different pages, we must check manually which part of the path changes between them. If we only want to scrape a single element, we can directly copy and paste the path.

Python




# Add the path of your Chrome driver,
# replacing the string below.
PATH = ("C:/Users/dvgrg/OneDrive/Desktop/dexterio "
        "internship/workflow/Current/version 1.0/fabric "
        "data/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(PATH)
driver.get("https://www.geeksforgeeks.org/"
           "category/programming-language/python/")


Step 5: In this step, we first initialize the variable page to 1 and declare two lists to store the links and titles; both lists are also stored in a dictionary named result. We run a while loop to build the URL for each page and, inside it, a for loop that uses XPath to find every article anchor, appending each link and title to its respective list and to the result dictionary. The loop terminates after the 9th page.

Python




page = 1
# We'll append links to the list called link
link = []
# We'll append titles to the list called title
title = []
# We'll collect both in the dictionary called result
result = {"link": [], "title": []}

while page:
    print(page)

    url = ("https://www.geeksforgeeks.org/"
           "category/programming-language/python/page/{}".format(page))
    driver.get(url)

    for element in driver.find_elements(By.XPATH,
            '//*[@class="articles-list"]/div/div/div/a'):
        # For the link we use the href attribute
        li = element.get_attribute('href')
        # For the title we use the title attribute
        ti = element.get_attribute('title')

        link.append(li)
        title.append(ti)
        # We can print the lists as they grow
        print(link, title)

        result['link'].append(li)
        result['title'].append(ti)

    page += 1
    # Set whatever limit you need to stop the loop
    if page == 9:
        break


Complete code:

Python3




# Importing required modules
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from time import sleep
from bs4 import BeautifulSoup

# Add the path of your Chrome driver,
# replacing the string below.
PATH = ("C:/Users/dvgrg/OneDrive/Desktop/dexterio "
        "internship/workflow/Current/version 1.0/fabric "
        "data/chromedriver_win32/chromedriver.exe")
driver = webdriver.Chrome(PATH)
driver.get("https://www.geeksforgeeks.org/"
           "category/programming-language/python/")

page = 1
# We'll append links to the list called link
link = []
# We'll append titles to the list called title
title = []
# We'll collect both in the dictionary called result
result = {"link": [], "title": []}

while page:
    print(page)

    url = ("https://www.geeksforgeeks.org/"
           "category/programming-language/python/page/{}".format(page))
    driver.get(url)

    for element in driver.find_elements(By.XPATH,
            '//*[@class="articles-list"]/div/div/div/a'):
        # For the link we use the href attribute
        li = element.get_attribute('href')
        # For the title we use the title attribute
        ti = element.get_attribute('title')

        link.append(li)
        title.append(ti)
        # We can print the lists as they grow
        print(link, title)

        result['link'].append(li)
        result['title'].append(ti)

    page += 1
    # Set whatever limit you need to stop the loop
    if page == 9:
        break


Output:

The output prints the page number followed by the growing lists of links and titles.
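If you want to keep the scraped data, here is a minimal sketch (using only the standard library) that writes the result dictionary to a CSV file:

Python

import csv

# Write the scraped links and titles to a CSV file, one row per article.
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["link", "title"])
    writer.writerows(zip(result["link"], result["title"]))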


