How to Scrape Multiple Pages of a Website Using Python?

Last Updated : 03 Oct, 2023

Web Scraping is a method of extracting useful data from a website using computer programs without having to manually do it. This data can then be exported and categorically organized for various purposes. Some common places where Web Scraping finds its use are Market research & Analysis Websites, Price Comparison Tools, Search Engines, Data Collection for AI/ML projects, etc.

Let’s dive deep and scrape a website. In this article, we are going to take the GeeksforGeeks website and extract the titles of all the articles available on the Homepage using a Python script.

If you notice, there are thousands of articles on the website and to extract all of them, we will have to scrape through all pages so that we don’t miss out on any!

GeeksforGeeks Homepage

Scraping multiple Pages of a website Using Python

Now, there may be many instances where you want to get data from multiple pages of the same website, or from several different URLs, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defies all basic principles of automation. Duh!

To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages:

Looping through the numbered pages of the same website
Looping through a list of different website URLs

Approach:

The approach of the program is fairly simple, and it is easier to follow as a list of steps:

Import all the necessary libraries.
Set up the URL strings and make a connection using the requests library.
Parse the available data from the target page using the BeautifulSoup library’s parser.
Identify and extract the tags and classes on the target page that contain the information valuable to us.
Prototype it for one page, then use a loop to apply it to all the pages.
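As a rough illustration of the parsing and extraction steps, the sketch below works on a small inline HTML string instead of a live page, so it runs without a network connection. The sample markup and the helper name extract_titles are assumptions made for this example; only the div tag and the head class are taken from the scripts in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<html><body>
  <div class="head"><a href="/a">First article</a></div>
  <div class="head"><a href="/b">Second article</a></div>
  <div class="other">Not a title</div>
</body></html>
"""

def extract_titles(html, css_class="head"):
    """Return the text of every <div> that carries the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return [div.get_text(strip=True)
            for div in soup.find_all("div", class_=css_class)]

print(extract_titles(SAMPLE_HTML))
```

Keeping the parsing in its own function makes it possible to check the selector against saved HTML before pointing the script at a real website.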

Example 1: Looping through the page numbers

page numbers at the bottom of the GeeksforGeeks website

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through these pages and extract data from them as these pages have similar structures. For example:

notice the last section of the URL – page/4/

Here, we can see the page details at the end of the URL. Using this information we can easily create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string and iterating “i” till N) and scrape all the useful data from them. The following code will give you more clarity over how to scrape data by using a For Loop in Python.

Python

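import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/1/'

# Download the page and parse its HTML
req = requests.get(URL)
soup = bs(req.text, 'html.parser')

# Collect every <div> whose class is "head"; attrs must be a dict
titles = soup.find_all('div', attrs={'class': 'head'})

# The later scripts start reading article titles at index 4
print(titles[4].text)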

Output:

Output for the above code
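Before looping over several pages, it can help to generate the page URLs on their own and check them. The following is a minimal sketch of that idea; the helper name page_urls is hypothetical, and the base URL assumes the page/N/ pattern described above.

```python
# Base URL without the page number (assumed page/N/ pattern)
BASE_URL = "https://www.geeksforgeeks.org/page/"

def page_urls(base_url, last_page):
    """Build the URLs for pages 1 to last_page, inclusive."""
    return [f"{base_url}{page}/" for page in range(1, last_page + 1)]

for url in page_urls(BASE_URL, 3):
    print(url)
```

Note that range() excludes its upper bound, which is why the helper adds 1 to last_page.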

Now we can get the titles of the articles on several pages by wrapping the single-page script in a loop over the page numbers.

Python

import requests
from bs4 import BeautifulSoup as bs

# Base URL without the page number; the loop appends it
URL = 'https://www.geeksforgeeks.org/page/'

# The website has more than 5000 pages, so this example
# only takes the first 10
for page in range(1, 11):

    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # Matches 4 to 18 hold the 15 article titles on each page
    for i in range(4, 19):
        # Number the titles continuously across pages (1 to 150)
        print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)

Output:

Output for the above code

Note: The above code will fetch the first 10 pages from the website and scrape all the 150 titles of the articles that fall under those pages.
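The running number of a title depends on both the page and the title's position on that page: with 15 titles per page, the first title on page 2 is number 16. The sketch below isolates that arithmetic using plain strings in place of scraped titles; the helper name number_titles and the "N. title" output format are assumptions for illustration.

```python
def number_titles(titles, page, per_page=15):
    """Return 'N. title' strings, numbered continuously across pages."""
    # Page 1 starts at 1, page 2 at 16, page 3 at 31, and so on
    start = (page - 1) * per_page + 1
    # Slicing never raises, even if the page has fewer titles than expected
    return [f"{n}. {title}"
            for n, title in enumerate(titles[:per_page], start=start)]

print(number_titles(["Alpha", "Beta"], page=1))
print(number_titles(["Gamma", "Delta"], page=2))
```

Using a slice instead of a fixed index range also avoids an IndexError when a page returns fewer matches than usual.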

Example 2: Looping through a list of different URLs.

The above technique is absolutely wonderful, but what if you need to scrape different pages, and you don’t know their page numbers? You’ll need to scrape those different URLs one by one and manually code a script for every such webpage.

Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we can extract the titles of those pages without having to write code for each page. Here’s an example of how you can do it.

Python

import requests
from bs4 import BeautifulSoup as bs

# The pages to scrape, in the order they should be numbered
URLS = ['https://www.geeksforgeeks.org', 'https://www.geeksforgeeks.org/page/10/']

# enumerate() yields the position of each URL along with the URL itself
for index, url in enumerate(URLS):
    req = requests.get(url)
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        # Number the titles continuously across URLs (1 to 30)
        print(f"{(i - 3) + index * 15}" + titles[i].text)

Output:

Output for the above code
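When the list of URLs grows, one unreachable page should not abort the whole run. The following sketch shows one possible way to structure the loop so that failures are skipped. The function name scrape_all, the fake fetcher, and the example.com URLs are all hypothetical, and the fetching and parsing steps are passed in as arguments so the sketch runs without network access.

```python
def scrape_all(urls, fetch, parse):
    """Fetch and parse each URL, skipping the ones that fail."""
    results = {}
    for url in urls:
        try:
            html = fetch(url)
        except Exception as error:
            # One bad URL should not stop the whole run
            print(f"Skipping {url}: {error}")
            continue
        results[url] = parse(html)
    return results


# Stand-in pages so the sketch runs offline
FAKE_PAGES = {
    "https://example.com/a": "Title A",
    "https://example.com/b": "Title B",
}

def fake_fetch(url):
    return FAKE_PAGES[url]  # raises KeyError for unknown URLs

results = scrape_all(
    ["https://example.com/a", "https://example.com/missing", "https://example.com/b"],
    fetch=fake_fetch,
    parse=str.upper,
)
print(results)
```

In a real script, fetch could wrap requests.get(url, timeout=10).text and parse could wrap the find_all() call used in the examples above.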

How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with many requests within a very short time will most likely get your IP address blacklisted. To avoid this, we can crawl in short bursts separated by random pauses. Websites can easily identify a crawler by how much faster it requests pages than a human visitor would, so these pauses make our traffic look more like a human's. They also avoid unnecessary traffic and overloading of the website's servers. Win-Win!

Now, how do we control the crawling rate? It’s simple: by using two functions, randint() and sleep(), from the Python modules random and time respectively.

Python3

from random import randint 
from time import sleep  
  
print(randint(1,10))

Output

The randint() function chooses a random integer between the given lower and upper limits, both inclusive, in this case 1 and 10. Combining randint() with sleep() adds short, random breaks to the crawling rate of the program. The sleep() function suspends the execution of the program for the given number of seconds; here, that number is supplied by randint(). Use the code given below for reference.

Python3

from time import sleep
from random import randint

for i in range(0, 3):
    # selects a random integer in the given range (both ends included)
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')

Output

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds
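The pause can also be wrapped in a small reusable helper. The sketch below is one possible version; the name polite_pause is hypothetical, and it uses random.uniform() rather than randint() so that the delay can be a fraction of a second. The very short limits in the demo call are only there so the example finishes quickly.

```python
from random import uniform
from time import sleep

def polite_pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration in the given range and return it."""
    delay = uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Tiny limits for demonstration; a real crawl would use seconds, not hundredths
waited = polite_pause(0.01, 0.05)
print(f"I waited {waited:.2f} seconds")
```

Returning the delay makes it easy to log how long the scraper actually waited between requests.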

To see randint() and sleep() in action inside the scraper, refer to the code given below.

Python3

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# Base URL without the page number; the loop appends it
URL = 'https://www.geeksforgeeks.org/page/'

# The website has more than 5000 pages, so this example
# only takes the first 10
for page in range(1, 11):

    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # Matches 4 to 18 hold the 15 article titles on each page
    for i in range(4, 19):
        # Number the titles continuously across pages (1 to 150)
        print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)

    # Pause for 2 to 10 seconds before requesting the next page
    sleep(randint(2, 10))

Output:

The program has paused its execution and is waiting to resume
