How to Scrape all PDF files in a Website?

Prerequisites: Implementing Web Scraping in Python with BeautifulSoup

Web scraping is a method of extracting data from a website so that it can be used elsewhere. Python has several libraries and modules for web scraping. In this article, we’ll learn how to scrape PDF files from a website with the help of BeautifulSoup, one of the most widely used web scraping modules in Python, and the requests module for making GET requests. To get more information about each PDF file, we use the PyPDF2 module.

Step by Step Code –

Step 1: Import all the important modules and packages.

Python3

# for fetching the web page and the PDF files
import requests

# for parsing and traversing the HTML tree
from bs4 import BeautifulSoup

# for in-memory binary streams
import io

# for reading PDF metadata; PdfFileReader is the class name in
# PyPDF2 releases before 3.0 (later releases call it PdfReader)
from PyPDF2 import PdfFileReader

Step 2: Pass the URL to requests and build an HTML parser with the help of BeautifulSoup.

Python3

# website to scrape
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/amp/"

# fetch the page with a GET request
read = requests.get(url)

# raw HTML of the page, as bytes
html_content = read.content

# parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")

In the above code:

The page being scraped is https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/amp/.
The requests module is used to make the GET request.
read.content holds the full HTML of the response as bytes. Printing it outputs the source code of the web page.
soup holds the parsed HTML and is used to navigate the page.
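
To see what the parsed soup object offers without making a network request, the sketch below parses a small inline HTML document. The markup and the names sample_html, page_title, and hrefs are made up for illustration; only BeautifulSoup, find, and find_all come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for read.content
# (bytes, which is what requests returns for the body).
sample_html = b"""
<html><head><title>Demo page</title></head>
<body>
  <p>Intro with <a href="/docs/guide.html">a guide</a>
     and <a href="/docs/paper.html">a paper</a>.</p>
  <p>Second paragraph with <a name="no-href">an anchor without href</a>.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# The parsed tree can be navigated by tag name.
page_title = soup.title.string

# find() returns the first match (or None); find_all() returns every match.
first_paragraph = soup.find("p")

# href=True keeps only the anchors that actually carry an href attribute.
hrefs = [a["href"] for a in first_paragraph.find_all("a", href=True)]

print(page_title)
print(hrefs)
```

Filtering with href=True avoids a None value later on when a page contains anchors that are only used as named targets.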

Step 3: Traverse the page and collect the PDF links.

Python3

# create an empty set for collecting the PDF links
list_of_pdf = set()

# access the first p tag in the HTML
l = soup.find('p')

# access all the anchor tags inside that p tag (none if there is no p tag)
p = l.find_all('a') if l else []

# iterate through p to get all the href links
for link in p:
    href = link.get('href')

    # skip anchors that have no href or do not point to an .html page
    if not href or not href.endswith(".html"):
        continue

    # original HTML link
    print("links: ", href)
    print("\n")

    # convert the extension from .html to .pdf
    pdf_link = href[:-5] + ".pdf"
    print("converted pdf links: ", pdf_link)
    print("\n")

    # add the PDF link to the set
    list_of_pdf.add(pdf_link)

In the above code:

list_of_pdf is an empty set created for collecting the PDF links from the web page. A set is used because it stores each element only once, so duplicate links are dropped automatically.
The loop goes through the links and converts each .html extension to .pdf. This works because the PDF name and the HTML name differ only in the extension.
A list can be used instead of a set, with append in place of add, but duplicates then have to be removed separately.
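
The conversion above works on absolute links that end in .html. A slightly more defensive variant is sketched below, using only the standard library. The helper name to_pdf_url, the variable base, and the example.com URLs are hypothetical, and the rule that a PDF sits next to its HTML page under the same name is this article's assumption, not something every site guarantees.

```python
from urllib.parse import urljoin, urlsplit


def to_pdf_url(href, base_url):
    """Return a candidate PDF URL for href, or None if the link does not apply."""
    if not href:
        return None

    # Resolve relative links such as "guide.html" against the page URL.
    parts = urlsplit(urljoin(base_url, href))
    path = parts.path.lower()

    if path.endswith(".pdf"):
        # Already a direct PDF link: keep it unchanged.
        return parts.geturl()

    if path.endswith(".html"):
        # Swap the extension in the path only, and drop the query and fragment.
        pdf_path = parts.path[: -len(".html")] + ".pdf"
        return parts._replace(path=pdf_path, query="", fragment="").geturl()

    return None


base = "https://example.com/docs/"
hrefs = ["guide.html", "guide.html", "/files/a.pdf", None, "/about",
         "paper.html?ref=home#top"]

# The set comprehension removes the duplicate "guide.html" entry.
pdf_urls = {u for u in (to_pdf_url(h, base) for h in hrefs) if u}
print(sorted(pdf_urls))
```

Working on the parsed path instead of slicing the whole string keeps a query string or fragment from ending up in the middle of the generated file name.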

Step 4: Create an info function with the PyPDF2 module to get the required information about each PDF.

Python3

def info(pdf_path):

    # use the get method to fetch the PDF file
    response = requests.get(pdf_path)

    # response.content is bytes; io.BytesIO wraps it in a
    # file-like object that PdfFileReader can read and seek
    with io.BytesIO(response.content) as f:

        # initialize the PDF reader
        pdf = PdfFileReader(f)

        # document metadata and page count
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

In the above code:

The info function fetches a PDF and prints its metadata and page count.
io.BytesIO(response.content) – response.content is the body of the response as raw bytes, while PdfFileReader expects a file-like object that it can read from and seek in. io.BytesIO wraps the bytes in such an in-memory binary stream.
PyPDF2 provides several other functions for accessing different data in a PDF.
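
As a small illustration of the io.BytesIO point, the sketch below wraps raw bytes in an in-memory stream and checks for the %PDF- marker that PDF files normally start with before any parsing is attempted. This is useful here because a guessed .pdf link may return an HTML error page instead of a PDF. The helper name open_pdf_stream and the sample byte strings are invented for the example; real input would be response.content. The check is strict and assumes the marker is at the very start of the data.

```python
import io

PDF_MAGIC = b"%PDF-"


def open_pdf_stream(data):
    """Return a rewound in-memory stream for data, or None if it is not a PDF."""
    stream = io.BytesIO(data)

    # Read the first few bytes like a file, then compare with the PDF marker.
    if stream.read(len(PDF_MAGIC)) != PDF_MAGIC:
        stream.close()
        return None

    # Rewind so that a PDF reader starts from the beginning of the data.
    stream.seek(0)
    return stream


# Hypothetical response bodies: one PDF-like, one HTML error page.
good = open_pdf_stream(b"%PDF-1.7\n% rest of the file")
bad = open_pdf_stream(b"<html><body>404 Not Found</body></html>")

print(good is not None, bad is None)
```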

Note: Refer to Working with PDF files in Python for detailed information.
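
PdfFileReader, getDocumentInfo(), and getNumPages() are the names used by PyPDF2 releases before 3.0. In later releases, and in the successor package pypdf, the equivalents are PdfReader, the metadata attribute, and len(reader.pages). A sketch of the same lookup with those names follows, assuming pypdf is installed. The helper names read_pdf_summary and format_summary are hypothetical.

```python
import io

FIELDS = ("author", "creator", "producer", "subject", "title")


def read_pdf_summary(pdf_bytes):
    """Collect metadata and the page count from PDF bytes with the pypdf API."""
    # Imported lazily so this module still loads where pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))

    # metadata can be None when the PDF carries no document information.
    meta = reader.metadata
    summary = {
        name: (getattr(meta, name, None) if meta is not None else None)
        for name in FIELDS
    }
    summary["pages"] = len(reader.pages)
    return summary


def format_summary(pdf_url, summary):
    """Build the same kind of report that the info function prints."""
    lines = [f"Information about {pdf_url}:"]
    for name in FIELDS:
        lines.append(f"{name.capitalize()}: {summary.get(name)}")
    lines.append(f"Number of pages: {summary.get('pages')}")
    return "\n".join(lines)


# Formatting can be tried without a real PDF by passing a hand-made summary.
print(format_summary("https://example.com/a.pdf",
                     {"author": "Ada", "title": "Notes", "pages": 3}))
```

With a real file, read_pdf_summary(response.content) would produce the dictionary that format_summary expects.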

Python3

# print the information of every collected PDF in the console
for i in list_of_pdf:
    info(i)
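
The loop above only prints metadata. If the goal is to keep the files as well, each response body can also be written to disk. The following is a minimal sketch and not part of the original walkthrough: the helper names filename_from_url and save_pdf, the default downloads directory, and the example.com URL are assumptions, and the bytes passed in would normally be response.content from requests.get.

```python
import os
import tempfile
from urllib.parse import unquote, urlsplit


def filename_from_url(pdf_url):
    """Derive a local file name from the last path segment of the URL."""
    name = os.path.basename(unquote(urlsplit(pdf_url).path))
    # Fall back to a fixed name when the URL path has no file part.
    return name or "download.pdf"


def save_pdf(content, pdf_url, directory="downloads"):
    """Write the PDF bytes into directory and return the path of the new file."""
    os.makedirs(directory, exist_ok=True)
    target = os.path.join(directory, filename_from_url(pdf_url))
    with open(target, "wb") as f:
        f.write(content)
    return target


# Demonstration with made-up bytes and a temporary directory, so no network
# access is needed and nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_pdf(b"%PDF-1.4 demo bytes",
                          "https://example.com/files/my%20report.pdf?x=1",
                          tmp_dir)
    saved_name = os.path.basename(saved_path)
    with open(saved_path, "rb") as fh:
        saved_bytes = fh.read()

print(saved_name)
```

Inside the loop this could be called as save_pdf(requests.get(i).content, i). Files with the same name coming from different URLs would overwrite each other in this sketch.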

Complete Code:

Python3

import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader

url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/amp/"
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")

list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a') if l else []

for link in p:
    href = link.get('href')

    # skip anchors that have no href or do not point to an .html page
    if not href or not href.endswith(".html"):
        continue

    pdf_link = href[:-5] + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)

def info(pdf_path):
    response = requests.get(pdf_path)

    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

for i in list_of_pdf:
    info(i)
