
How to Scrape all PDF files in a Website?

Prerequisites: Implementing Web Scraping in Python with BeautifulSoup

Web scraping is a method of extracting data from a website so that it can be used elsewhere. Python has several libraries and modules for web scraping. In this article, we'll learn how to scrape the PDF files from a website with the help of BeautifulSoup, a widely used web scraping module for Python, and the requests module for the GET requests. To get more information about each PDF file, we also use the PyPDF2 module.



Step by Step Code –

Step 1: Import all the important modules and packages.






# for fetching the web page and the PDF files
import requests
 
# for tree traversal scraping in webpage
from bs4 import BeautifulSoup
 
# for input and output operations
import io
 
# for reading information about the PDFs
# (PdfFileReader is the older PyPDF2 interface; PyPDF2 3.0+ uses PdfReader instead)
from PyPDF2 import PdfFileReader

Step 2: Pass the URL to requests and build an HTML parser with the help of BeautifulSoup.




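# website to scrape
# (placeholder: the original URL is not shown here; substitute the page you want to scrape)
url = "https://example.com/"
 
# fetch the page with the requests get method
read = requests.get(url)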
 
# full html content
html_content = read.content
 
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")

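In the above code: requests.get(url) downloads the page, read.content holds the raw HTML as bytes, and BeautifulSoup parses those bytes with Python's built-in html.parser so that the tags can be searched.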
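To see what the parser gives back without making a network request, here is a minimal sketch that feeds BeautifulSoup a small hand-written HTML string and lists the href of every anchor. The markup and the function name list_hrefs are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <p>Intro with <a href="guide.html">a guide</a> and
     <a href="/docs/report.pdf">a report</a>.</p>
  <p><a name="top">anchor without href</a></p>
</body></html>
"""


def list_hrefs(html):
    """Return the href of every <a> tag that has one, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchors that actually carry an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(list_hrefs(SAMPLE_HTML))
```

Unlike soup.find('p'), which looks only at the first paragraph, find_all searches the whole document, which may be closer to what is wanted when the PDF links are spread across the page.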

Step 3: Traverse the page and collect the links to the PDFs.




# create an empty set for the PDF links (a set drops duplicates)
list_of_pdf = set()
 
# access the first p tag in the html
paragraph = soup.find('p')
 
# access all the anchor tags inside that p tag
# (soup.find returns None when the page has no p tag)
anchors = paragraph.find_all('a') if paragraph else []
 
# iterate through the anchors to get all the href links
for link in anchors:
    href = link.get('href')
     
    # skip anchors with no href or one that does not end in .html
    if not href or not href.endswith(".html"):
        continue
     
    # original html link
    print("link: ", href)
    print("\n")
     
    # convert the extension from .html to .pdf
    pdf_link = href[:-5] + ".pdf"
     
    # converted to .pdf
    print("converted pdf link: ", pdf_link)
    print("\n")
     
    # add the pdf link to the set
    list_of_pdf.add(pdf_link)

 
 


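In the above code: soup.find('p') locates the first p tag, find_all('a') collects the anchor tags inside it, and each href has its trailing .html replaced by .pdf before it is added to the set. This assumes the site publishes a PDF copy of every linked page at the same address.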
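Slicing off the last five characters only works when every href ends in exactly ".html" and is already an absolute URL. A more defensive version can be sketched with the standard library, still assuming the site serves a PDF twin at the same path. The function name to_pdf_url and the example.com addresses are illustrative, not part of the original code.

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit


def to_pdf_url(page_url, href):
    """Resolve href against page_url and swap a .html/.htm suffix for .pdf.

    Returns None when the link does not point at an HTML page.
    """
    # turn relative links such as "guide.html" into absolute URLs
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)

    # split the path into ("/docs/guide", ".html")
    root, ext = posixpath.splitext(parts.path)
    if ext.lower() not in (".html", ".htm"):
        return None

    # rebuild the URL with the new path; the query and fragment are dropped
    return urlunsplit((parts.scheme, parts.netloc, root + ".pdf", "", ""))


if __name__ == "__main__":
    base = "https://example.com/docs/index.html"  # placeholder page URL
    for link in ["guide.html", "/a/b.htm#section", "logo.png"]:
        print(link, "->", to_pdf_url(base, link))
```

Inside the loop, a None result can simply be skipped instead of being added to list_of_pdf.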

Step 4: Create an info function that uses the PyPDF2 module to get the required information about each PDF.




def info(pdf_path):
 
    # download the PDF file
    response = requests.get(pdf_path)
 
    # wrap the downloaded bytes in an in-memory binary
    # stream so PyPDF2 can read them like a file
    with io.BytesIO(response.content) as f:
 
        # initialize the PDF reader
        pdf = PdfFileReader(f)
 
        # document metadata and page count
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
     
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
     
    return information

 
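In the above code: the PDF is downloaded with requests.get, its bytes are wrapped in io.BytesIO so that PdfFileReader can read them like a file, and the document metadata and the number of pages are printed.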
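Because the PDF links in Step 3 are guessed by swapping file extensions, a request may return an HTML error page rather than a PDF, and the reader would then fail. One possible guard, sketched here as an assumption rather than as part of the original code, is to check for the "%PDF-" marker that PDF files start with. The name looks_like_pdf is hypothetical.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data):
    """Heuristic check that downloaded bytes are a PDF.

    Only the first kilobyte is inspected, and leading whitespace is
    tolerated before the "%PDF-" marker.
    """
    return data[:1024].lstrip().startswith(PDF_MAGIC)


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n..."))               # a PDF header
    print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))  # an HTML page
```

In info, the check could run on response.content before the bytes are handed to the reader, returning early when it fails.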

Note: Refer to Working with PDF files in Python for detailed information.




# print the information for every PDF link in the set
for i in list_of_pdf:
    info(i)
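The loop above prints metadata but does not keep the files. If the goal is to store each PDF locally, the downloaded bytes can be written to disk. This is a minimal sketch under that assumption; filename_from_url, save_pdf and the example URL are hypothetical names, not part of the original code.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.pdf"):
    """Derive a local file name from the last path segment of a URL."""
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    # fall back when the URL has no usable last segment
    if name in ("", ".", ".."):
        return default
    return name


def save_pdf(data, url, directory):
    """Write PDF bytes into directory and return the resulting path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename_from_url(url)
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    # stand-in bytes; in the article's flow this would be response.content
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_pdf(b"%PDF-1.4\n", "https://example.com/docs/report.pdf", tmp)
        print(saved.name, saved.stat().st_size)
```

With requests, save_pdf(response.content, pdf_path, "pdfs") could be called next to info(pdf_path) for each link in list_of_pdf.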

Complete Code:




import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader
 
 
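# placeholder: the original URL is not shown here; substitute the page you want to scrape
url = "https://example.com/"
read = requests.get(url)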
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")
 
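list_of_pdf = set()
paragraph = soup.find('p')
anchors = paragraph.find_all('a') if paragraph else []
 
for link in anchors:
    href = link.get('href')
    # skip anchors with no href or one that does not end in .html
    if not href or not href.endswith(".html"):
        continue
    pdf_link = href[:-5] + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)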
 
def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
 
for i in list_of_pdf:
    info(i)


