
How to use Scrapy to parse PDF pages online?

Last Updated : 18 Jul, 2021

Prerequisites: Scrapy, PyPDF2, urllib

In this article, we will use Scrapy to parse any online PDF without downloading it onto the system. To do that, we rely on the PDF parsing library of Python known as PyPDF2.

PyPDF2 is a PDF parsing library for Python. It provides reader and writer classes, among others, whose methods can be used to read, modify, and parse PDFs, whether stored locally or fetched from the web.

The constructors of PyPDF2's reader classes require a file-like stream of the PDF. Since all we have is the URL of the PDF file, we use Python's urllib module: calling urlopen() on the URL returns a response whose bytes we can read and wrap in an in-memory stream with io.BytesIO, which is then passed to the PyPDF2 reader.
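The stream-wrapping step can be illustrated on its own (a minimal sketch using only the standard library; the byte string here is a stand-in for the content that urlopen(URL).read() would return, not a real PDF):

```python
import io

# stand-in for the raw bytes returned by urllib.request.urlopen(URL).read()
raw = b"%PDF-1.4 ..."

# io.BytesIO wraps the bytes in a seekable, file-like stream,
# which is the kind of object the PyPDF2 reader constructors expect
stream = io.BytesIO(raw)

print(stream.read(8))  # b'%PDF-1.4' -- supports ordinary file operations
stream.seek(0)         # rewind before handing the stream to the reader
```

This is why no file ever touches the disk: the downloaded bytes live only in memory.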

Example 1: We will perform some basic operations: counting the pages and checking whether the file is encrypted. For this, we parse the page, extract the PDF's URL from the response, and then read the page count and encryption status via the reader's numPages and isEncrypted attributes.

The Scrapy spider crawls the web page to find the PDF file to be scraped. The URL of that PDF is extracted into the variable URL, urllib is used to open it, and a PyPDF2 reader object is created by passing a stream of the downloaded bytes to the reader's constructor.

Python3




import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # URL of the index page that links to the pdf file. This is the
    # operating system book solution of author Abraham Silberschatz
    start_urls = ['https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the URL of the first pdf
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and wrapping the downloaded
        # bytes in a stream for the PyPDF2 reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfFileReader(io.BytesIO(File.read()))

        # accessing some properties of the pdf file
        print("Number of pages: " + str(reader.numPages))
        print("Is file encrypted? " + str(reader.isEncrypted))


Output:

The number of pages of the PDF and whether it is encrypted or not.

Example 2: In this example, we will extract the text of the PDF file (parsing) using the same PyPDF2 reader object, iterating over its pages, and print the extracted data to the terminal.

Python3




import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # URL of the index page that links to the pdf file.
    start_urls = ['https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the URL of the first pdf
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and wrapping the downloaded
        # bytes in a stream for the PyPDF2 reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfFileReader(io.BytesIO(File.read()))

        # extracting the text of every page
        data = ""
        for page in reader.pages:
            data += page.extractText()

        print(data)


Output:

The parsed text of the PDF.


