
How to use Scrapy to parse PDF pages online?

Last Updated : 18 Jul, 2021

Prerequisites: Scrapy, PyPDF2, urllib

In this article, we will use Scrapy to parse any online PDF without downloading it onto the system. To do that, we rely on the PDF parsing library of Python known as PyPDF2.

PyPDF2 is a PDF parsing library for Python. It provides reader and writer classes, among others, whose methods can be used to read, modify, and parse PDFs, whether stored locally or fetched from the web.

The constructors of PyPDF2's reader classes require a file-like stream of the PDF. Since all we have is the URL of the PDF file, we use Python's urllib module: calling urlopen() on the URL returns a response whose bytes we can read and wrap in an in-memory stream with io.BytesIO, which is then passed to the PyPDF2 reader.
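The stream-wrapping step can be illustrated on its own (a minimal sketch using only the standard library; the byte string here is a stand-in for the content that urlopen(URL).read() would return, not a real PDF):

```python
import io

# stand-in for the raw bytes returned by urllib.request.urlopen(URL).read()
raw = b"%PDF-1.4 ..."

# io.BytesIO wraps the bytes in a seekable, file-like stream,
# which is the kind of object the PyPDF2 reader constructors expect
stream = io.BytesIO(raw)

print(stream.read(8))  # b'%PDF-1.4' -- supports ordinary file operations
stream.seek(0)         # rewind before handing the stream to the reader
```

This is why no file ever touches the disk: the downloaded bytes live only in memory.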

Example 1: We will perform some basic operations: counting the pages and checking whether the file is encrypted. For this, we parse the page, extract the PDF's URL from the response, and then read the page count and encryption status via the reader's numPages and isEncrypted attributes.

The Scrapy spider crawls the web page to find the PDF file to be scraped. The URL of that PDF is extracted into the variable URL, urllib is used to open it, and a PyPDF2 reader object is created by passing a stream of the downloaded bytes to the reader's constructor.

Python3




import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # URL of the index page that links to the pdf file. This is the
    # operating system book solution of author Abraham Silberschatz
    start_urls = ['https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the URL of the first pdf
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and wrapping the downloaded
        # bytes in a stream for the PyPDF2 reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfFileReader(io.BytesIO(File.read()))

        # accessing some properties of the pdf file
        print("Number of pages: " + str(reader.numPages))
        print("Is file encrypted? " + str(reader.isEncrypted))


Output:

The number of pages of the PDF and whether it is encrypted or not.

Example 2: In this example, we will extract the text of the PDF file (parsing) using the same PyPDF2 reader object, iterating over its pages, and print the extracted data to the terminal.

Python3




import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # URL of the index page that links to the pdf file.
    start_urls = ['https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the URL of the first pdf
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and wrapping the downloaded
        # bytes in a stream for the PyPDF2 reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfFileReader(io.BytesIO(File.read()))

        # extracting the text of every page
        data = ""
        for page in reader.pages:
            data += page.extractText()

        print(data)


Output:

The parsed text of the PDF.


