Check if two PDF documents are identical with Python

Last Updated : 17 May, 2022

Python is an interpreted, general-purpose programming language that supports both object-oriented and procedural paradigms. Its standard library includes many modules that can be imported, such as difflib and hashlib.

Modules used:

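hashlib : A module that provides hash algorithms such as SHA-1. The program uses it to compute a hash of each file.
difflib : A module with helpers for comparing sequences of data.
SequenceMatcher : A difflib class that compares pairs of input sequences. The hash-based comparison in this article does not depend on it.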

Functions Used:

hash_file(fileName1, fileName2): A user-defined function that returns the SHA-1 hashes of the two files.
object.hexdigest(): Returns the digest of a hash object as a string of hexadecimal digits.
fileObject.read(size): Reads and returns at most size bytes from the file.
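
As a minimal sketch of how these calls fit together, the snippet below hashes an in-memory byte stream instead of a real PDF. The sample bytes and the names stream, hasher, and digest are assumptions made for illustration; they are not part of the original program.

```python
import hashlib
import io

# Illustrative stand-in for a file opened in binary mode (assumption: not a real PDF)
stream = io.BytesIO(b"%PDF-1.4 sample bytes")

hasher = hashlib.sha1()

# read(size) returns at most `size` bytes
first = stream.read(4)
hasher.update(first)

# read() with no argument returns everything that is left
rest = stream.read()
hasher.update(rest)

# hexdigest() returns the digest as a string of hexadecimal digits
digest = hasher.hexdigest()
print(first)
print(len(digest))  # a SHA-1 digest is 40 hex characters (160 bits)
```

Feeding the data to update() in several pieces gives the same digest as hashing it in one call, which is what makes chunked reading possible.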

Approach

Import the hashlib module.
Define a function that takes two arguments, one for each file path.
Create two hashlib.sha1() objects, one per file.
Open both files in binary mode.
Read each file in small chunks and pass every chunk to the matching hash object.
Return both digests with h1.hexdigest() and h2.hexdigest(); each SHA-1 digest is 160 bits long.
Call hash_file() to get the hashes of the two files.
Compare the two hashes and print the appropriate message.
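
The steps above can also be sketched with a single-file helper, which avoids writing the read loop twice. The helper names sha1_of_file and write_temp, the chunk_size parameter, and the temporary files are assumptions made for illustration; the temporary files stand in for real PDF paths.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024):
    """Return the SHA-1 hex digest of the file at `path` (hypothetical helper)."""
    hasher = hashlib.sha1()
    with open(path, "rb") as file:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: file.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def write_temp(data):
    """Write `data` to a temporary file and return its path (demo only)."""
    handle, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(handle, "wb") as file:
        file.write(data)
    return path


# Placeholder contents, not real PDFs
path_a = write_temp(b"same bytes" * 500)
path_b = write_temp(b"same bytes" * 500)
path_c = write_temp(b"other bytes" * 500)

same = sha1_of_file(path_a) == sha1_of_file(path_b)
different = sha1_of_file(path_a) != sha1_of_file(path_c)
print(same, different)

# Clean up the demo files
for path in (path_a, path_b, path_c):
    os.remove(path)
```

Hashing one file per call keeps the function reusable when more than two documents need to be compared.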

Files in Use

File 1

[Image: compare pdf 1]

File 2

Program:

Python3

import hashlib


def hash_file(fileName1, fileName2):

    # Create one SHA-1 hash object per file
    h1 = hashlib.sha1()
    h2 = hashlib.sha1()

    with open(fileName1, "rb") as file:

        # Read the file in 1024-byte chunks so that a
        # large file is never loaded into memory at once.
        # file.read() returns b'' at the end of the file.
        chunk = file.read(1024)
        while chunk != b'':
            h1.update(chunk)
            chunk = file.read(1024)

    with open(fileName2, "rb") as file:

        # Same chunked read for the second file
        chunk = file.read(1024)
        while chunk != b'':
            h2.update(chunk)
            chunk = file.read(1024)

    # Each hexdigest() is 40 hex characters (160 bits)
    return h1.hexdigest(), h2.hexdigest()


# The second file name is assumed to be pd2.pdf; the original
# listing passed "pd1.pdf " (with a trailing space) and "pd1.pdf"
msg1, msg2 = hash_file("pd1.pdf", "pd2.pdf")

if msg1 != msg2:
    print("These files are not identical")
else:
    print("These files are identical")

Output

These files are not identical
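
As a cross-check, the standard library's filecmp module can compare two files byte for byte without hashing. This is a different technique from the SHA-1 approach used in the program, shown here only as a sketch; the function name files_identical and the temporary files are assumptions made for illustration.

```python
import filecmp
import os
import tempfile


def files_identical(path_a, path_b):
    """Return True when both files contain exactly the same bytes (hypothetical helper)."""
    # shallow=False forces a content comparison instead of relying on os.stat() data
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo with placeholder files rather than real PDFs
with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "first.pdf")
    second = os.path.join(folder, "second.pdf")
    third = os.path.join(folder, "third.pdf")
    for path, data in ((first, b"abc"), (second, b"abc"), (third, b"abd")):
        with open(path, "wb") as file:
            file.write(data)

    match = files_identical(first, second)
    mismatch = files_identical(first, third)

print(match, mismatch)
```

Both the hash comparison and filecmp test byte-level equality only. Two PDFs that look the same on screen but differ in their stored bytes, for example in embedded metadata, are reported as not identical.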
