Check if two PDF documents are identical with Python
  • Last Updated : 08 Mar, 2021

Python is an interpreted, general-purpose programming language that supports both object-oriented and procedural programming. Its standard library includes modules such as difflib and hashlib. Comparing the hashes of two PDF files is a quick way to check whether the files are byte-for-byte identical.

Modules used:

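  • hashlib : A module that provides hash functions such as SHA-1; the program uses it to hash each file.
  • difflib : A module that contains classes and functions for comparing sequences of data.
  • SequenceMatcher : A difflib class that compares a pair of input sequences.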
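
As a small illustration of SequenceMatcher, the sketch below compares two short strings. The sample strings are made up for this example and have nothing to do with the PDF files.

```python
from difflib import SequenceMatcher

# Two made-up sample strings (assumptions for illustration only)
first = "hello world"
second = "hello there"

# ratio() returns a similarity score between 0.0 and 1.0;
# a score of 1.0 means the two sequences are identical.
similarity = SequenceMatcher(None, first, second).ratio()
same = SequenceMatcher(None, first, first).ratio()

print(similarity)
print(same)
```

A score below 1.0 only says that the sequences differ somewhere. For a plain identical-or-not check on whole files, comparing hashes is the simpler tool.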

Functions used:

  • hash_file(fileName1, fileName2): A user-defined function that computes and returns the SHA-1 hash of each of the two files.
  • object.hexdigest(): A method of a hash object that returns the digest as a string of hexadecimal digits.
  • fileObject.read(size): A method that reads and returns at most size bytes from a file; it returns an empty bytes object at the end of a file opened in binary mode.
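
The behaviour of read(size) and hexdigest() can be tried out without any PDF at hand. The following is a minimal sketch that uses io.BytesIO as an in-memory stand-in for a file opened in binary mode; the sample bytes are invented for the example.

```python
import hashlib
import io

# In-memory stand-in for a binary file (made-up data, an assumption)
fake_file = io.BytesIO(b"abcdefghij")

first_chunk = fake_file.read(4)   # returns at most 4 bytes
rest = fake_file.read(100)        # fewer bytes remain than requested
at_end = fake_file.read(4)        # empty bytes object at end of file

# hexdigest() returns the SHA-1 digest as a string of hex digits:
# 160 bits are written as 40 hexadecimal characters.
digest = hashlib.sha1(b"abcdefghij").hexdigest()

print(first_chunk, rest, at_end)
print(len(digest))
```

The empty result at the end of the data is what a chunked reading loop uses as its stop condition.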

Approach

  • Import the hashlib module.
  • Define a function that takes two arguments, one file name for each file.
  • Create two hashlib.sha1() objects, one per file.
  • Open both files in binary mode.
  • Read each file in small chunks and update its hash object with every chunk.
  • Return both digests with h1.hexdigest() and h2.hexdigest(); each SHA-1 digest is 160 bits long.
  • Call hash_file() to obtain the hashes of the two files.
  • Compare the two hashes and print the appropriate message.
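
The steps above can also be sketched with one helper per file, which avoids repeating the reading loop. This is only an illustrative variant, not the article's program: the names sha1_of_file and files_identical are hypothetical, and the demo uses throwaway files with made-up bytes instead of real PDFs.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024):
    """Return the SHA-1 hex digest of the file at path (hypothetical helper)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b"" at the end of the file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_identical(path1, path2):
    """Return True when both files have the same SHA-1 digest."""
    return sha1_of_file(path1) == sha1_of_file(path2)


# Demo with temporary files holding made-up bytes (not real PDFs)
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    path_c = os.path.join(tmp, "c.bin")
    samples = (
        (path_a, b"same bytes" * 500),
        (path_b, b"same bytes" * 500),
        (path_c, b"other bytes"),
    )
    for path, data in samples:
        with open(path, "wb") as f:
            f.write(data)
    same_result = files_identical(path_a, path_b)
    diff_result = files_identical(path_a, path_c)

print(same_result, diff_result)
```

Note that this check is byte-based: two PDFs that look the same on screen but differ in metadata or internal layout produce different hashes.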

Files in Use

compare pdf 1

Program:

Python3

import hashlib
from difflib import SequenceMatcher
  
  
def hash_file(fileName1, fileName2):

    # Create one SHA-1 hash object per file
    h1 = hashlib.sha1()
    h2 = hashlib.sha1()

    with open(fileName1, "rb") as file:

        # Read the file in 1024-byte chunks so that a large
        # file never has to fit in memory all at once.
        # read() returns b'' once the end of the file is reached.
        chunk = file.read(1024)
        while chunk != b'':
            h1.update(chunk)
            chunk = file.read(1024)

    with open(fileName2, "rb") as file:

        # Same chunked reading for the second file
        chunk = file.read(1024)
        while chunk != b'':
            h2.update(chunk)
            chunk = file.read(1024)

    # Each SHA-1 digest is 160 bits, i.e. 40 hex characters
    return h1.hexdigest(), h2.hexdigest()


# "pd2.pdf" is a placeholder name for the second PDF to compare
msg1, msg2 = hash_file("pd1.pdf", "pd2.pdf")

if msg1 != msg2:
    print("These files are not identical")
else:
    print("These files are identical")

Output

These files are not identical
