Compare two files using Hashing in Python
In this article, we will create a program that determines whether two files provided to it are the same. By "the same" we mean that their contents are identical, ignoring any metadata such as file names or timestamps. We will use cryptographic hashes for this purpose. A cryptographic hash function takes input data of any size and produces a fixed-size output (the hash, or digest) that is, for all practical purposes, unique to that particular data. We will use this property to compute a digest of each file's contents, and then compare the two digests to determine whether the files are the same.
Note: The probability of getting the same hash for two different data sets is extremely low. Good cryptographic hash functions are also designed so that a collision can only happen by accident: deliberately constructing two different inputs with the same hash is computationally infeasible.
We will use SHA256 (Secure Hash Algorithm 256) as the hash function in this program. SHA256 is highly resistant to collisions. We will use the sha256() function from the hashlib library, which provides an implementation of the algorithm in Python.
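As a quick illustration of the property described above, the following sketch hashes a few short byte strings with hashlib.sha256(). The sample strings are made up for the example.

```python
import hashlib

# Identical input always produces the identical digest.
first = hashlib.sha256(b"hello world").hexdigest()
second = hashlib.sha256(b"hello world").hexdigest()

# Changing a single character produces a different digest.
changed = hashlib.sha256(b"hello worle").hexdigest()

print(first == second)   # True
print(first == changed)  # False
print(len(first))        # 64 hexadecimal characters, i.e. 256 bits
```

Note that sha256() works on bytes, not on str, which is why files are later opened in binary mode.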
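The hashlib module is part of the Python standard library, so it is available in any normal Python installation and does not need to be installed separately. There is no need to run pip install hashlib; the package of that name on PyPI is an old backport for very early Python 2 releases and is not meant for current Python versions.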
Below is the implementation.
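A minimal sketch of such a program, consistent with the explanation given later in this article, might look like the following. The function name hashfile() comes from the article; the 64 KiB chunk size, the main() helper, the usage message, and the exact wording of the "different" message are assumptions made for this sketch.

```python
import hashlib
import sys

BUF_SIZE = 65536  # assumed chunk size (64 KiB); any reasonable size works


def hashfile(path):
    """Return the SHA256 digest of the file at `path` as a hex string."""
    sha256 = hashlib.sha256()
    # Open in binary mode and read one chunk at a time,
    # so memory use does not grow with the file size.
    with open(path, "rb") as f:
        while True:
            data = f.read(BUF_SIZE)
            if not data:  # empty bytes object means end of file
                break
            sha256.update(data)
    return sha256.hexdigest()


def main(argv):
    # Expect exactly two file paths after the script name.
    if len(argv) != 3:
        print(f"usage: {argv[0]} FILE1 FILE2")
        return 2

    f1_hash = hashfile(argv[1])
    f2_hash = hashfile(argv[2])

    if f1_hash == f2_hash:
        print("Both files are same")
        print(f"Hash: {f1_hash}")
    else:
        print("Files are different!")
        print(f"Hash of File 1: {f1_hash}")
        print(f"Hash of File 2: {f2_hash}")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Assuming the script is saved under a name such as compare.py (a hypothetical file name), it would be run as python compare.py file1.txt file2.txt.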
Text File 1:
Text File 2:
For Different Files as Input:
For Same Files as Input:
We take the file names as command-line arguments, so the file paths must be provided on the command line. The function hashfile() is defined to handle files of arbitrary size without running out of memory: it reads the file in fixed-size chunks and passes each chunk to sha256.update(), rather than loading the whole file at once. Passing the entire contents in a single update() call would produce exactly the same hash, but it would require reading the whole file into memory first. hashfile() returns the hash of the file in base 16 (hexadecimal format). We call the same function for both files and store their hashes in two separate variables, which we then compare. If both hashes are the same (meaning the files contain the same data), we output the message Both files are same followed by the hash. If they are different, we output a negative message and the hash of each file, so that the user can see the two different hashes.
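To see that feeding data to update() in pieces gives the same result as hashing it all at once, consider this small sketch. The sample data and the 64-byte piece size are arbitrary choices for the demonstration.

```python
import hashlib

data = b"some example bytes " * 1000  # made-up sample data

# Hash the whole input in one call.
whole = hashlib.sha256(data).hexdigest()

# Hash the same input fed in 64-byte pieces.
h = hashlib.sha256()
for i in range(0, len(data), 64):
    h.update(data[i:i + 64])
chunked = h.hexdigest()

print(whole == chunked)  # True
```

Chunked reading is therefore purely a memory optimisation. On Python 3.11 and later, hashlib.file_digest() offers a built-in way to hash an open binary file object, which may be preferable where that version can be assumed.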