In this article, we would be creating a program that would determine, whether two files provided to it are the same or not. By the same means that their contents are the same or not (excluding any metadata). We would be using Cryptographic Hashes for this purpose. A cryptographic hash function is a function that takes in input data and produces a statistically unique output, which is unique to that particular set of data. We would be using this property of Cryptographic hash functions to identify the contents of two files, and then would compare that to determine whether they are same or not.
Note: The probability of getting the same has for two different data set is very very low. And even then the good cryptographic hash functions are made so that hash collisions are accidental rather than intentional.
We would be using SHA256 (Secure hash algorithm 256) as a hash function in this program. SHA256 is very resistant to collisions. We would be using
sha256() to use the implementation of the function in python.
hashlib module is preinstalled in most python distributions. If it doesn’t exists in your environment, then you can get the module by running the following command in the command–
pip install hashlib
Below is the implementation.
Text File 1:
Text File 2:
For Different Files as Input:
For Same Files as Input:
We take in input the filenames (via command-line argument), therefore the file paths must be provided from the command line. The function
hashfile() is defined, to deal with arbitrary file sizes without running out of memory. As if we pass all the data in a file to the
sha256.update() function, it doesn’t hash the data properly leading to inconsistency in the results.
hashfile() returns the hash of the file in base16 (hexadecimal format). We call the same function for both the files and store their hashes in two separate variables. After which we use the hashes to compare them. If both the hashes are same (meaning the files contain same data), we output the message
Both files are same and then the hash. If they are different we output a negative message, and the hash of each file (so that the user can visually see the different hashes).
- How to compare values in two Pandas Dataframes?
- How to compare two NumPy arrays?
- Python | Merge two text files
- Python Program to merge two files into a third file
- Compare Pandas Dataframes using DataComPy
- Downloading files from web using Python
- Rename multiple files using Python
- Check if directory contains files using python
- Uploading files on Google Drive using Python
- Create temporary files and directories using Python-tempfile
- How to Delete files in Python using send2trash module?
- Joining Excel Data from Multiple files using Python Pandas
- Working with wav files in Python using Pydub
- Python program to find files having a particular extension using RegEx
- Deleting Files in HDFS using Python Snakebite
- Get list of files and folders in Google Drive storage using Python
- Creating Files in HDFS using Python Snakebite
- Python | Decimal compare() method
- Python | sympy.compare() method
- Python | Compare tuples
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to firstname.lastname@example.org. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.