In this article, we will write a Python script to find duplicate files in the file system or inside a particular folder.
Method 1: Using Filecmp
The Python module filecmp offers functions to compare directories and files. The cmp() function compares two files and returns True if they appear identical, and False otherwise.
Syntax: filecmp.cmp(f1, f2, shallow)
Parameters:
- f1: Name of one file
- f2: Name of another file to be compared
- shallow: Controls whether file contents are compared.
Note: The default value is True, which means only the os.stat() signatures of the files (type, size, modification time) are compared, not their contents. Pass shallow=False to compare the contents byte by byte.
Return Type: Boolean value (True if the files are the same, otherwise False)
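The difference between signature and content comparison can be tried in isolation. This is a minimal sketch using temporary files created on the spot; the file names and contents are illustrative.

```python
# Minimal filecmp.cmp() sketch; files are created in a temporary
# directory purely for illustration.
import filecmp
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
(tmp / "a.txt").write_text("same content")
(tmp / "b.txt").write_text("same content")
(tmp / "c.txt").write_text("different content")

# shallow=False forces a byte-by-byte comparison of the contents
print(filecmp.cmp(tmp / "a.txt", tmp / "b.txt", shallow=False))  # True
print(filecmp.cmp(tmp / "a.txt", tmp / "c.txt", shallow=False))  # False
```

With shallow=True (the default), two files written at the same instant with the same size could compare equal without their contents ever being read, which is why the duplicate finder below passes shallow=False.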
Example:
We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.
# Importing libraries
import os
from pathlib import Path
from filecmp import cmp

# Directory containing the documents
DATA_DIR = Path('/path/to/directory')
files = sorted(os.listdir(DATA_DIR))

# List of classes of documents with the same content
duplicateFiles = []

# Comparison of the documents
for file_x in files:
    if_dupl = False
    for class_ in duplicateFiles:
        # Comparing file contents using cmp()
        # class_[0] represents a class having the same content
        if_dupl = cmp(
            DATA_DIR / file_x,
            DATA_DIR / class_[0],
            shallow=False
        )
        if if_dupl:
            class_.append(file_x)
            break
    if not if_dupl:
        duplicateFiles.append([file_x])

# Print results
print(duplicateFiles)
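Before running the script on real files, the grouping logic can be checked on in-memory data. This is a sketch with a hypothetical name-to-content mapping (the names and contents below are illustrative): each inner list collects names whose content equals that of the list's first member.

```python
# The same "classes of equal content" grouping, applied to an
# illustrative in-memory mapping instead of real files.
contents = {
    "text_1.txt": "A", "text_2.txt": "B", "text_3.txt": "A",
    "text_4.txt": "A", "text_5.txt": "B",
}

classes = []
for name in sorted(contents):
    for class_ in classes:
        # Compare against the representative (first member) of each class
        if contents[name] == contents[class_[0]]:
            class_.append(name)
            break
    else:
        # No matching class found: start a new one
        classes.append([name])

print(classes)
# [['text_1.txt', 'text_3.txt', 'text_4.txt'], ['text_2.txt', 'text_5.txt']]
```

Each file is compared only against one representative per class, so a file never needs to be checked against every other file individually.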
Output:
Method 2: Using Hashing and Dictionary
To start, this script takes a single folder or a list of folders and traverses them to find duplicate files. It computes a hash for every file in the folder, regardless of its name, and stores the results in a dictionary with the hash as the key and the path to the file as the value.
- We have to import os, sys, hashlib libraries.
- Then the script iterates over the folders and calls the FindDuplicate() function to find duplicates.
Syntax: FindDuplicate(Path)
Parameter: Path: Path to the folder containing the files
Return Type: Dictionary
- The FindDuplicate() function takes the path to a folder and calls the Hash_File() function on each file.
- The Hash_File() function returns the hexadecimal digest (hexdigest) of that file.
Syntax: Hash_File(path)
Parameters: path: Path of the file
Return Type: Hexdigest of the file
- This MD5 hash is then used as a dictionary key, with the file name as its value. After this, the FindDuplicate() function returns a dictionary in which some keys have multiple values, i.e. duplicate files.
- Then the Join_Dictionary() function is called, which merges the dictionary returned by FindDuplicate() into an initially empty dictionary.
Syntax: Join_Dictionary(dict1, dict2)
Parameters: dict1, dict2: Two different dictionaries
Return Type: None (dict1 is updated in place)
- After this, we print the list of files having the same content using results.
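The hexdigest step can be tried in isolation before looking at the full script. hashlib.md5().hexdigest() returns the hash as a hexadecimal string, and identical content always yields an identical digest (the byte strings below are illustrative):

```python
# Equal content produces an equal MD5 hexdigest, which is what lets
# the script group duplicate files under one dictionary key.
import hashlib

h1 = hashlib.md5(b"same content").hexdigest()
h2 = hashlib.md5(b"same content").hexdigest()
h3 = hashlib.md5(b"different content").hexdigest()

print(h1 == h2)  # True: equal content, equal digest
print(h1 == h3)  # False: different content, different digest
print(len(h1))   # 32: an MD5 hexdigest is 32 hex characters
```

Note that MD5 is fine for spotting duplicates but is not collision-resistant for security purposes; for adversarial inputs, hashlib.sha256() can be substituted with no other changes to the script.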
Example:
We’re assuming here for example purposes that “text_1.txt”, “text_3.txt”, “text_4.txt” are files having the same content, and “text_2.txt”, “text_5.txt” are files having the same content.
# Importing libraries
import os
from pathlib import Path
import hashlib


def FindDuplicate(SupFolder):
    # Duplic is in the format {hash: [names]}
    Duplic = {}
    for file_name in sorted(os.listdir(SupFolder)):
        # Path to the file
        path = os.path.join(SupFolder, file_name)
        # Calculate the hash
        file_hash = Hash_File(path)
        # Add or append the file name to Duplic
        if file_hash in Duplic:
            Duplic[file_hash].append(file_name)
        else:
            Duplic[file_hash] = [file_name]
    return Duplic


# Joins dictionaries
def Join_Dictionary(dict_1, dict_2):
    for key in dict_2.keys():
        # Checks for an existing key
        if key in dict_1:
            # If present, append
            dict_1[key] = dict_1[key] + dict_2[key]
        else:
            # Otherwise, store
            dict_1[key] = dict_2[key]


# Calculates the MD5 hash of a file
# Returns the hex digest of the file
def Hash_File(path):
    hasher = hashlib.md5()
    blocksize = 65536
    # Read the file in chunks so large files do not fill memory
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()


Duplic = {}
folders = [Path('path/to/directory')]
for folder in folders:
    # Find the duplicated files in each folder
    # and merge them into Duplic
    Join_Dictionary(Duplic, FindDuplicate(folder))

# results stores the Duplic values with more than one file
results = list(filter(lambda x: len(x) > 1, Duplic.values()))
if len(results) > 0:
    for result in results:
        for sub_result in result:
            print('\t\t%s' % sub_result)
else:
    print('No duplicates found.')
Output: