Deleting Duplicate Files Using Python

In this article, we are going to use a concept called hashing to identify unique files and delete duplicate files using Python.

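Modules required:

No third-party packages are needed. tkinter, os, hashlib and pathlib all ship with Python's standard library, so there is nothing to install with pip. (On some Linux distributions tkinter is packaged separately, for example as python3-tk.)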

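Approach:

Hash the contents of every file under a chosen folder, remember the first file seen for each hash, and delete any later file whose hash has already been seen.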

Stepwise Implementation

Step 1: Import the tkinter, os, hashlib and pathlib modules.






from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path

Step 2: Calling Tk().withdraw() hides tkinter's main GUI window, because we only want the file dialog for selecting the folder. The call askdirectory(title="Select a folder") pops up a dialog box through which we can select a folder, and returns the chosen path as a string.




Tk().withdraw()
file_path = askdirectory(title="Select a folder")

Step 3: Next we need to list all the files inside our root folder. To do that we use the os module: os.walk() takes the path of the root folder as an argument and walks through that folder and each of its subdirectories. It returns a generator that yields one tuple of three elements per directory visited. The first element is the path to that folder, the second is a list of the subfolders inside it, and the third is a list of the files inside it.






list_of_files = os.walk(file_path)
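
To make the shape of what os.walk() yields more concrete, here is a minimal sketch. The helper name list_all_files and the temporary folder layout are assumptions made only for this illustration; they are not part of the script in this article.

```python
import os
import tempfile
from pathlib import Path


def list_all_files(root_folder):
    """Return the full path of every file under root_folder, recursively."""
    found = []
    # os.walk lazily yields one (dirpath, dirnames, filenames) tuple per directory.
    for dirpath, dirnames, filenames in os.walk(root_folder):
        for name in filenames:
            found.append(Path(dirpath) / name)
    return found


# Build a small throwaway tree so the example can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("hello")
    (Path(tmp) / "sub" / "b.txt").write_text("world")
    walked_names = sorted(p.name for p in list_all_files(tmp))
    print(walked_names)
```

Because os.walk() is lazy, nothing is read from disk until the for loop starts consuming it.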

Step 4: Our goal is to visit every file in the main directory and in each subdirectory, which is why we run a for loop over all the files. We open each file and convert its contents into a hash string, stored in a variable called Hash_file. To open a file we first need its full path, so we build it with os.path.join() from the os module. We then open the file in binary read mode ('rb'), pass its contents to hashlib.md5(), and call the hexdigest() method to get the hash as a hexadecimal string.




for root, folders, files in list_of_files:
    for file in files:
        file_path = Path(os.path.join(root, file))
        # Read in binary mode; the with block closes the file afterwards.
        with open(file_path, 'rb') as f:
            Hash_file = hashlib.md5(f.read()).hexdigest()
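
Reading a whole file with read() loads it into memory at once, which may be a problem for very large files. As a possible alternative, the sketch below feeds the file to the hash object in fixed-size chunks. The function name hash_file_in_chunks and the default chunk size are assumptions chosen for illustration only.

```python
import hashlib
import os
import tempfile


def hash_file_in_chunks(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Write a small temporary file and hash it with a deliberately tiny chunk size.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"hello world")
chunked_digest = hash_file_in_chunks(demo_path, chunk_size=4)
os.remove(demo_path)

# The chunked digest should match hashing the same bytes in one call.
print(chunked_digest == hashlib.md5(b"hello world").hexdigest())
```

The digest does not depend on the chunk size, so this can stand in for the one-shot read without changing which files are treated as duplicates.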

Step 5: To detect duplicate files we define an empty dictionary before the loops start. Each key in this dictionary is a file hash and each value is the path of the first file found with that hash. If a file's hash is already in the unique_files dictionary, we have found a duplicate, so we delete it using the os.remove() function. Otherwise we add the hash and path to the dictionary.




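# Defined once, before the loops from Step 4.
unique_files = dict()

# This check runs inside the inner loop, right after Hash_file is computed.
if Hash_file not in unique_files:
    unique_files[Hash_file] = file_path
else:
    os.remove(file_path)
    print(f"{file_path} has been deleted")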

Below is the full implementation:




# Importing required libraries.
from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path
  
# We don't want the GUI window of
# tkinter to be appearing on our screen
Tk().withdraw()
  
# Dialog box for selecting a folder.
file_path = askdirectory(title="Select a folder")
  
# Listing out all the files
# inside our root folder.
list_of_files = os.walk(file_path)
  
# In order to detect the duplicate
# files we are going to define an empty dictionary.
unique_files = dict()
  
for root, folders, files in list_of_files:
  
    # Running a for loop on all the files
    for file in files:
  
        # Finding complete file path
        file_path = Path(os.path.join(root, file))
  
        # Converting all the content of
        # our file into md5 hash.
        with open(file_path, 'rb') as f:
            Hash_file = hashlib.md5(f.read()).hexdigest()
  
        # If file hash has already
        # been added we'll simply delete that file
        if Hash_file not in unique_files:
            unique_files[Hash_file] = file_path
        else:
            os.remove(file_path)
            print(f"{file_path} has been deleted")

Output:

For every duplicate it deletes, the script prints the file's path followed by "has been deleted".
