
Let us see how to delete repeated lines from a file using Python's file handling. If the file is small, with only a few lines, deleting the repeated lines could be done manually, but for large files this is where Python comes to the rescue.

Eliminating repeated lines from a file in Python

Below are the methods that we will cover in this article:

Eliminating repeated lines from a file using a list
Eliminating repeated lines from a file using a set
Eliminating repeated lines from a file using Pandas

Input File:

For the sake of this example, let's create a file (Lorem_input.txt) with some Lorem Ipsum text in it, one sentence per line. The repeated line (Lorem ipsum dolor sit amet, consectetur adipiscing elit.) appears three times.

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
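For reproducibility, the sample input file can be generated with a short script. This is only a sketch: it writes a shortened version of the sample, and it writes to the current working directory rather than the Desktop path used in the examples below.

```python
# Build a small Lorem_input.txt with deliberately repeated lines
# (written to the current working directory for this sketch)
sample_lines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.",
    "Mauris vitae mollis magna.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",  # repeat
    "Aliquam laoreet vitae nisi quis rutrum.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",  # repeat
]
with open("Lorem_input.txt", "w") as f:
    f.write("\n".join(sample_lines) + "\n")
```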

Eliminating Repeated Lines from a File using a List

Open the input file using the open() function with mode 'r' (read), and open an output file with mode 'w' (write), where we will store the contents of the file after deleting all repeated lines. Use a list to keep track of all the lines seen so far, so that each new line can be compared against it. Now, iterate over each line of the input file: if the current line is already present in the lines seen so far, skip it; otherwise write it to the output file, and don't forget to add the current line to the list of lines seen so far. The modified contents will end up in the output file (Lorem_output.txt).

Python3

def remove_duplicates(input_file, output_file):
    # list of lines already written to the output file
    lines_seen = []
    with open(input_file, 'r') as in_file, \
         open(output_file, 'w') as out_file:
        for line in in_file:
            # write the line only the first time it is seen
            if line not in lines_seen:
                out_file.write(line)
                lines_seen.append(line)

# Usage: pass file paths, not open file objects
remove_duplicates('C:/Users/user/Desktop/Lorem_input.txt',
                  'C:/Users/user/Desktop/Lorem_output.txt')


Output file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.

Eliminating Repeated Lines from a File using a Set

Open the input file using the open() function with mode 'r' (read), and open an output file with mode 'w' (write), where we will store the contents of the file after deleting all repeated lines. Use a set to keep track of all the lines seen so far, so that each new line can be compared against it; a set makes the membership check fast even for large files. Now, iterate over each line of the input file: if the current line is already present in the lines seen so far, skip it; otherwise write it to the output file, and don't forget to add the current line to the set of lines seen so far and close the files.

Now let’s create an empty output file (Lorem_output.txt), where we will store the modified input file.

Python3

# creating the output file
outputFile = open('C:/Users/user/Desktop/Lorem_output.txt', "w")
 
# reading the input file
inputFile = open('C:/Users/user/Desktop/Lorem_input.txt', "r")
 
# holds lines already seen
lines_seen_so_far = set()
 
# iterating each line in the file
for line in inputFile:
 
    # checking if line is unique
    if line not in lines_seen_so_far:
 
        # write unique lines in output file
        outputFile.write(line)
 
        # adds unique lines to lines_seen_so_far
        lines_seen_so_far.add(line)       
 
# closing the file
inputFile.close()
outputFile.close()


Running the above Python script removes all the repeated lines from the input file and writes the result to the output file. After running the script, the output file (Lorem_output.txt) will look something like this:

Output file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.

Eliminating Repeated Lines from a File using Pandas

The function remove_duplicates reads the input file, represented by the input_file parameter, into a pandas DataFrame with one line per row. It then uses the drop_duplicates method to remove duplicate rows in place. The resulting unique lines are written to the file specified by the output_file parameter. Finally, the function is called with the input and output file paths to eliminate repeated lines from the input file and save the unique content to the output file (Lorem_output.txt).

Python3

import pandas as pd

def remove_duplicates(input_file, output_file):
    # read each line into a one-column DataFrame; reading the lines
    # directly avoids read_csv splitting the text on its commas
    with open(input_file, 'r') as f:
        df = pd.DataFrame({'line': f.read().splitlines()})
    # drop duplicate rows in place, keeping the first occurrence
    df.drop_duplicates(inplace=True)
    with open(output_file, 'w') as f:
        f.write('\n'.join(df['line']) + '\n')

# Usage: pass file paths, not open file objects
remove_duplicates('C:/Users/user/Desktop/Lorem_input.txt',
                  'C:/Users/user/Desktop/Lorem_output.txt')


Output file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus est neque, mollis vel massa vel, condimentum facilisis ipsum.
Mauris vitae mollis magna.
Aliquam laoreet vitae nisi quis rutrum.
Sed ut ligula nec enim consequat egestas vel a sapien.
Pellentesque sit amet euismod felis.
Pellentesque in nibh ultricies, convallis sapien id, sagittis odio.
Vivamus placerat ex sed ligula porttitor dignissim.
Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos.
Morbi posuere eget odio ut venenatis.
Nam lobortis bibendum maximus.
Donec venenatis sapien sed varius accumsan.
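Whichever method is used, the result can be checked with a small helper. This is only a sketch, and has_duplicate_lines is a hypothetical name, not part of any of the scripts above:

```python
def has_duplicate_lines(path):
    """Return True if any line of the file appears more than once."""
    with open(path) as f:
        lines = f.read().splitlines()
    # a set collapses repeats, so a size mismatch means duplicates exist
    return len(lines) != len(set(lines))
```

Running it on Lorem_output.txt should return False once the duplicates have been removed.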



Last Updated : 06 Aug, 2023