Open In App

How to Remove All Duplicate Rows Except One in PostgreSQL?

When working with databases, we often have duplicate rows in the table. These duplicates can arise due to various reasons such as erroneous data entry or data migration processes. Removing duplicates while retaining one instance of each is a frequent requirement in database maintenance tasks.

So to have the unique records, we need to delete the duplicate records except one in the database. So PostgreSQL provides different approaches to deleting duplicate rows. In this article, we will discuss the below-mentioned approaches on how to delete duplicate rows except one.



How to Remove All Duplicate Rows Except One in PostgreSQL

Removing duplicate rows in PostgreSQL while preserving one instance each is essential for data cleanliness. Two efficient methods are DELETE USING and subqueries with row_number(). These approaches maintain data integrity and simplify database management.

Setup an Environment

Creating a student table:



Let us first create a student table having columns s_id , s_name , dept_no using the below query:

create table student (s_id integer , s_name varchar(20) , dept_no integer )

Inserting values into the student table:

After creating a table, let us insert some duplicate values in the table using the below queries:

insert into student values ( 1 , ' Mohit ' , 10 );
insert into student values ( 2, ' Mohit ' , 10 );
insert into student values ( 3, ' Mohit ' , 10 );
insert into student values ( 4, ' Pavan ' , 20);
insert into student values ( 5, ' Pavan ' , 20);
insert into student values ( 6, ' Harish ' , 30);

We inserted 6 records in the student table where we have some duplicate records. To delete those duplicate records while preserving one record, you can use below mentioned approaches:

1. By DELETE USING Statement

The DELETE USING is a feature available in PostgreSQL that allows us to delete duplicate rows except one.

Syntax:

DELETE FROM table_name alias_name1 USING table_name alias_name2 WHERE conditions 

Example: The student table which we have created and inserted some records has some duplicate records. So let us delete the duplicate records while retaining one of them by combining the DELETE and USING statements. The below is the student having duplicate records.

Student table with duplicate rows

In the above table, there are 3 duplicate records with s_name = ‘ Mohit ‘ and two duplicate records with s_name = ‘ Pavan ‘. So we need to delete those while retaining one of those duplicate records. So we can use the below query:

DELETE from student s1 USING student s2 where s1.s_id > s2.s_id and s1.s_name = s2.s_name

Explanation: The above query has two aliases s1 and s2 for the same table student where we check s1.s_id is > s2.s_id and s1.s_name = s2.s_name which means the records having s_id’s greater than the previous duplicate records having equal s_name’s are deleted. So, the rows having s_id 3 and 2 are checked with 1 since their s_id is greater than 1 and have equal s_name, that’s why it deletes the 2nd row and 3rd row. In the same way, the s_id 5 is checked with 4, since 5 is greater than 4 and has equal s_names. Hence it deletes the 5th row. After executing the above query, our table looks like this:

Output:

Student table with unique rows after deleting duplicate rows

Now we can see that the query deleted two duplicate records with s_name = ‘ Mohit ‘ and one duplicate record with s_name = ‘ Pavan ‘ which was our requirement.

2. Using Subqueries with row_number( )

In this method, we are using sub-queries with row_number( ) to delete the duplicates except one. It follows three steps given below:

  1. Partitioning by duplicate values
  2. Assigning row numbers
  3. Filtering duplicate rows

Syntax:

DELETE FROM table_name WHERE col1 in 
(SELECT col1 FROM
(SELECT col1,row_number() OVER(partition by col2 order by col1 asc) AS row_num FROM table_name)alias_name
WHERE conditions);

Example: Let us take the student table having some duplicate records

student table with duplicate records

Query:

DELETE FROM student WHERE s_id in 
(SELECT s_id FROM
(SELECT s_id,row_number() OVER(partition by s_name order by s_id asc) AS row_num FROM student)s WHERE s.row_num>1);

Output:

Student table with unique rows after removing duplicate rows

Explanation: The subquery in the above query selects the s_id and it assigns a row number to each row using the row_number() function by partitioning the student table based on the s_name column and sorting the data in ascending order using order by clause. This subquery will return the duplicates except for the first record. After getting s_id from the sub query, it is selected by the second inner query which is then deleted by the outer DELETE FROM statement.

Conclusion

Managing data often involves the need to manipulate records while preserving some important records. In PostgreSQL, the ability to delete all records except one unique record is valuable for maintaining integrity and efficiency. This can be done by a special feature available in PostgreSQL which is combining DELETE and USING keywords and the other way is using DELETE statement with subqueries and conditions. But it is important to keep an eye while deleting records as it may lead to the loss of important data records.


Article Tags :