How to Remove All Duplicate Rows Except One in MariaDB?

Duplicate rows in database tables waste storage, distort query results, and can hurt performance. Removing the duplicates while keeping one instance of each unique row is important for maintaining a clean database. In this article, we will explore several methods, with examples, for doing this in MariaDB.

How to Remove All Duplicate Rows Except One Row?

When working with a database, it is common to encounter duplicate rows. These duplicates can arise from various sources such as data entry errors, incomplete normalization, or multiple inserts.

Removing all duplicate rows except one can be challenging, but it is essential for database maintenance and data accuracy. The methods used to remove all duplicate rows except one are as follows:

Using GROUP BY and MIN/MAX Functions
Using ROW_NUMBER() Window Function
Using a Temporary Table

Let’s set up an environment to remove all duplicate rows

To understand how to remove all duplicate rows except one in MariaDB, we need a table on which to run the queries. Here we will use a table called employees with the columns emp_id, emp_name, and emp_email. Because emp_id is the primary key and is always unique, two rows count as duplicates when they share the same emp_name and emp_email.

CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    emp_email VARCHAR(100)
);

-- Adding sample records to the employees table
INSERT INTO employees (emp_id, emp_name, emp_email) VALUES
(1, 'John Doe', 'john@example.com'),
(2, 'Jane Smith', 'jane@example.com'),
(3, 'John Doe', 'john@example.com'),
(4, 'Alice Johnson', 'alice@example.com'),
(5, 'Bob Williams', 'bob@example.com'),
(6, 'Jane Smith', 'jane@example.com');

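Output:

+--------+---------------+-------------------+
| emp_id | emp_name      | emp_email         |
+--------+---------------+-------------------+
|      1 | John Doe      | john@example.com  |
|      2 | Jane Smith    | jane@example.com  |
|      3 | John Doe      | john@example.com  |
|      4 | Alice Johnson | alice@example.com |
|      5 | Bob Williams  | bob@example.com   |
|      6 | Jane Smith    | jane@example.com  |
+--------+---------------+-------------------+

Explanation: Our table has been created with six rows. Rows 3 and 6 repeat the emp_name and emp_email of rows 1 and 2.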
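Before deleting anything, it can help to confirm which values are duplicated. The query below is a minimal sketch that assumes, as in the rest of this article, that two rows are duplicates when emp_name and emp_email both match; the alias copies is only an illustrative name.

```sql
-- List each (emp_name, emp_email) pair that occurs more than once,
-- together with the number of rows that share it.
SELECT emp_name, emp_email, COUNT(*) AS copies
FROM employees
GROUP BY emp_name, emp_email
HAVING COUNT(*) > 1;
```

With the sample data, this should return John Doe and Jane Smith, each with two copies.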

1. Using GROUP BY and MIN/MAX Functions

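One common approach is to keep one representative row for each set of duplicates, chosen with an aggregate function such as MIN or MAX over a GROUP BY. The example below gets the same result with a self-join: among the rows that share emp_name and emp_email, it keeps the one with the smallest emp_id (the MIN) and deletes the rest. This method assumes that no other relevant columns differentiate the duplicate rows.

Example:

-- Removing duplicate rows with a self-join, keeping the smallest emp_id in each group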
DELETE e1 FROM employees e1
JOIN employees e2 ON e1.emp_id > e2.emp_id
AND e1.emp_name = e2.emp_name
AND e1.emp_email = e2.emp_email;
-- Show the contents of the employees table after removing duplicates
SELECT * FROM employees;

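Output:

+--------+---------------+-------------------+
| emp_id | emp_name      | emp_email         |
+--------+---------------+-------------------+
|      1 | John Doe      | john@example.com  |
|      2 | Jane Smith    | jane@example.com  |
|      4 | Alice Johnson | alice@example.com |
|      5 | Bob Williams  | bob@example.com   |
+--------+---------------+-------------------+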

Explanation: The query above removes duplicate rows from the employees table by joining each row e1 to every other row e2 that has the same emp_name and emp_email. If e1 has a higher emp_id than a matching e2, then e1 is the duplicate and is deleted, so only the row with the lowest emp_id in each group is kept.

The query uses a self-join to identify and delete duplicate rows. After executing the DELETE statement, the SELECT statement is used to display the contents of the employees table to show that duplicates have been removed.
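The same cleanup can also be written literally with GROUP BY and MIN. The following is a sketch rather than the only way to do it; the aliases keep_id and keepers are illustrative names, and the extra derived table is there so that the subquery does not read directly from the table being deleted from, which older versions reject.

```sql
-- Keep the row with the smallest emp_id in each (emp_name, emp_email) group
-- and delete every other row.
DELETE FROM employees
WHERE emp_id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MIN(emp_id) AS keep_id
        FROM employees
        GROUP BY emp_name, emp_email
    ) AS keepers
);
```

Replacing MIN with MAX keeps the most recently numbered row instead. One difference from the self-join is worth noting: GROUP BY places rows whose emp_name or emp_email is NULL into a single group, while the = comparison in the self-join never matches NULL values, so the two queries can behave differently on tables that contain NULLs.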

2. Using ROW_NUMBER() Window Function

MariaDB has supported window functions since version 10.2, including ROW_NUMBER(), which assigns a unique sequential integer to each row within a partition of a result set. We can take advantage of this function to remove duplicate rows.

Example:

-- Removing duplicate rows using ROW_NUMBER() window function
DELETE FROM employees
WHERE emp_id IN (
    SELECT emp_id
    FROM (
        SELECT emp_id,
               ROW_NUMBER() OVER (PARTITION BY emp_name, emp_email ORDER BY emp_id) AS rn
        FROM employees
    ) AS sub
    WHERE rn > 1
);
-- Show the contents of the employees table after removing duplicates
SELECT * FROM employees;

Output:

+--------+---------------+-------------------+
| emp_id | emp_name      | emp_email         |
+--------+---------------+-------------------+
|      1 | John Doe      | john@example.com  |
|      2 | Jane Smith    | jane@example.com  |
|      4 | Alice Johnson | alice@example.com |
|      5 | Bob Williams  | bob@example.com   |
+--------+---------------+-------------------+

Explanation: The query above removes duplicate rows from the employees table using the ROW_NUMBER() window function. It assigns a sequential integer (rn) to each row within a partition defined by emp_name and emp_email, ordered by emp_id, so the first row of every partition gets rn = 1 and each later copy gets 2, 3, and so on.

The innermost SELECT statement numbers the rows within each partition. The middle SELECT statement keeps only the emp_id values of rows with rn > 1, which are the extra copies. The outer DELETE statement then deletes exactly those rows by emp_id. The delete must match on emp_id: matching on (emp_name, emp_email) would also remove the first row of every duplicated group, leaving no copy at all.

Finally, the SELECT statement displays the contents of the employees table after removing duplicates.
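To preview which rows a ROW_NUMBER()-based cleanup should remove, the same ranking can be run as a plain SELECT first. This is a sketch under the article's assumption that the copy with the smallest emp_id is the one to keep; the alias ranked is an illustrative name.

```sql
-- Dry run: list the extra copies (rn > 1) without deleting anything.
SELECT emp_id, emp_name, emp_email
FROM (
    SELECT emp_id, emp_name, emp_email,
           ROW_NUMBER() OVER (PARTITION BY emp_name, emp_email ORDER BY emp_id) AS rn
    FROM employees
) AS ranked
WHERE rn > 1;
```

On the original six sample rows, this should list the rows with emp_id 3 and 6.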

3. Using a Temporary Table

Another method involves using a temporary table to store the unique rows and then replacing the original table with the temporary one.

Example:

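-- Removing duplicate rows using a temporary table
-- emp_id is unique, so SELECT DISTINCT * would keep every row;
-- group by the duplicated columns and keep the smallest emp_id instead
CREATE TABLE temp_employees AS
SELECT MIN(emp_id) AS emp_id, emp_name, emp_email
FROM employees
GROUP BY emp_name, emp_email;

DROP TABLE employees;

ALTER TABLE temp_employees RENAME TO employees;
-- CREATE TABLE ... AS SELECT does not copy keys, so restore the primary key
ALTER TABLE employees ADD PRIMARY KEY (emp_id);
-- Show the contents of the employees table after removing duplicates
SELECT * FROM employees;

Output (row order may vary):

+--------+---------------+-------------------+
| emp_id | emp_name      | emp_email         |
+--------+---------------+-------------------+
|      1 | John Doe      | john@example.com  |
|      2 | Jane Smith    | jane@example.com  |
|      4 | Alice Johnson | alice@example.com |
|      5 | Bob Williams  | bob@example.com   |
+--------+---------------+-------------------+

Explanation: The query above creates a new table called temp_employees that holds one row for each (emp_name, emp_email) pair, keeping the smallest emp_id. A plain SELECT DISTINCT * would not remove anything here, because the unique emp_id makes every row distinct. The script then drops the original employees table and renames temp_employees to employees, replacing the original table with one that contains only unique rows. Because CREATE TABLE ... AS SELECT does not copy indexes or constraints, the primary key is added back afterwards. Finally, the SELECT statement displays the contents of the employees table after removing duplicates.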
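A more cautious variant of the table-swap approach keeps the original data until the result has been checked. The sketch below assumes the hypothetical table names employees_dedup and employees_old; CREATE TABLE ... LIKE copies the column definitions and indexes, and RENAME TABLE swaps both names in a single statement.

```sql
-- Create an empty copy with the same columns and primary key.
CREATE TABLE employees_dedup LIKE employees;

-- Copy one row per (emp_name, emp_email) pair, keeping the smallest emp_id.
INSERT INTO employees_dedup (emp_id, emp_name, emp_email)
SELECT MIN(emp_id), emp_name, emp_email
FROM employees
GROUP BY emp_name, emp_email;

-- Swap the tables; the original data stays available as employees_old.
RENAME TABLE employees TO employees_old,
             employees_dedup TO employees;

-- Once the new table has been verified, the old one can be removed:
-- DROP TABLE employees_old;
```

Keeping employees_old around makes it possible to compare row counts or roll back before the original data is dropped for good.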

Conclusion

In this article, we explored multiple methods to remove duplicate rows except one in MariaDB. With the help of features such as GROUP BY, window functions like ROW_NUMBER(), and temporary tables, you can efficiently eliminate duplicates while preserving one instance of each unique row. Whether you are cleaning data or enforcing data integrity, these methods provide effective ways to manage duplicate records in MariaDB tables.
