Efficient SQL: Deduplicate Rows Without Losing Data
Duplicates in your database can lead to bloated storage, inaccurate analytics, and unexpected behavior. But purging them blindly can mean losing valuable data. In this article, we’ll walk through practical SQL techniques to remove duplicate rows while preserving the most relevant version—commonly based on timestamps or other criteria like status or field priority. Whether you’re cleaning up a stale import or enforcing uniqueness retroactively, these queries will save you hours of pain and uncertainty.
1. Understanding Duplicates: What and Why
In SQL, duplicates aren’t always line-for-line clones. Very often, they’re rows that share some identifying fields (like email or user_id) but differ in others (like created_at or status). Your goal isn’t to blindly delete all but one — it’s to keep the best one.
Consider a typical table of login sessions:
CREATE TABLE user_sessions (
    id SERIAL PRIMARY KEY,
    user_id INT,
    session_token TEXT,
    created_at TIMESTAMP,
    status TEXT
);
You may find multiple entries for the same user_id due to bugs or load issues, but only one is needed — say, the latest or the one that’s ‘active’.
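Before deleting anything, it is worth measuring the problem. This minimal check uses only the columns defined above and lists the user_ids that have more than one session:

-- How many sessions does each affected user have?
SELECT user_id, COUNT(*) AS session_count
FROM user_sessions
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY session_count DESC;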
2. The ROW_NUMBER() Approach: Keep the Most Recent Row
The ROW_NUMBER() window function is ideal when you can define a clear sort order. For example, keeping the latest session per user:
WITH ranked_sessions AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id          -- restart numbering for each user
               ORDER BY created_at DESC      -- newest session gets rn = 1
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (
    SELECT id FROM ranked_sessions WHERE rn > 1   -- everything but the newest
);
The CTE (common table expression) ranks each user's rows by created_at, most recent first, and the outer DELETE removes every row whose rn is greater than 1, leaving only the newest session per user_id.
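Before running the DELETE for real, you can reuse the same CTE as a read-only preview. A minimal sketch that selects exactly the rows the DELETE above would remove:

-- Dry run: list the rows that would be deleted (rn > 1) without touching the table
WITH ranked_sessions AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
)
SELECT id, user_id, created_at, status
FROM ranked_sessions
WHERE rn > 1
ORDER BY user_id, created_at DESC;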
3. Deduplication with Custom Rules (e.g., Status Prioritization)
Sometimes a timestamp alone isn't enough. Suppose you want to prefer records where status = 'active' over others:
WITH prioritized AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY
                   CASE status               -- business priority first
                       WHEN 'active'  THEN 1
                       WHEN 'pending' THEN 2
                       ELSE 3
                   END,
                   created_at DESC           -- then recency as the tie-breaker
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (
    SELECT id FROM prioritized WHERE rn > 1
);
Here, we assign the highest priority to ‘active’ sessions by sorting the status via a CASE expression. We then break ties using the timestamp. This method offers better control when deduplication involves business rules.
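One refinement worth considering: if two rows for the same user share both the same status priority and the same created_at, ROW_NUMBER() breaks the tie arbitrarily. A sketch of one way to make the outcome deterministic, assuming the SERIAL id roughly tracks insertion order:

WITH prioritized AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY
                   CASE status
                       WHEN 'active'  THEN 1
                       WHEN 'pending' THEN 2
                       ELSE 3
                   END,
                   created_at DESC,
                   id DESC                   -- final, deterministic tie-breaker
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (SELECT id FROM prioritized WHERE rn > 1);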
4. Creating a Clean Copy Instead of In-Place Deletion
If deleting feels dangerous (say, in production), you can safely extract the deduplicated rows into a new table. This is a common pattern during ETL or dataset preprocessing:
CREATE TABLE filtered_sessions AS
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
) AS ranked
WHERE rn = 1;   -- note: the helper column rn is carried into filtered_sessions
This approach keeps your original data intact while producing a deduplicated result. It’s especially useful if you plan to verify or transform the data before finally replacing it.
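Once filtered_sessions has been verified, one way to swap it into place is a pair of renames inside a transaction. A sketch assuming PostgreSQL-style DDL; indexes, constraints, and permissions on the original table still need to be recreated separately:

BEGIN;
-- Drop the helper column produced by the window-function subquery
ALTER TABLE filtered_sessions DROP COLUMN rn;
-- Keep the original table around under a new name in case a rollback is needed
ALTER TABLE user_sessions RENAME TO user_sessions_old;
ALTER TABLE filtered_sessions RENAME TO user_sessions;
COMMIT;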
5. Performance and Best Practices
While effective, deduplication queries using window functions can become resource-intensive in large datasets. Here are a few optimization tips:
- Index your partition and order fields: Creating an index on user_id and created_at can drastically speed up these operations (see the sketch after this list).
- Use smaller batches: When working with millions of records, break your work into ID ranges or time windows, as shown below.
- Verify row uniqueness logic: Over-deduplication can result in accidental loss. Ensure your partition columns fully define the duplicates.
- Test first with SELECT: Before executing DELETE, always test your query logic using SELECT to confirm what you're going to delete.
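To make the first two tips concrete, here is a minimal sketch. The index name and the user_id range are illustrative placeholders; batching by the partition key (user_id) keeps each user's rows inside a single batch, so the overall result matches a single large DELETE:

-- Composite index covering the PARTITION BY and ORDER BY columns used above
CREATE INDEX idx_user_sessions_user_created
    ON user_sessions (user_id, created_at);

-- One batch: deduplicate sessions for a slice of users, then repeat for the next range
WITH ranked_sessions AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
    WHERE user_id BETWEEN 1 AND 10000
)
DELETE FROM user_sessions
WHERE id IN (SELECT id FROM ranked_sessions WHERE rn > 1);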
Conclusion
Cleaning duplicates isn’t just about “removing copies” — it’s about enforcing data accuracy by keeping the right version. With SQL window functions like ROW_NUMBER() and the flexibility of CTEs, you gain the power to define exactly what constitutes a duplicate and which record should survive.
By applying the techniques laid out in this guide, you’ll write cleaner, more accurate reports, improve data quality for downstream systems, and reduce storage waste without blindly dropping rows. Use with care — and always against a backup or staging dataset first!