Efficient SQL: Deduplicate Rows Without Losing Data
Duplicates in your database can lead to bloated storage, inaccurate analytics, and unexpected behavior. But purging them blindly can mean losing valuable data. In this article, we’ll walk through practical SQL techniques to remove duplicate rows while preserving the most relevant version—commonly based on timestamps or other criteria like status or field priority. Whether you’re cleaning up a stale import or enforcing uniqueness retroactively, these queries will save you hours of pain and uncertainty.
1. Understanding Duplicates: What and Why
In SQL, duplicates aren’t always line-for-line clones. Very often, they’re rows that share some identifying fields (like email or user_id) but differ in others (like created_at or status). Your goal isn’t to blindly delete all but one — it’s to keep the best one.
Consider a typical table of login sessions:
CREATE TABLE user_sessions (
    id SERIAL PRIMARY KEY,
    user_id INT,
    session_token TEXT,
    created_at TIMESTAMP,
    status TEXT
);
You may find multiple entries for the same user_id due to bugs or load issues, but only one is needed — say, the latest or the one that’s ‘active’.
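Before deleting anything, it is worth measuring the problem. This minimal check uses only the columns defined above and lists the user_ids that have more than one session:

-- How many sessions does each affected user have?
SELECT user_id, COUNT(*) AS session_count
FROM user_sessions
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY session_count DESC;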
2. The ROW_NUMBER() Approach: Keep the Most Recent Row
The ROW_NUMBER() window function is ideal when you can define a clear sort order. For example, keeping the latest session per user:
WITH ranked_sessions AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id          -- restart numbering for each user
               ORDER BY created_at DESC      -- newest session gets rn = 1
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (
    SELECT id FROM ranked_sessions WHERE rn > 1   -- everything but the newest
);
The CTE (common table expression) ranks each user's rows by created_at, most recent first, and the outer DELETE removes every row whose rn is greater than 1, leaving only the newest session per user_id.
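Before running the DELETE for real, you can reuse the same CTE as a read-only preview. A minimal sketch that selects exactly the rows the DELETE above would remove:

-- Dry run: list the rows that would be deleted (rn > 1) without touching the table
WITH ranked_sessions AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
)
SELECT id, user_id, created_at, status
FROM ranked_sessions
WHERE rn > 1
ORDER BY user_id, created_at DESC;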
3. Deduplication with Custom Rules (e.g., Status Prioritization)
Sometimes a timestamp alone isn't enough. Suppose you want to prefer records where status = 'active' over others:
WITH prioritized AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY
                   CASE status               -- business priority first
                       WHEN 'active'  THEN 1
                       WHEN 'pending' THEN 2
                       ELSE 3
                   END,
                   created_at DESC           -- then recency as the tie-breaker
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (
    SELECT id FROM prioritized WHERE rn > 1
);
Here, we assign the highest priority to ‘active’ sessions by sorting the status via a CASE expression. We then break ties using the timestamp. This method offers better control when deduplication involves business rules.
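One refinement worth considering: if two rows for the same user share both the same status priority and the same created_at, ROW_NUMBER() breaks the tie arbitrarily. A sketch of one way to make the outcome deterministic, assuming the SERIAL id roughly tracks insertion order:

WITH prioritized AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY
                   CASE status
                       WHEN 'active'  THEN 1
                       WHEN 'pending' THEN 2
                       ELSE 3
                   END,
                   created_at DESC,
                   id DESC                   -- final, deterministic tie-breaker
           ) AS rn
    FROM user_sessions
)
DELETE FROM user_sessions
WHERE id IN (SELECT id FROM prioritized WHERE rn > 1);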
4. Creating a Clean Copy Instead of In-Place Deletion
If deleting feels dangerous (say, in production), you can safely extract the deduplicated rows into a new table. This is a common pattern during ETL or dataset preprocessing:
CREATE TABLE filtered_sessions AS
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
) AS ranked
WHERE rn = 1;   -- note: the helper column rn is carried into filtered_sessions
This approach keeps your original data intact while producing a deduplicated result. It’s especially useful if you plan to verify or transform the data before finally replacing it.
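Once filtered_sessions has been verified, one way to swap it into place is a pair of renames inside a transaction. A sketch assuming PostgreSQL-style DDL; indexes, constraints, and permissions on the original table still need to be recreated separately:

BEGIN;
-- Drop the helper column produced by the window-function subquery
ALTER TABLE filtered_sessions DROP COLUMN rn;
-- Keep the original table around under a new name in case a rollback is needed
ALTER TABLE user_sessions RENAME TO user_sessions_old;
ALTER TABLE filtered_sessions RENAME TO user_sessions;
COMMIT;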
5. Performance and Best Practices
While effective, deduplication queries using window functions can become resource-intensive in large datasets. Here are a few optimization tips:
- Index your partition and order fields: Creating an index on user_id and created_at can drastically speed up these operations (see the sketch after this list).
- Use smaller batches: When working with millions of records, break your work into ID ranges or time windows, as shown below.
- Verify row uniqueness logic: Over-deduplication can result in accidental loss. Ensure your partition columns fully define the duplicates.
- Test first with SELECT: Before executing DELETE, always test your query logic using SELECT to confirm what you're going to delete.
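To make the first two tips concrete, here is a minimal sketch. The index name and the user_id range are illustrative placeholders; batching by the partition key (user_id) keeps each user's rows inside a single batch, so the overall result matches a single large DELETE:

-- Composite index covering the PARTITION BY and ORDER BY columns used above
CREATE INDEX idx_user_sessions_user_created
    ON user_sessions (user_id, created_at);

-- One batch: deduplicate sessions for a slice of users, then repeat for the next range
WITH ranked_sessions AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at DESC
           ) AS rn
    FROM user_sessions
    WHERE user_id BETWEEN 1 AND 10000
)
DELETE FROM user_sessions
WHERE id IN (SELECT id FROM ranked_sessions WHERE rn > 1);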
Conclusion
Cleaning duplicates isn’t just about “removing copies” — it’s about enforcing data accuracy by keeping the right version. With SQL window functions like ROW_NUMBER() and the flexibility of CTEs, you gain the power to define exactly what constitutes a duplicate and which record should survive.
By applying the techniques laid out in this guide, you’ll write cleaner, more accurate reports, improve data quality for downstream systems, and reduce storage waste without blindly dropping rows. Use with care — and always against a backup or staging dataset first!