SQL Query Recipes for Cleaning Dirty Data
Dirty data is a persistent challenge in every data system. Whether it’s duplicate records, null values, inconsistent formatting, or typos, bad data can wreak havoc on analytics, reporting, and application behavior. Thankfully, SQL provides powerful tools for detecting and correcting these issues. In this post, we’ll walk through practical SQL query recipes to clean and prepare your data for real-world use.
1. Detecting and Removing Duplicate Records
Duplicate rows can creep in during data ingestion, migrations, or manual entry. Let’s identify and remove them while preserving one unique instance.
-- Find duplicate rows based on one or more columns
SELECT name, email, COUNT(*) AS duplicate_count
FROM users
GROUP BY name, email
HAVING COUNT(*) > 1;
-- Remove duplicates, keeping the lowest ID
DELETE FROM users
WHERE id NOT IN (
    SELECT MIN(id)
    FROM users
    GROUP BY name, email
);
This method identifies duplicates based on the name and email combination and keeps only the record with the smallest id. Always back up your data before running delete operations!
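If your database supports window functions (PostgreSQL, SQL Server, MySQL 8+), a ROW_NUMBER() variant is a common alternative; some engines (MySQL among them) reject a DELETE whose subquery reads the target table directly. A minimal sketch against the same users table, in PostgreSQL syntax:
-- Keep the lowest id in each (name, email) group; delete the rest
WITH ranked AS (
    SELECT id,
           ROW_NUMBER() OVER (
               PARTITION BY name, email
               ORDER BY id
           ) AS rn
    FROM users
)
DELETE FROM users
WHERE id IN (SELECT id FROM ranked WHERE rn > 1);
On MySQL, the same idea works as a multi-table DELETE joined to a derived table.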
2. Identifying and Handling NULL Values
NULLs are placeholders for missing data, but they can cause bugs and skew reports if not handled correctly. Here’s how to identify and deal with them.
-- Find rows with NULLs in important columns
SELECT *
FROM orders
WHERE customer_id IS NULL;
-- Fill NULLs with default values
UPDATE orders
SET customer_id = 0
WHERE customer_id IS NULL;
-- Or use COALESCE to handle NULLs in queries
SELECT order_id, COALESCE(discount, 0) AS discount
FROM orders;
Use COALESCE() to substitute NULLs on the fly in SELECT queries, or fix them permanently with UPDATE when a sensible default exists.
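Before fixing anything, it helps to measure how widespread the gaps are. Since COUNT(column) skips NULLs while COUNT(*) counts every row, the difference gives a per-column NULL tally; here is a quick audit of the same orders table:
-- Count NULLs per column in a single pass
SELECT
    COUNT(*) - COUNT(customer_id) AS null_customer_ids,
    COUNT(*) - COUNT(discount)    AS null_discounts
FROM orders;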
3. Finding and Fixing Text Case or Formatting Inconsistencies
Inconsistent formatting—such as “USA”, “usa”, and “Usa”—can interfere with grouping and searching.
-- Normalize text case
UPDATE customers
SET country = UPPER(country);
-- Or use lower for emails
UPDATE users
SET email = LOWER(email);
-- Find inconsistent formats (example: phone numbers)
SELECT phone
FROM contacts
WHERE phone NOT LIKE '(___) ___-____';
Ensure textual consistency using functions like UPPER() and LOWER(), plus wildcard pattern matching with LIKE (or true regular expressions where your dialect supports them). Use constraints in your schema for long-term enforcement (e.g., CHECK constraints or triggers).
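As an example of that long-term enforcement, a CHECK constraint can reject badly formatted values at write time. A minimal sketch (the constraint name is illustrative, and existing rows must already pass the check, so run the normalization UPDATE first):
-- Reject mixed-case country values on future inserts and updates
ALTER TABLE customers
ADD CONSTRAINT chk_country_upper CHECK (country = UPPER(country));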
4. Detecting and Standardizing Typos or Misspellings
Data entered by hand can include typos, which are hard to detect, especially in free-text fields like city or product names.
-- Use approximate string matching (requires the pg_trgm extension in
-- PostgreSQL: CREATE EXTENSION IF NOT EXISTS pg_trgm;)
SELECT name
FROM products
WHERE name % 'iPhone';  -- % is pg_trgm's similarity operator
-- Correct known typos with CASE or lookup tables
UPDATE products
SET name =
    CASE
        WHEN name = 'iphon' THEN 'iPhone'
        WHEN name = 'iphne' THEN 'iPhone'
        ELSE name
    END
WHERE name IN ('iphon', 'iphne');
If your SQL dialect supports it, use similarity matching or fuzzy joins to catch typos. Otherwise, maintain a correction dictionary that maps known bad values to their canonical forms, as sketched below.
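A correction dictionary scales better than a hand-written CASE once the list of known typos grows. A sketch of the idea (table and column names are illustrative; UPDATE ... FROM is PostgreSQL syntax, so other dialects need their own join-update form):
-- Map known misspellings to their canonical form
CREATE TABLE product_name_fixes (
    bad_name  TEXT PRIMARY KEY,
    good_name TEXT NOT NULL
);

INSERT INTO product_name_fixes (bad_name, good_name) VALUES
    ('iphon', 'iPhone'),
    ('iphne', 'iPhone');

-- Apply every known correction in one statement
UPDATE products
SET name = f.good_name
FROM product_name_fixes f
WHERE products.name = f.bad_name;
New typos then become one-row inserts into the dictionary rather than code changes.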
5. Validating and Enforcing Consistent Data Types
Sometimes data in text fields should really be numeric or date types—especially after ingesting CSVs or external data.
-- Find rows where dates are stored as malformed strings.
-- Note: PostgreSQL's TO_DATE() raises an error on bad input instead of
-- returning NULL, so test the format with a regex (SQL Server offers
-- TRY_CAST(event_date AS date) IS NULL for the same job)
SELECT *
FROM events
WHERE event_date !~ '^\d{4}-\d{2}-\d{2}$';
-- Add constraints to enforce validation in the future (PostgreSQL regex syntax)
ALTER TABLE events
ADD CONSTRAINT chk_date_format CHECK (
    event_date ~ '^\d{4}-\d{2}-\d{2}$'
);
Use pattern-matching and casting to locate malformed entries. You can also use a staging table to validate before inserting into production tables.
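A staging workflow might look like the following sketch (PostgreSQL's CREATE TABLE ... LIKE syntax; table names are illustrative): load the raw import into staging, promote only rows that validate, and review whatever is left behind.
-- Stage raw imports, then promote only rows that pass validation
CREATE TABLE events_staging (LIKE events);

-- (bulk-load the raw CSV into events_staging here, e.g., with COPY)

INSERT INTO events
SELECT *
FROM events_staging
WHERE event_date ~ '^\d{4}-\d{2}-\d{2}$';

-- Rows still in staging that never made it into events need manual review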
Conclusion
SQL is not just for querying—it’s your first line of defense in maintaining clean, accurate data. By incorporating these recipes into your ETL pipelines, ad hoc reports, or data audits, you can build more trustworthy systems and deliver higher-quality insights. As a next step, consider packaging these patterns into stored procedures, triggers, or automated scripts to keep your data pristine over time.