Detect Duplicate Files on Your System with a Python Hashing Script
Duplicate files can quietly eat up precious disk space, especially on systems that see heavy data movement or where users frequently transfer files between directories. In this blog post, we’ll build a Python script that recursively scans directories, computes file hashes, and identifies duplicates. This approach is accurate, scalable, and a great example of real-world automation in Python.
1. Why Hashing Is the Right Strategy for Duplicate Detection
Before jumping into code, it’s helpful to understand why hashing is effective for finding duplicates. Two files are duplicates if their contents are identical. Rather than comparing every pair of files byte by byte, which scales poorly as the number of files grows, we use a cryptographic hash function (such as SHA-256) to generate a compact fingerprint for each file. If two files produce the same hash, they are almost certainly duplicates.
The hashlib module in Python provides robust hashing algorithms that are efficient and secure:
import hashlib
# Compute the SHA-256 hash of a given file
def hash_file(path):
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            hasher.update(chunk)
    return hasher.hexdigest()
This function reads the file in chunks, which minimizes memory usage even for large files.
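For a quick sanity check, you can hash any file you have on hand. This is just a usage sketch; the path below is a placeholder, so substitute any readable file on your system:

# Hypothetical path; replace with a real file on your system
digest = hash_file("/tmp/example.txt")
print(digest)  # a 64-character hexadecimal string for SHA-256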
2. Recursively Scan Files in a Directory
Our next step is to walk through a given directory and collect paths to all the files. To do this recursively while skipping symbolic links, we’ll use Python’s os and os.path modules; files we can’t read are dealt with later, at hashing time:
import os
def get_all_files(directory):
    file_paths = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filepath = os.path.join(root, name)
            # Skip symlinks and anything that is not a regular file
            if os.path.isfile(filepath) and not os.path.islink(filepath):
                file_paths.append(filepath)
    return file_paths
This function collects every regular file under the specified directory, skipping symlinks and anything that isn’t a regular file. Because os.walk yields results one directory at a time, it stays responsive even on large directory trees or network drives.
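If you also want visibility into directories the scan could not read, os.walk accepts an onerror callback that receives the OSError it hit. The variant below, get_all_files_verbose, is only an illustrative sketch (it reuses the os import from above), not part of the main script:

def get_all_files_verbose(directory):
    # Like get_all_files, but reports directories os.walk could not read
    def log_error(err):
        print(f"Skipping unreadable directory: {err.filename}")

    file_paths = []
    for root, dirs, files in os.walk(directory, onerror=log_error):
        for name in files:
            filepath = os.path.join(root, name)
            if os.path.isfile(filepath) and not os.path.islink(filepath):
                file_paths.append(filepath)
    return file_paths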
3. Build a Hash Map to Track File Duplicates
Now that we can compute a file’s hash and list all files in a directory, we can combine the two into a dictionary that maps each hash to a list of file paths:
def find_duplicates(directory):
    hashes = {}
    all_files = get_all_files(directory)
    for path in all_files:
        try:
            file_hash = hash_file(path)
            if file_hash in hashes:
                hashes[file_hash].append(path)
            else:
                hashes[file_hash] = [path]
        except OSError:  # Covers PermissionError and other I/O failures
            print(f"Skipping inaccessible file: {path}")
    return {k: v for k, v in hashes.items() if len(v) > 1}
This function efficiently collects all duplicate files and groups them by their shared hash. The final dictionary keeps only those hashes that are linked to more than one file — our duplicates.
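To try it out, here is a quick usage sketch; the path is a placeholder, and the functions defined above are assumed to be in scope:

# Hypothetical path; replace with the directory you want to scan
duplicates = find_duplicates("/path/to/search")
for file_hash, paths in duplicates.items():
    print(file_hash[:12], "->", len(paths), "copies")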
4. Reporting and Cleaning Duplicate Files
Once we’ve identified duplicate files, we can print them out in a human-readable format or optionally remove them to free up disk space:
def report_duplicates(duplicates):
    for hashcode, files in duplicates.items():
        print(f"\nDuplicate group (Hash: {hashcode}):")
        for i, path in enumerate(files):
            print(f" {i+1}. {path}")

# Optional deletion logic (keep one copy)
def delete_duplicates(duplicates):
    for files in duplicates.values():
        for path in files[1:]:  # Keep first, delete the rest
            try:
                os.remove(path)
                print(f"Deleted: {path}")
            except Exception as e:
                print(f"Error deleting {path}: {e}")
This gives you the power to either audit duplicate files or reclaim space safely by deleting redundant ones.
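If you’d like to know how much space deletion would actually reclaim before committing, a small helper built on os.path.getsize can sum the sizes of the redundant copies. This is a sketch, not part of the script above:

def estimate_reclaimable_space(duplicates):
    # Sum the sizes of every redundant copy (all but the first file in each group)
    total = 0
    for files in duplicates.values():
        for path in files[1:]:
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # File vanished or is unreadable; ignore it
    return total

# Usage, assuming `duplicates` came from find_duplicates(...)
print(f"Reclaimable: {estimate_reclaimable_space(duplicates) / (1024 * 1024):.1f} MiB")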
5. Wrapping It All Together: A Complete CLI Tool
Let’s wrap this into a runnable script with argument parsing using argparse to make it user-friendly:
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Find and remove duplicate files based on content hash.")
    parser.add_argument('directory', help='Root directory to scan for duplicates')
    parser.add_argument('--delete', action='store_true', help='Delete duplicate files')
    args = parser.parse_args()

    duplicates = find_duplicates(args.directory)
    if not duplicates:
        print("No duplicates found.")
    else:
        report_duplicates(duplicates)
        if args.delete:
            confirm = input("Are you sure you want to delete duplicates? [y/N]: ").lower()
            if confirm == 'y':
                delete_duplicates(duplicates)
Save this script as find_duplicates.py and run it from the command line as follows:
$ python find_duplicates.py /path/to/search --delete
This turns your duplicate-finding script into a single-command disk cleanup utility!
Final Thoughts and Optimization Tips
- Memory Efficiency: Reading files in chunks avoids loading entire files into memory.
- Performance: For extremely large systems, consider multi-threading the hashing step; see the sketch after this list.
- False Positives: Using SHA-256 makes hash collisions virtually impossible — safe enough for file comparison.
- Symlink Awareness: The scanner above already skips symlinked files; you could extend it to handle symlinked directories or detect duplicate folders as well.
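As a starting point for the performance tip above, here is a minimal multi-threaded sketch using concurrent.futures from the standard library. The names safe_hash and find_duplicates_parallel are illustrative, and it reuses hash_file and get_all_files from earlier; treat it as a sketch rather than a drop-in replacement:

from concurrent.futures import ThreadPoolExecutor

def safe_hash(path):
    # Return None instead of raising, so one unreadable file doesn't stop the pool
    try:
        return hash_file(path)
    except OSError:
        return None

def find_duplicates_parallel(directory, workers=4):
    all_files = get_all_files(directory)
    hashes = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with all_files
        for path, file_hash in zip(all_files, pool.map(safe_hash, all_files)):
            if file_hash is None:
                continue  # Unreadable file; skip it
            hashes.setdefault(file_hash, []).append(path)
    return {k: v for k, v in hashes.items() if len(v) > 1}

Threads help here mainly because hashing is dominated by disk reads; measure on your own storage before assuming a speedup.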
This script is an excellent foundation for more advanced file management automation. Whether it’s for personal disk cleanup or system maintenance utilities, hashing for duplicates is a clean, reliable solution.