Bash Scripting Challenge: Find the Top N Largest Files in a Directory Tree
Managing disk space efficiently is a common concern for system administrators and developers alike. Files can grow large over time and silently hog disk space, especially on development machines or shared Linux servers. As such, having a reliable Bash script to identify the top N largest files in a directory tree can be a game-changer. In this guide, we’ll write a practical Bash script using standard Unix tools like find, du, sort, and head to solve this challenge efficiently.
1. Problem Breakdown: What We Need to Achieve
We want to write a Bash script that will:
- Traverse a given directory tree recursively
- Measure the size of each file
- Sort the files by size in descending order
- Output the top N largest files with their sizes
This is particularly useful for finding logs, media files, backup dumps, or database exports that may be consuming excessive disk space.
2. Building Blocks: Understanding The Unix Tools
We’ll use the following Unix commands:
- find: search for files recursively
- du -b or stat -c%s: get file sizes in bytes
- sort -nr: sort numbers in reverse (largest to smallest)
- head -n: select the top N entries
A test run might look like this:
find . -type f -exec du -b {} + | sort -nr | head -n 10
This command lists the 10 largest files in the current directory tree. However, it can be fragile: file names containing newlines or other unusual characters confuse the line-oriented pipeline. Let’s fix that in our script.
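To see the problem first-hand, create a throwaway file whose name contains a newline (a contrived, hypothetical name) and rerun the pipeline:
touch $'weird\nname.log'
find . -type f -exec du -b {} + | sort -nr | head -n 3
# The entry for this file is split across two output lines, and sort then
# scatters the halves, so the path can no longer be parsed reliably.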
3. Writing the Robust Bash Script
Let’s turn this into a working script that accepts two parameters: the starting directory and N (how many of the largest files to return).
#!/bin/bash
# Usage: ./find_largest_files.sh /path/to/start N
DIR=${1:-.}      # starting directory (defaults to the current directory)
COUNT=${2:-10}   # how many of the largest files to report (defaults to 10)

if [ ! -d "$DIR" ]; then
  echo "Error: Directory '$DIR' not found." >&2
  exit 1
fi

# -print0 and read -d '' keep file names with spaces or newlines intact
find "$DIR" -type f -print0 | \
while IFS= read -r -d '' file; do
  size=$(stat -c%s "$file" 2>/dev/null) || continue   # skip files that vanish mid-run
  printf '%s\t%s\n' "$size" "$file"                   # printf avoids echo -e mangling backslashes
done | sort -nr | head -n "$COUNT"
How it works:
- find uses -print0 to handle unusual characters and whitespace in filenames.
- The while loop reads file paths safely and uses stat -c%s to get file sizes in bytes.
- Output is sorted numerically in reverse to show the largest sizes first.
- Only the top N lines are returned.
All of this makes the script robust for practical filesystem traversal. (Note that stat -c%s is the GNU syntax; on BSD and macOS the equivalent is stat -f%z.)
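If you prefer human-readable sizes in the report, one option (assuming GNU coreutils, which ships numfmt) is to post-process the size column rather than changing the script; --field=1 converts only the leading byte count and leaves the path untouched:
./find_largest_files.sh /var/log 5 | numfmt --field=1 --to=iec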
4. Usage Examples and Practical Tips
Save the script as find_largest_files.sh and make it executable:
chmod +x find_largest_files.sh
Example usage to find the top 15 largest files in your home directory:
./find_largest_files.sh ~/ 15
Want to search only in /var/log and get the top 5 files?
./find_largest_files.sh /var/log 5
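Each output line is the file size in bytes, a tab, then the path, so a run might look roughly like this (sizes and paths here are invented purely for illustration):
524288000	/var/log/syslog.1
104857600	/var/log/auth.log
52428800	/var/log/dpkg.log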
Pro tip: Use du -h for human-readable sizes if needed; since the units (K, M, G) are then no longer purely numeric, sort with sort -h instead of sort -n:
find "$DIR" -type f -exec du -h {} + | sort -hr | head -n "$COUNT"
Keep in mind that du -h rounds sizes and reports disk usage rather than exact byte counts, so it is less precise than the byte-based script.
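A quick illustration of why sort -h (a GNU extension) is the right choice once sizes carry unit suffixes:
printf '900K\n2.0M\n1.5G\n' | sort -nr   # wrong order: 900K comes first
printf '900K\n2.0M\n1.5G\n' | sort -hr   # correct order: 1.5G, 2.0M, 900K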
5. Optimizations and Performance Considerations
This script is efficient for thousands of files but might slow down on millions. Here are some tips for better performance:
- Avoid subdirectories you don’t want by using -prune or grep -v.
- Use nice and ionice to reduce system load:
nice -n 19 ionice -c2 -n7 ./find_largest_files.sh / 20
- Cache directory listings if you’re running the script many times.
- Parallelize with xargs -P if the per-file processing involves more complex logic (see the sketch after this list).
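As a minimal sketch (assuming GNU find, xargs, and stat; the pruned directory names are just examples), here is a variant that skips .git and node_modules trees and runs stat in parallel batches:
find "$DIR" \( -name .git -o -name node_modules \) -prune -o -type f -print0 | \
  xargs -0 -P 4 -n 64 stat -c '%s %n' 2>/dev/null | \
  sort -nr | head -n "$COUNT"
# Note: with -P, output lines from parallel stat batches can interleave;
# keeping -n small reduces the risk of split lines.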
Still hitting performance limits? Consider writing a Python version using os.walk() and multithreading for even more control.
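Before reaching for another language, note that GNU find can also print sizes itself via -printf, which eliminates the per-file stat processes entirely (this is GNU-specific and, like any line-based output, still assumes file names without embedded newlines):
find "$DIR" -type f -printf '%s\t%p\n' 2>/dev/null | sort -nr | head -n "$COUNT"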
6. Conclusion
With just a few lines of Bash and standard Unix tools, you can efficiently find disk-hogging files anywhere in your system. This script is a great addition to your sysadmin or DevOps toolkit, useful for cleaning up servers or analyzing backups. Customize it, schedule it via cron, or include it in your health-check routines. Cleaning out the clutter has never been easier.
Happy scripting!