Using Python’s difflib to Build a Simple ‘Track Changes’ Tool
Comparison is at the heart of version control, document editing, and config management. Whether you’re reviewing legal contracts or analyzing changes in server configs, being able to track changes between two versions of a text is essential. In this tutorial, we’ll explore how Python’s difflib module, particularly the SequenceMatcher class, can help us build a simple ‘track changes’ tool. You’ll learn how to highlight word-level differences in a clear and developer-friendly way.
1. Why Use difflib and SequenceMatcher?
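Python’s difflib module is part of the standard library and offers tools for comparing sequences, be it strings, lists, or lines read from files. The real star is SequenceMatcher, which repeatedly finds the longest contiguous matching subsequence that contains no “junk” elements. Nothing is treated as junk unless you pass an isjunk function as the first argument (None means no junk), although an automatic heuristic may discount very frequent items once a sequence reaches 200 elements.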
Here’s a taste of what it can do:
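from difflib import SequenceMatcher

a = "The quick brown fox"
b = "The swift brown fox"

# None as the first argument means no element is treated as junk
matcher = SequenceMatcher(None, a, b)

# Each opcode describes how to turn a[i1:i2] into b[j1:j2]
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag}: a[{i1}:{i2}] -> b[{j1}:{j2}]")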
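Output:
equal: a[0:4] -> b[0:4]
replace: a[4:6] -> b[4:6]
equal: a[6:7] -> b[6:7]
replace: a[7:9] -> b[7:9]
equal: a[9:19] -> b[9:19]
This tells us that the only differences lie between indices 4 and 9 (“quick” -> “swift”). Because we compared character by character, the matcher also pairs up the “i” that both words share, which splits the change into two small replacements. Everything else is identical.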
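To see how much the choice of token affects the result, here is a minimal sketch that runs the same two sentences through SequenceMatcher twice, once as strings and once as word lists. The helper name opcode_slices is made up for this illustration; it only wraps get_opcodes() and returns the actual slices instead of index pairs.

```python
from difflib import SequenceMatcher

def opcode_slices(a, b):
    """Return (tag, slice_of_a, slice_of_b) for each opcode.

    Works for any pair of sequences with hashable elements,
    e.g. two strings or two lists of words.
    """
    matcher = SequenceMatcher(None, a, b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

old = "The quick brown fox"
new = "The swift brown fox"

print("Character level:")
for tag, before, after in opcode_slices(old, new):
    print(f"  {tag:8} {before!r} -> {after!r}")

print("Word level:")
for tag, before, after in opcode_slices(old.split(), new.split()):
    print(f"  {tag:8} {before!r} -> {after!r}")
```

At character level the change is fragmented around the shared letter, while at word level it collapses into a single replace of one word, which is usually what a reviewer wants to see.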
2. Comparing Text at Word-Level
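To build a diff viewer, we need to compare words, not characters. Let’s split our input text on whitespace and run SequenceMatcher on the resulting token lists. Note that split() discards the original spacing, so the output is re-joined with single spaces.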
def word_diff(a, b):
    # Tokenize both texts into words
    a_words = a.split()
    b_words = b.split()
    matcher = SequenceMatcher(None, a_words, b_words)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(a_words[i1:i2])
        elif tag == "replace":
            # A replacement is shown as a deletion followed by an insertion
            result.extend([f"[-{w}-]" for w in a_words[i1:i2]])
            result.extend([f"[+{w}+]" for w in b_words[j1:j2]])
        elif tag == "delete":
            result.extend([f"[-{w}-]" for w in a_words[i1:i2]])
        elif tag == "insert":
            result.extend([f"[+{w}+]" for w in b_words[j1:j2]])
    return ' '.join(result)

# Demo
original = "The quick brown fox jumps over the lazy dog"
revised = "The quick brown fox leaped over a lazy dog"
print(word_diff(original, revised))
Output:
The quick brown fox [-jumps-] [+leaped+] over [-the-] [+a+] lazy dog
This function visually marks deleted and inserted words using brackets. It’s simple but powerful for small-scale diffing.
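When several adjacent words change at once, marking each word separately can get noisy. One possible variation, sketched below under the assumption that a single marker per changed run reads better, wraps the whole run instead. The name word_diff_grouped and the sample sentences are invented for this example.

```python
from difflib import SequenceMatcher

def word_diff_grouped(a, b):
    """Word-level diff that wraps each changed run in a single marker."""
    a_words, b_words = a.split(), b.split()
    matcher = SequenceMatcher(None, a_words, b_words)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(" ".join(a_words[i1:i2]))
            continue
        # "replace" contributes both a deletion and an insertion
        if tag in ("replace", "delete"):
            parts.append("[-" + " ".join(a_words[i1:i2]) + "-]")
        if tag in ("replace", "insert"):
            parts.append("[+" + " ".join(b_words[j1:j2]) + "+]")
    return " ".join(parts)

print(word_diff_grouped("Please review the draft today",
                        "Please check and sign the draft"))
# Please [-review-] [+check and sign+] the draft [-today-]
```

The example also exercises a pure deletion ("today"), a case the demo above does not hit.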
3. Enhancing Output with HTML Formatting
Let’s apply some basic HTML tagging to make the results web-friendly. We will wrap deletions in <del> tags (styled red with a strike-through) and insertions in <ins> tags (styled green and bold).
def word_diff_html(a, b):
    a_words = a.split()
    b_words = b.split()
    matcher = SequenceMatcher(None, a_words, b_words)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(a_words[i1:i2])
        elif tag == "replace":
            # Old words become <del>, new words become <ins>
            result.extend([f"<del>{w}</del>" for w in a_words[i1:i2]])
            result.extend([f"<ins>{w}</ins>" for w in b_words[j1:j2]])
        elif tag == "delete":
            result.extend([f"<del>{w}</del>" for w in a_words[i1:i2]])
        elif tag == "insert":
            result.extend([f"<ins>{w}</ins>" for w in b_words[j1:j2]])
    return ' '.join(result)

html_diff = word_diff_html(original, revised)
print(html_diff)
To style this properly, just add some CSS:
<style>
del { color: red; text-decoration: line-through; }
ins { color: green; font-weight: bold; }
</style>
When rendered in a browser, this highlights deletions in red and insertions in green. A great start for document comparison tools!
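One caveat before putting this on a real page: if the compared text can contain characters such as < or &, inserting it into HTML unescaped will break the markup or worse. The following is a minimal sketch of an escaping variant, not part of the original function; word_diff_html_safe is a hypothetical name, and it relies on html.escape from the standard library.

```python
from difflib import SequenceMatcher
from html import escape

def word_diff_html_safe(a, b):
    """Word-level HTML diff that escapes every token before tagging it."""
    # Escaping first is safe for matching: distinct words stay distinct.
    a_words = [escape(w) for w in a.split()]
    b_words = [escape(w) for w in b.split()]
    matcher = SequenceMatcher(None, a_words, b_words)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(a_words[i1:i2])
            continue
        # For delete/insert one of the two slices is simply empty
        result.extend(f"<del>{w}</del>" for w in a_words[i1:i2])
        result.extend(f"<ins>{w}</ins>" for w in b_words[j1:j2])
    return " ".join(result)

print(word_diff_html_safe("if x < 5 then stop", "if x <= 5 then stop"))
# if x <del>&lt;</del> <ins>&lt;=</ins> 5 then stop
```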
4. Using difflib.unified_diff for Multiline Documents
For config files or source code, working with line-level diffs can be more practical. Python’s unified_diff handles this well.
from difflib import unified_diff

def file_diff(file1, file2):
    # Read both files as lists of lines (line endings are kept)
    with open(file1) as f1, open(file2) as f2:
        a = f1.readlines()
        b = f2.readlines()
    diff = unified_diff(a, b, fromfile=file1, tofile=file2)
    for line in diff:
        # Each line already ends with a newline, so suppress print's own
        print(line, end='')

# Usage
file_diff('config_old.txt', 'config_new.txt')
This function gives you git diff-style output, which is excellent for integration into CLI tools or logging systems.
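The same call works on text that is already in memory, which is handy for tests or for content pulled from a database. Here is a small sketch under that assumption; text_diff, the default labels "old" and "new", and the sample config are all invented for illustration.

```python
from difflib import unified_diff

def text_diff(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two strings as a single string."""
    # keepends=True keeps the newlines, matching what readlines() returns
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    )

old_cfg = "host = localhost\nport = 8080\ndebug = true\n"
new_cfg = "host = localhost\nport = 9090\ndebug = true\n"
print(text_diff(old_cfg, new_cfg), end="")
```

Identical inputs produce an empty string, so the return value can double as a cheap "did anything change?" check.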
5. Performance Tips and Optimizations
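1. Reuse SequenceMatcher objects only where it pays off: SequenceMatcher caches detailed information about its second sequence, so when you compare one text against many candidates, set it once with set_seq2 and call set_seq1 for each candidate. For one-off comparisons, creating a new matcher each time is clearer and costs nothing extra.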
2. Watch your input size: For very large documents (like legal audits), consider running diffs at a higher granularity first—e.g., by paragraph or sentence—before diving into word-level diffs.
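3. Junk function optimization: If comparing structured config files where comments matter less than settings, pass in a custom isjunk function to SequenceMatcher.
matcher = SequenceMatcher(lambda line: line.lstrip().startswith(('#', '//')), a_lines, b_lines)
This keeps comment lines from being used as anchors when the matcher aligns the two files, which tends to produce cleaner groupings. Junk lines are not removed from the result, though: an edited comment still shows up in the opcodes. To ignore comments entirely, filter them out of both line lists before comparing.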
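The coarse-to-fine idea from tip 2 can be sketched roughly as follows. This is one possible arrangement, not the only one: it assumes paragraphs are separated by blank lines, and the helper names mark_words and paragraph_then_word_diff are made up for the example.

```python
from difflib import SequenceMatcher

def mark_words(old, new):
    """Word-level diff of two strings using [-...-] / [+...+] markers."""
    a, b = old.split(), new.split()
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        else:
            # For delete/insert one of the two slices is empty
            out.extend(f"[-{w}-]" for w in a[i1:i2])
            out.extend(f"[+{w}+]" for w in b[j1:j2])
    return " ".join(out)

def split_paragraphs(text):
    # Assumption: a blank line separates paragraphs
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def paragraph_then_word_diff(old_text, new_text):
    """Diff paragraph lists first; run the word pass only on changed runs."""
    old_paras = split_paragraphs(old_text)
    new_paras = split_paragraphs(new_text)
    matcher = SequenceMatcher(None, old_paras, new_paras)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical paragraphs need no word-level work
        old_chunk = " ".join(old_paras[i1:i2])
        new_chunk = " ".join(new_paras[j1:j2])
        report.append(mark_words(old_chunk, new_chunk))
    return report

old_doc = ("Intro stays the same.\n\nPayment is due in 30 days."
           "\n\nClosing stays the same.")
new_doc = ("Intro stays the same.\n\nPayment is due in 45 days."
           "\n\nClosing stays the same.")
print(paragraph_then_word_diff(old_doc, new_doc))
# ['Payment is due in [-30-] [+45+] days.']
```

Only the one changed paragraph reaches the word-level pass, so the expensive comparison runs on a small fraction of the document.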
Conclusion
Python’s difflib module provides a surprisingly rich feature set for implementing change-tracking tools without heavyweight dependencies. From simple word diffs to full-text comparisons, it gives developers the flexibility they need to build anything from review tools to change logs.
Give this a try in your next document-automation or configuration-diffing project—and enjoy the beauty of seeing word-level edits with just a few lines of Python.