Finding duplicate words in writing: a handy Perl script

Posted on March 1, 2006

While we read, our minds subconsciously correct mistakes and overlook omissions in the stream of words we see, especially when reading familiar texts. This mental feature, which allows us to skim long documents, has a nasty drawback when we are writing: it makes our own mistakes harder to spot.

One of the most common writing mistakes that our brains stealthily correct is the the duplicate word problem. For example, I inserted a double the into the previous sentence. Did you catch it?

If so, don’t be too proud of your accomplishment. It is easier to see errors in others’ writing than in your own. Your brain is attuned to your natural writing patterns and much more likely to repair your mistakes without your knowing.

To overcome this problem, some writers recommend reading your work backward, but I think computers are a more practical solution.

Here’s the Perl script that I use to spot duplicate words:

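#!/usr/bin/perl -n00
# dupwords.pl - find duplicate words in the input stream
# (-n loops over the input; -00 reads it one paragraph at a time)

# Report every word that is followed by whitespace and the same word again.
print "$ARGV: para $.: ($1)\n"
    while /(\b(\w+)\b\s+\b\2\b)/sg;

# Restart the paragraph count ($.) for each input file.
close ARGV if eof;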

I use this script from Emacs via shell-command-on-region. I also use it from the command line to find duplicate-word errors in batch:

find . -name '*.txt' -print0 | xargs -0 dupwords.pl

The duplicate-words problem is a favorite for programming cookbooks, so if you don’t like my recipe (or Perl), you have many other options.