ClusterBy: a handy little function for the toolbox

Posted on September 1, 2007

Tags: haskell, clusterby, puzzles, hof, functions

Via Reddit I found Mark Nelson’s post about a recent word puzzle from NPR’s Weekend Edition:

Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?

The puzzle is fairly straightforward to solve by hand (think about it), but let’s write a program to solve it. That will give us a convenient excuse to discuss a super-handy function I use all the time: clusterBy. In Haskell, it looks like this:

import Control.Arrow ((&&&))
import qualified Data.Map as M

clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++)
            . map (f &&& return)

clusterBy groups a list of values by their signatures, as computed by a given signature function f, and returns the groups in order of ascending signature. Within each group, values keep the order they had in the input. For example, we can cluster the words “the tan ant gets some fat” by length, by first letter, or by last letter just by changing the signature function we give to clusterBy:

*Main> let antwords = words "the tan ant gets some fat"

*Main> clusterBy length antwords
[["the","tan","ant","fat"],["gets","some"]]

*Main> clusterBy head antwords
[["ant"],["fat"],["gets"],["some"],["the","tan"]]

*Main> clusterBy last antwords
[["the","some"],["tan"],["gets"],["ant","fat"]]
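To see how the one-liner produces these results, it can help to unroll the pipeline into named stages. This is only a sketch: clusterBySteps and the intermediate names are hypothetical, introduced here for illustration, and the function is meant to behave the same as clusterBy above.

```haskell
import qualified Data.Map as M

-- Hypothetical, step-by-step version of clusterBy (illustration only).
clusterBySteps :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBySteps f xs = M.elems ordered   -- clusters in ascending key order
  where
    -- 1. Pair each value with its signature, wrapping the value in a
    --    singleton list: (f x, [x]).
    tagged  = [(f x, [x]) | x <- xs]
    -- 2. Build a map keyed by signature. When a key repeats, fromListWith
    --    calls (++) with the newer list first, so each cluster is
    --    accumulated in reverse order of appearance.
    grouped = M.fromListWith (++) tagged
    -- 3. Reverse each cluster to restore the original input order.
    ordered = M.map reverse grouped
```

Calling clusterBySteps length antwords should give the same two clusters as clusterBy length antwords. The ascending order of the result comes for free from Data.Map, which keeps its keys sorted.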

If we use sort (from Data.List) as the signature function, we can find anagrams, because two words are anagrams exactly when their sorted letters are equal:

*Main> clusterBy sort antwords
[["fat"],["tan","ant"],["gets"],["the"],["some"]]

And that brings us back to the original puzzle. To find the solution, we treat each unique pair of state names as a single “word” and then look for anagrams among those “words.”

Assuming we are given a list of state names on standard input, one state per line, we can write the shell of our solution as follows:

main = mapM_ print . solve . lines =<< getContents

The shell delegates the real work to solve. Its job is to compute the unique two-state combinations from the original list of states, and then find the anagrams among these combinations. As before, finding the anagrams is simply a matter of calling clusterBy with the right signature function. We also filter out the trivial results, clusters that contain only a single pair, which are not valid solutions:

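-- Needs `import Data.Char (isAlpha)` and `import Data.List (sort)`
-- alongside the other imports at the top of the file.
solve = filter ((>1) . length) . clusterBy signature . ucombos   -- keep real matches
ucombos xs = [[x,y] | x <- xs, y <- xs, x < y]   -- each unordered pair once
signature = sort . filter isAlpha . concat   -- sort letters, ignoring spaces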
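The x < y guard in ucombos is what keeps each unordered pair exactly once and drops a state paired with itself. As a sketch of an alternative that does not depend on comparing names, pairs can also be drawn from the tails of the list. ucombosTails is a hypothetical name, not part of the original program.

```haskell
import Data.List (tails)

-- The original definition, repeated with a type signature for reference.
ucombos :: Ord a => [a] -> [[a]]
ucombos xs = [[x, y] | x <- xs, y <- xs, x < y]

-- Hypothetical alternative: pair each element with the ones after it.
-- The empty tail fails the (x:rest) pattern and is silently skipped.
ucombosTails :: [a] -> [[a]]
ucombosTails xs = [[x, y] | (x:rest) <- tails xs, y <- rest]
```

On a sorted list without duplicates, such as an alphabetical list of states, both versions should produce the same pairs in the same order. They differ on unsorted or repeated input, where only the x < y version normalizes each pair.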

That’s it. Now we can solve the puzzle by feeding our program a list of states:

$ runhaskell anagrams2.hs < states.txt

[["NORTH CAROLINA","SOUTH DAKOTA"],
 ["NORTH DAKOTA","SOUTH CAROLINA"]]
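To see why these two pairs land in the same cluster, we can compare their signatures directly. The sketch below repeats signature from above with an explicit type; the names pairA, pairB, and sameSignature are hypothetical, added only for this check.

```haskell
import Data.Char (isAlpha)
import Data.List (sort)

-- Same definition as in the solution, with an explicit type.
signature :: [String] -> String
signature = sort . filter isAlpha . concat

-- The two pairs from the program's output (hypothetical names).
pairA, pairB :: [String]
pairA = ["NORTH CAROLINA", "SOUTH DAKOTA"]
pairB = ["NORTH DAKOTA", "SOUTH CAROLINA"]

-- True when both pairs use exactly the same multiset of letters.
sameSignature :: Bool
sameSignature = signature pairA == signature pairB
```

Both pairs are built from the same four words (NORTH, SOUTH, CAROLINA, DAKOTA), so sameSignature evaluates to True. The filter isAlpha step drops the spaces before sorting.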

What a handy little function, that clusterBy.

Update: made clear that clusterBy returns clusters in order of ascending signature.

Update 2007-10-31: For more interesting discussion of clusterBy and the original puzzle from NPR, see Anders Pearson’s blog: A Simple Programming Puzzle Seen Through Three Different Lenses.