ClusterBy: a handy little function for the toolbox
Posted by Tom Moertel Sat, 01 Sep 2007 19:39:00 GMT
Via Reddit I found Mark Nelson’s post about a recent word puzzle from NPR’s Weekend Edition:
Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?
The puzzle is fairly straightforward to solve by hand (think about it), but let’s write a program to solve it. That will give us a convenient excuse to discuss a super-handy function I use all the time: clusterBy. In Haskell, it looks like this:
import Control.Arrow ((&&&))
import qualified Data.Map as M
clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++)
. map (f &&& return)
What clusterBy does is group a list of values by their signatures, as computed by a given signature function f, and returns the groups in order of ascending signature. For example, we can cluster the words “the tan ant gets some fat” by length, by first letter, or by last letter just by changing the signature function we give to clusterBy:
*Main> let antwords = words "the tan ant gets some fat"
*Main> clusterBy length antwords
[["the","tan","ant","fat"],["gets","some"]]
*Main> clusterBy head antwords
[["ant"],["fat"],["gets"],["some"],["the","tan"]]
*Main> clusterBy last antwords
[["the","some"],["tan"],["gets"],["ant","fat"]]
If we use sort as the signature function, we can find anagrams:
*Main> clusterBy sort antwords
[["fat"],["tan","ant"],["gets"],["the"],["some"]]
And that brings us back to the original puzzle. To find the solution, we must consider each unique pair of state names to form a “word” and find the anagrams among a list of such “words.”
Assuming we are given a list of state names on standard input, one state per line, we can write the shell of our solution as follows:
main = mapM_ print . solve . lines =<< getContents
The shell delegates the real work to solve. It’s job is to compute the unique, 2-state combinations from the original list of states, and then find the anagrams among these combinations. As before, finding the anagrams is simply a matter of calling clusterBy with the right signature function. We also filter out the trivial results, which are not valid solutions:
solve = filter ((>1) . length) . clusterBy signature . ucombos
ucombos xs = [[x,y] | x <- xs, y <- xs, x < y]
signature = sort . filter isAlpha . concat -- sort letters
That’s it. Now we can solve the puzzle by feeding our program a list of states:
$ runhaskell anagrams2.hs < states.txt[["NORTH CAROLINA","SOUTH DAKOTA"],
["NORTH DAKOTA","SOUTH CAROLINA"]]What a handy little function, that clusterBy.
Update: made clear that clusterBy returns clusters in order of ascending signature.
Update 2007-10-31: For more interesting discussion of clusterBy and the original puzzle from NPR, see Anders Pearson’s blog: A Simple Programming Puzzle Seen Through Three Different Lenses.
readers
That’s a neat function !
P!
I tried to implement this in Ruby – of course it’s not as good as the Haskell version, and there’s probably a much more elegant way to do it, but here it is:
[I tweaked the code to fit into the narrow column width. —Tom]
My question is in regard to the “signature”, how does collectTo define organizing “ascending” with arbitrary types of the signature? I suppose it just calls “sort”, which itself can support arbitrary types. Thoughts?
Paul, thanks for the Ruby implementation. I was able to shorten it by having new Hash slots default to empty arrays:
Regarding your question:
It uses the natural ordering of the signature values.
If you look at the type of clusterBy, you’ll see that the result of the signature function f has the type b and that b is qualified to be a member of the Ord type class. That means that any signature function you provide to clusterBy must produce results that can be ordered (and hence sorted). The sorting itself happens implicitly: Data.Map stores keys in an ordered tree and its elems function is defined to “return all elements of the map in the ascending order of their keys.”
Thanks for your comment!
Cheers,
Tom
You can also do
The same semantics are in turn implementable in terms of two other functions that I continually find helpful:
I suspect this is less efficient, though.
Oops. What I wrote isn’t true – groupBy won’t put a pair of elements that satisfies the predicate into the same group if they are separated by an element that belongs in a different group.
In other words, order matters to groupBy but not to clusterBy.
Strangely enough, I wrote something similar the day before yesterday… I was making a map from morse values (lists of dit|dah) to strings.
Seems like clusterBy could be quite a common method!
Whoops, that was a later version. The earlier one had just (++) instead of that lambda.
ClusterBy is very nice, my common lisp solution http://paste.lisp.org/display/50081
cheers Slobodan Blazeski
If you
fromListWith (flip (++)), you shouldn’t have tomap reverse, iiuc.Nick, the problem with using flip
(++)as the combining function for Map.fromListWith is that it results in quadratic run-time w.r.t. cluster size. By making sure the new element (which is passed as the first argument to the combining function) is always added to the head of the list, we are able to implement cluster extension with a single cons operation—O(1).If you wanted to keep the low cost and eliminate the reverse step, you could use Data.Sequence values instead of lists to represent clusters. For clarity, however, I just used lists.
Cheers,
Tom