ClusterBy: a handy little function for the toolbox

Posted by Tom Moertel Sat, 01 Sep 2007 19:39:00 GMT

Via Reddit I found Mark Nelson’s post about a recent word puzzle from NPR’s Weekend Edition:

Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?

The puzzle is fairly straightforward to solve by hand (think about it), but let’s write a program to solve it. That will give us a convenient excuse to discuss a super-handy function I use all the time: clusterBy. In Haskell, it looks like this:

import Control.Arrow ((&&&))
import qualified Data.Map as M

clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++)
            . map (f &&& return)

What clusterBy does is group a list of values by their signatures, as computed by a given signature function f, and returns the groups in order of ascending signature. For example, we can cluster the words “the tan ant gets some fat” by length, by first letter, or by last letter just by changing the signature function we give to clusterBy:

*Main> let antwords = words "the tan ant gets some fat"

*Main> clusterBy length antwords
[["the","tan","ant","fat"],["gets","some"]]

*Main> clusterBy head antwords
[["ant"],["fat"],["gets"],["some"],["the","tan"]]

*Main> clusterBy last antwords
[["the","some"],["tan"],["gets"],["ant","fat"]]

If we use sort as the signature function, we can find anagrams:

*Main> clusterBy sort antwords
[["fat"],["tan","ant"],["gets"],["the"],["some"]]

And that brings us back to the original puzzle. To find the solution, we must consider each unique pair of state names to form a “word” and find the anagrams among a list of such “words.”

Assuming we are given a list of state names on standard input, one state per line, we can write the shell of our solution as follows:

main = mapM_ print . solve . lines =<< getContents

The shell delegates the real work to solve. It’s job is to compute the unique, 2-state combinations from the original list of states, and then find the anagrams among these combinations. As before, finding the anagrams is simply a matter of calling clusterBy with the right signature function. We also filter out the trivial results, which are not valid solutions:

solve = filter ((>1) . length) . clusterBy signature . ucombos
ucombos xs = [[x,y] | x <- xs, y <- xs, x < y]
signature = sort . filter isAlpha . concat   -- sort letters

That’s it. Now we can solve the puzzle by feeding our program a list of states:

$ runhaskell anagrams2.hs < states.txt
[["NORTH CAROLINA","SOUTH DAKOTA"],
 ["NORTH DAKOTA","SOUTH CAROLINA"]]

What a handy little function, that clusterBy.

Update: made clear that clusterBy returns clusters in order of ascending signature.

Update 2007-10-31: For more interesting discussion of clusterBy and the original puzzle from NPR, see Anders Pearson’s blog: A Simple Programming Puzzle Seen Through Three Different Lenses.

Posted in
Tags , , , ,
11 comments
no trackbacks
Reddit Delicious

Comments

  1. pied said about 7 hours later:

    That’s a neat function !

    P!

  2. Paul Betts said 1 day later:

    I tried to implement this in Ruby – of course it’s not as good as the Haskell version, and there’s probably a much more elegant way to do it, but here it is:

    module Enumerable
      def cluster_by(&b)
        buf = {}; self.each { |x|
          a = yield x;
          buf[a] ? buf[a] << x : buf[a] = [x]
        }
        buf.keys.sort!.collect {|x| buf[x]}
      end
    end

    [I tweaked the code to fit into the narrow column width. —Tom]

    My question is in regard to the “signature”, how does collectTo define organizing “ascending” with arbitrary types of the signature? I suppose it just calls “sort”, which itself can support arbitrary types. Thoughts?

  3. Tom Moertel said 1 day later:

    Paul, thanks for the Ruby implementation. I was able to shorten it by having new Hash slots default to empty arrays:

    module Enumerable
      def cluster_by(&sig_fn)
        h = Hash.new { |h,k| h[k] = [] }
        self.each { |x| h[sig_fn[x]] << x }
        h.values_at(*h.keys.sort!)
      end
    end

    Regarding your question:

    [H]ow does [clusterBy] define organizing “ascending” with arbitrary types of the signature?

    It uses the natural ordering of the signature values.

    If you look at the type of clusterBy, you’ll see that the result of the signature function f has the type b and that b is qualified to be a member of the Ord type class. That means that any signature function you provide to clusterBy must produce results that can be ordered (and hence sorted). The sorting itself happens implicitly: Data.Map stores keys in an ordered tree and its elems function is defined to “return all elements of the map in the ascending order of their keys.”

    Thanks for your comment!

    Cheers,
    Tom

  4. augustss said 1 day later:
    You can also do
    ucombos xs = [[x,y] | x:ys <- tails xs, y <- ys]
    
  5. George said 1 day later:

    The same semantics are in turn implementable in terms of two other functions that I continually find helpful:

    equalBy f x y = (f x) == (f y)
    clusterBy f = groupBy (equalBy f)
    

    I suspect this is less efficient, though.

  6. George said 1 day later:

    Oops. What I wrote isn’t true – groupBy won’t put a pair of elements that satisfies the predicate into the same group if they are separated by an element that belongs in a different group.

    In other words, order matters to groupBy but not to clusterBy.

  7. Porges said 59 days later:

    Strangely enough, I wrote something similar the day before yesterday… I was making a map from morse values (lists of dit|dah) to strings.

    createDict :: [String] -> Map.Map [Morse] String
    createDict xs = Map.fromListWith (\ a b -> a ++ "/" ++ b)
                                     (map (wordToMorse &&& id) xs)
    

    Seems like clusterBy could be quite a common method!

  8. Porges said 59 days later:

    Whoops, that was a later version. The earlier one had just (++) instead of that lambda.

  9. Slobodan said 60 days later:

    ClusterBy is very nice, my common lisp solution http://paste.lisp.org/display/50081

    cheers Slobodan Blazeski

  10. Nick said 60 days later:

    If you fromListWith (flip (++)), you shouldn’t have to map reverse, iiuc.

  11. Tom Moertel said 61 days later:

    Nick, the problem with using flip (++) as the combining function for Map.fromListWith is that it results in quadratic run-time w.r.t. cluster size. By making sure the new element (which is passed as the first argument to the combining function) is always added to the head of the list, we are able to implement cluster extension with a single cons operation—O(1).

    If you wanted to keep the low cost and eliminate the reverse step, you could use Data.Sequence values instead of lists to represent clusters. For clarity, however, I just used lists.

    Cheers,
    Tom

Trackbacks

Use the following link to trackback from your own site:
http://blog.moertel.com/articles/trackback/562

(leave url/email »)

   Comment Markup Help Preview comment