Although at work I code mostly in Python – a language from which lambda and map were nearly removed – I still find that functional-programming experience has its benefits. One of the “functional-programming dividends” I notice most often is insight gained from considering problems from an algebraic perspective.
Recently, for example, I had a small parsing problem. I had to write code that, given a simple grammar specification as input, emits a regular expression that matches the language generated by the grammar.
Here’s a simplified version of the problem. A grammar specification is limited to a series of one or more atoms. For example, “a b c” represents the atom “a”, followed by the atom “b”, followed by the atom “c”. To generate the grammar, the series of atoms is interpreted such that each atom (except the last) generates a production rule of the following form:
atom_rule ::= <the literal atom> (SPACE <the next rule> | NOTHING)
(SPACE represents literal white space and NOTHING represents an empty string.) The grammar as a whole is rooted in the first atom’s rule.
Thus the specification “a b c” represents the following grammar:
grammar ::= a_rule a_rule ::= "a" (SPACE b_rule | NOTHING) b_rule ::= "b" (SPACE c_rule | NOTHING) c_rule ::= "c"
Note that the final atom’s production matches only the literal atom itself: it has no following rule on which to chain.
The grammar, in turn, generates the following language:
a a b a b c
Thus, given the grammar specification “a b c”, my code had to produce a regular expression that would match “a”, “a b”, or “a b c”.
At this point, please stop for a moment and think about this little programming exercise. Do any solutions leap to mind? How would you approach the problem? Form your opinions now, because I’m going to ask you about them later. (If you’re feeling especially caffeinated, try coding a solution before reading on.)
For me, the insight that made the exercise easy was seeing that the grammar is given by folding a (suitably defined) right-associative binary operator through the series of atoms. The relationship might be easier to see if you substitute away the intermediate production rules from the grammar above:
grammar ::= "a" (SPACE "b" (SPACE "c" | NOTHING) | NOTHING)
If you squint past the SPACE and NOTHING terms, you’ll see that the grammar has the form
(a + (b + (c)))
The + is a binary operator that generates the parts we squinted away. Once you see what’s going on structurally, the operator is easy to define:
x + y = x (SPACE y | NOTHING)
Compare the operator’s definition with that of the
atom_rule I presented at the beginning of the article. They’re structurally the same: the operator’s x and y are the atom rule’s
<the literal atom> and
<the next rule>.
Now all that remains is to generalize the “a b c” formula into a general formula that works for arbitrary grammar specifications. Fortunately, this work has already been done for us. The generalized formula is nothing more than a right fold. In Haskell, the particular right-fold flavor we want is called foldr1.
Given a list of atoms, we can use foldr1 to construct its grammar as follows:
mk_grammar atoms = foldr1 (+) atoms
But Python, our implementation language, does not offer a foldr1 function. This wrinkle, however, is another thing we can iron out by thinking algebraically. Python doesn’t have foldr1, but it does have a reduce function, which represents a left fold, equivalent to Haskell’s foldl’ or foldl1’. Because our + operator is strict and our list of atoms is finite, we can take advantage of the following identity:
foldr1 (+) xs == foldl1 (flip (+)) (reverse xs)
That is, we can convert a right fold into a left fold by flipping the arguments of the operator and operating on the list in reverse. Thus we can implement the fold we want in terms of the fold we have:
# Python code def foldr1(f, xs): return reduce(flip(f), reversed(xs)) def flip(f): def g(x, y): return f(y, x) return g
Now writing a Python-based solution is straightforward:
def grammar_spec_to_re(spec): atoms = grammar_spec_to_atoms(spec) atom_res = map(atom_to_re, atoms) grammar_re = r'\A%s\Z' % foldr1(op, atom_res) return grammar_re def op(x, y): # x + y = x (SPACE y | NOTHING) return r'%s(\s+%s)?' % (x, y) def grammar_spec_to_atoms(spec): return spec.split() def atom_to_re(atom): return re.escape(atom)
Using our solution, let’s compile the “a b c” grammar specification into its corresponding regular expression:
>>> print grammar_spec_to_re('a b c') \Aa(\s+b(\s+c)?)?\Z
And that’s basically how I solved the problem.
To play around with the solution, here’s a small helper class that compiles a grammar specification into a regular expression and then tests strings for matching the regexp:
class GrammarMatcher(object): def __init__(self, spec): self.re = re.compile(grammar_spec_to_re(spec)) def __call__(self, s): return not not self.re.match(s)
Now, let’s try out the regular expression generated for the grammar specification “a b c” :
>>> matcher = GrammarMatcher('a b c') >>> matcher('') False >>> matcher('a') True >>> matcher('ab') False >>> matcher('a b') True >>> matcher('a b') True >>> matcher('a b c') True >>> matcher('a b c d') False >>> matcher('a c') False >>> matcher('b') False
Now, those questions I promised. If you’re a functional programmer, did a fold-based solution leap out at you? (Did you think of the problem in terms of folds?) If you’re not a functional programmer, how did you see the problem? Did the solution above seem twisted, confusing, or overly clever?
(There are no right or wrong answers. I’m just curious about how people with different backgrounds view the problem.)