Power parsing with Haskell and Parsec

By
Posted on
Tags: haskell, parsec, parsing

One of the projects I’m working on is a language to help researchers manipulate genetics information. Despite all the well-publicized advances in genetics, researchers still spend about a third of their time writing shell, awk, and Perl scripts to manipulate their data. If researchers can get some of this time back, they can use it to think about more interesting problems, like curing cancer and stuff like that.

The existing tools do get the job done, but not efficiently. For example, the bulk of the data sets my friends work with are tabular, and none of the aforementioned tools support vector or table operations natively. So my friends end up shuffling the data between these tools and R and Excel. (These guys have huge Apple Cinema Displays that make massive spreadsheets only slightly more manageable.)

So I have invested much thought in designing the syntax of GIML (Genetics Information Manipulation Language) to make slicing and dicing tabular data easy. I’ll compare GIML with SQL for illustration.

GIML SQL Description
t[x>3] select * from t where x>3 selection
t$(x,y) select x,y from t projection
t[x>3]$(x,y)  select x,y from t where x>3  selection + projection

What makes GIML’s table operations particularly useful is that they can be mixed into expressions. You can, for example, grab a column of numbers from a table (project the column into a vector) and then perform arithmetic operations on it:

t[x>3]$radius * 2

That makes parsing tricky because selection and projection have special syntaxes and yet they must be mixed into the normal expression grammar with numbers, strings, and so on.

To see why this is tricky, consider a simple grammar for arithmetic expressions composed of additions and multiplications. Multiplication has the highest precedence, and so we might write the grammar in pseudo BNF (Backus-Naur Form) as follows:

expr  ::= add
add   ::= add "+" times  | times
times ::= times "*" term | term
term  ::= NUMBER

Notice that the expression grammar starts by trying to parse the lowest-precedence infix sub-expressions (additions) first and then works toward the highest-precedence infix sub-expressions (multiplications) before finally parsing atomic terms (numbers), which have the highest precedence.

Given an expression,

1 + 2 * 3

our grammar will parse it like so because of our implicit precedence rules:

1 + (2 * 3)

But what if we want a way to perform addition first sometimes? Our parser, as is, won’t let us. We need a way to override the grammar’s implicit precedence rules. The most common solution is to allow sub-expressions to be promoted to the highest precedence by wrapping them in parentheses. We can handle this in our grammar by extending the term rule:

term  ::= NUMBER | "(" expr ")"

Now we can add parentheses to our expressions, and (1 + 2) * 3 will parse just fine.

Well, actually, it won’t. Our parser will hang because our grammar can recurse without consuming any input. Consider the rule for add:

add ::= add "+" times | times

The first production tries to match another add sub-expression because we want to be able to parse expressions involving chained additions like 1+2+3+4. But when trying to match that sub-expression, we’re just going to recurse again and again – forever. The solution is to left factor the grammar:

add  ::= times addx
addx ::= "+" times addx | NOTHING

The addx rule does indeed recurse, but not until it has consumed a plus sign, and so our grammar won’t recurse forever. When it runs out of plus signs to eat, it will stop recursing.

Fully left-factoring our grammar, here’s what we get:

expr   ::= add
add    ::= times addx
addx   ::= "+" times addx | NOTHING
times  ::= term timesx
timesx ::= "*" term timesx | NOTHING
term   ::= NUMBER | "(" expr ")"

Now let’s try to add selection to this grammar. (We will ignore that this move is semantically absurd for an algebraic-expression grammar.) The selection syntax we want is d[e], where d and e are expressions, and so our selection rule looks as follows:

select ::= expr "[" expr "]"

Great. Now all we have to do is connect this rule to the rest of our grammar. The most natural place to do so is at term, but look at what happens when we do that:

expr   ::= add
add    ::= times addx
addx   ::= "+" times addx | NOTHING
times  ::= term timesx
timesx ::= "*" term timesx | NOTHING
term   ::= NUMBER | "(" expr ")" | select
select ::= expr "[" expr "]"

Do you see the problem? Our select rule recurses (via expr) without consuming input. If we connect it at term, then, our parser might get stuck in infinite recursion.

A tempting alternative is to attach our select rule at the root of the grammar, in effect making an expression either a selection or an infix sub-expression:

expr   ::= select | infix
select ::= infix "[" expr "]"
infix  ::= add
add    ::= times addx
addx   ::= "+" times addx | NOTHING
times  ::= term timesx
timesx ::= "*" term timesx | NOTHING
term   ::= NUMBER | "(" expr ")"

The problem with this solution is that we cannot write expressions like this:

6[3] * 2

In order to mix selections into infix expressions, we must parenthesize them, letting the parser see them as infix-friendly terms:

(6[3]) * 2

But this “solution” puts the burden of carrying our grammar’s limitations upon the users of our little language. The syntactical price for selections has increased from d[e] to (d[e]), which runs counter to the goal of creating an ideal syntax for people who might use selections frequently. More typing means less good.

So what should we do? The solution is to break apart the select rule and place the resulting pieces into a new layer of precedence between times and term:

expr    ::= add
add     ::= times addx
addx    ::= "+" times addx | NOTHING
times   ::= select timesx
timesx  ::= "*" term timesx | NOTHING
select  ::= term selectx
selectx ::= "[" expr "]" | NOTHING
term    ::= NUMBER | "(" expr ")"

Now we can parse 6[3] * 2 as expected. Notice that this implementation gives selection a precedence. Where before our syntax was d[e], now it is t[e], where t is a term. Thus 6*2[3] will be parsed as 6*(2[3]). If we want a different parsing, we can supply parentheses to make it happen: (6*2)[3].

Look at how our grammar handles addition and multiplication and then look at selection. There’s a pattern, definitely, in all three, but our select rule differs subtly. Whereas addition and multiplication are expressed via infix operators, selection occurs via a suffix operator (often called a “postfix” operator). To see it more clearly, let’s refactor our grammar again:

expr    ::= expr1

expr1   ::= pfx1 infix1 sfx1
pfx1    ::= NOTHING
sfx1    ::= NOTHING
infix1  ::= expr2 infix1x
infix1x ::= "+" expr2 infix1x | NOTHING

expr2   ::= pfx2 infix2 sfx2
pfx2    ::= NOTHING
sfx2    ::= NOTHING
infix2  ::= expr3 infix2x
infix2x ::= "*" expr3 infix2x | NOTHING

expr3   ::= pfx3 infix3 sfx3
pfx3    ::= NOTHING
sfx3    ::= select
infix3  ::= term  # not truly infix
select  ::= "[" expr "]" | NOTHING

term    ::= NUMBER | "(" expr ")"

This refactoring is equivalent to the previous version of our grammar but puts each level of precedence into a clearly visible layer whose prefix, infix, and suffix components are made explicit.

Each layer is a sub-grammar that follows a common pattern. First, it tries to match a prefix. Then it tries to match an infix subexpression comprising two higher-precedence subexpressions joined by an infix operator. Finally it tries to match a suffix. Each layer “chains” to the next, and the entire chain forms the grammar for our expression language.

Once we become familiar with this pattern for linking together a chain of expression-grammar layers, we can write a small program to do it for us. We can give this program a list of layers, each comprising a set of prefix, infix, and suffix operators, and the program will emit an expression grammar that chains them together for us.

The Parsec monadic-parser-combinator library for the Haskell programming language has a function to do just this: buildExpressionParser. (In fact, this function even lets us define the associativity of our operators.) With it, we could build a parser for our grammar – and make it evaluate our arithmetic expressions on the fly! – like this:

expr    = buildExpressionParser optable term
optable = [ [ Postfix select ]  -- highest precedence first
          , [ Infix times AssocLeft ]
          , [ Infix add AssocLeft ]
          ]
term    = natural <|> parens expr
select  = brackets expr >>= \e -> return (`mod` e)
times   = string "*" >> return (*)
add     = string "+" >> return (+)

I should point out that the above is real, executable Haskell code. That it looks like a grammar in BNF is a testament to Haskell’s excellent support for embedding domain-specific languages.

Okay, back to specifics. Since selection doesn’t really have a place in our limited arithmetic grammar, I made d[e] compute (d mod e). The natural, parens, and brackets rules are supplied by Parsec. The first matches natural numbers. The second and third take a parser p and convert it into another parser that matches what p does but inside of parentheses and brackets, respectively.

Trying out our expr parser using parseTest, a parse-and-print-the-result wrapper provided as part of Parsec, shows that it does indeed do what we want:

> parseTest expr "1+2"
3
> parseTest expr "1+2*2"
5
> parseTest expr "1+2*2[3]"
5
> parseTest expr "(1+2*2)[3]"
2

The point of all this is to show just how much power and convenience Haskell and Parsec give you for free. What once was a tedious job requiring careful derivation and refactoring is now easy: just fill out an operator table and give it to buildExpressionParser. Countless opportunities for subtle errors have been eliminated. The meaning of the parser is clear; the operators are listed in order of decreasing precedence, and their associativity is plain as day.

If you are writing a parser, you ought to check out Parsec. I wouldn’t do it any other way.