<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/stylesheets/rss.css" type="text/css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Tom Moertel's Weblog: Tag parsing</title>
    <link>http://blog.moertel.com/articles/tag/parsing?tag=parsing</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Quality rants on programming theory and stuff geeks like</description>
    <item>
      <title>Power parsing with Haskell and Parsec</title>
      <description>&lt;p&gt;One of the projects I&amp;#8217;m working on is a language to help researchers
manipulate genetics information.  Despite all the well-publicized
advances in genetics, researchers still spend about a third of their
time writing shell, awk, and Perl scripts to manipulate their data.
If researchers can get some of this time back, they can use it to
think about more interesting problems, such as curing cancer.&lt;/p&gt;&lt;p&gt;The existing tools do get the job done, but not efficiently.  For
example, the bulk of the data sets my friends work with are tabular,
and none of the aforementioned tools support vector or table
operations natively.  So my friends end up shuffling the data between
these tools and &lt;a href="http://www.r-project.org"&gt;R&lt;/a&gt; and Excel.  (These guys
have huge Apple Cinema Displays that make massive spreadsheets only
slightly more manageable.)&lt;/p&gt;


	&lt;p&gt;So I have invested much thought in designing the syntax of
&lt;acronym title="Genetics Information Manipulation Language"&gt;GIML&lt;/acronym&gt; to make slicing and
dicing tabular data easy.  I&amp;#8217;ll compare &lt;span class="caps"&gt;GIML&lt;/span&gt; with &lt;span class="caps"&gt;SQL&lt;/span&gt; for
illustration.&lt;/p&gt;


	&lt;table&gt;
		&lt;tr&gt;
			&lt;th&gt;&lt;span class="caps"&gt;GIML&lt;/span&gt; &lt;/th&gt;
			&lt;th&gt;&lt;span class="caps"&gt;SQL&lt;/span&gt; &lt;/th&gt;
			&lt;th&gt;Description &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;&lt;code&gt;t[x&amp;gt;3]&lt;/code&gt;
&lt;/td&gt;
			&lt;td&gt;&lt;code&gt;select * from t where x&amp;gt;3&lt;/code&gt;
&lt;/td&gt;
			&lt;td&gt; selection &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;&lt;code&gt;t$(x,y)&lt;/code&gt;
&lt;/td&gt;
			&lt;td&gt;&lt;code&gt;select x,y from t&lt;/code&gt;
&lt;/td&gt;
			&lt;td&gt; projection &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;&lt;code&gt;t[x&amp;gt;3]$(x,y)&lt;/code&gt;&amp;nbsp;
&lt;/td&gt;
			&lt;td&gt;&lt;code&gt;select x,y from t where x&amp;gt;3&lt;/code&gt;&amp;nbsp;
&lt;/td&gt;
			&lt;td&gt; selection + projection &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




	&lt;p&gt;What makes &lt;span class="caps"&gt;GIML&lt;/span&gt;&amp;#8217;s table operations particularly useful is that they
can be mixed into expressions.  You can, for example, grab a column of
numbers from a table (project the column into a vector) and then
perform arithmetic operations on it:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;t[x&amp;gt;3]$radius * 2&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;That makes parsing tricky because selection and projection
have special syntaxes and yet they must be mixed into the
normal expression grammar with numbers, strings, and so on.&lt;/p&gt;


	&lt;p&gt;To see why this is tricky, consider a simple grammar for arithmetic
expressions composed of additions and multiplications.  Multiplication
has higher precedence than addition, and so we might write the grammar in
pseudo &lt;acronym title="Backus-Naur Form"&gt;BNF&lt;/acronym&gt; as follows:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr  = add
add   = add "+" times  | times
times = times "*" term | term
term  = NUMBER
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Notice that the expression grammar tries to parse the
lowest-precedence infix sub-expressions (additions) first and then
works toward the higher-precedence infix sub-expressions
(multiplications) before finally parsing atomic terms (numbers), which
bind most tightly of all.&lt;/p&gt;


	&lt;p&gt;Given an expression,&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;1 + 2 * 3
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;our grammar will parse it like so because of our implicit precedence
rules:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;1 + (2 * 3)
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;But what if we want a way to perform addition first sometimes?  Our
parser, as is, won&amp;#8217;t let us.  We need a way to override the grammar&amp;#8217;s
implicit precedence rules.  The most common solution is to allow
sub-expressions to be promoted to the highest precedence by wrapping
them in parentheses.  We can handle this in our grammar by extending
the &lt;em&gt;term&lt;/em&gt; rule:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;term  = NUMBER | "(" expr ")" 
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Now we can add parentheses to our expressions, and &lt;code&gt;(1 + 2) *
3&lt;/code&gt; will parse just fine.&lt;/p&gt;


	&lt;p&gt;Well, actually, it won&amp;#8217;t.  A recursive-descent parser built from this
grammar will hang because the grammar is left-recursive: it can recurse
without consuming any input.  Consider the rule for &lt;em&gt;add&lt;/em&gt;:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;add = add "+" times | times
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;The first production tries to match another &lt;em&gt;add&lt;/em&gt; sub-expression
because we want to be able to parse expressions involving chained
additions like &lt;code&gt;1+2+3+4&lt;/code&gt;.  But when trying to match that
sub-expression, we&amp;#8217;re just going to recurse again and again &amp;#8211; forever.
The solution is to eliminate the left recursion by rewriting the rule:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;add  = times addx
addx = "+" times addx | NOTHING
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;The &lt;em&gt;addx&lt;/em&gt; rule does indeed recurse, but not until it has consumed a
plus sign, and so our grammar won&amp;#8217;t recurse forever.  When it runs out
of plus signs to eat, it will stop recursing.&lt;/p&gt;


	&lt;p&gt;Eliminating the left recursion throughout our grammar, here&amp;#8217;s what we get:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr   = add
add    = times addx
addx   = "+" times addx | NOTHING
times  = term timesx
timesx = "*" term timesx | NOTHING
term   = NUMBER | "(" expr ")" 
&lt;/code&gt;&lt;/pre&gt;
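
	&lt;p&gt;As a rough sketch, here is one way this grammar might be transcribed by
hand, rule for rule, using only the basic combinators of Parsec (the Haskell
library this article turns to further down).  The module names assume a
recent Parsec release, and the &lt;em&gt;number&lt;/em&gt; rule and the accumulator
arguments of &lt;em&gt;addx&lt;/em&gt; and &lt;em&gt;timesx&lt;/em&gt; are additions made for
illustration; whitespace is not handled.&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;import Text.Parsec
import Text.Parsec.String (Parser)

-- expr = add
expr :: Parser Integer
expr = add

-- add = times addx
add :: Parser Integer
add = times &amp;gt;&amp;gt;= addx

-- addx = "+" times addx | NOTHING
-- The argument carries the sum so far, which keeps "+" left-associative.
addx :: Integer -&amp;gt; Parser Integer
addx acc = (char '+' &amp;gt;&amp;gt; times &amp;gt;&amp;gt;= \t -&amp;gt; addx (acc + t)) &amp;lt;|&amp;gt; return acc

-- times = term timesx
times :: Parser Integer
times = term &amp;gt;&amp;gt;= timesx

-- timesx = "*" term timesx | NOTHING
timesx :: Integer -&amp;gt; Parser Integer
timesx acc = (char '*' &amp;gt;&amp;gt; term &amp;gt;&amp;gt;= \t -&amp;gt; timesx (acc * t)) &amp;lt;|&amp;gt; return acc

-- term = NUMBER | "(" expr ")"
term :: Parser Integer
term = number &amp;lt;|&amp;gt; between (char '(') (char ')') expr

-- NUMBER: one or more digits
number :: Parser Integer
number = read &amp;lt;$&amp;gt; many1 digit

main :: IO ()
main = do
  parseTest expr "1+2*3"     -- prints 7
  parseTest expr "(1+2)*3"   -- prints 9
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Every recursive call in &lt;em&gt;addx&lt;/em&gt; and &lt;em&gt;timesx&lt;/em&gt; sits behind a
consumed operator character, which is why this version terminates.&lt;/p&gt;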

	&lt;p&gt;Now let&amp;#8217;s try to add selection to this grammar.  (We will ignore that
this move is semantically absurd for an algebraic-expression grammar.)
The selection syntax we want is &lt;em&gt;d&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;], where &lt;em&gt;d&lt;/em&gt; and &lt;em&gt;e&lt;/em&gt; are
expressions, and so our selection rule looks as follows:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;select = expr "[" expr "]" 
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Great.  Now all we have to do is connect this rule to the rest of our
grammar.  The most natural place to do so is at &lt;em&gt;term&lt;/em&gt;, but look at
what happens when we do that:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr   = add
add    = times addx
addx   = "+" times addx | NOTHING
times  = term timesx
timesx = "*" term timesx | NOTHING
term   = NUMBER | "(" expr ")" | select
select = expr "[" expr "]" 
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Do you see the problem?  Our &lt;em&gt;select&lt;/em&gt; rule recurses (via &lt;em&gt;expr&lt;/em&gt;)
without consuming input.  If we connect it at &lt;em&gt;term&lt;/em&gt;, then, our
parser might get stuck in infinite recursion.&lt;/p&gt;


	&lt;p&gt;A tempting alternative is to attach our select rule at the root of the
grammar, in effect making an expression either a selection or an infix
sub-expression:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr   = select | infix
select = infix "[" expr "]" 
infix  = add
add    = times addx
addx   = "+" times addx | NOTHING
times  = term timesx
timesx = "*" term timesx | NOTHING
term   = NUMBER | "(" expr ")" 
&lt;/code&gt;&lt;/pre&gt;

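	&lt;p&gt;The problem with this solution is that a selection can now appear only
as a whole expression, never as an operand of an infix operator, so we
cannot write expressions like this:&lt;/p&gt;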


&lt;pre&gt;&lt;code&gt;6[3] * 2
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;In order to mix selections into infix expressions, we must
parenthesize them, letting the parser see them as
infix-friendly &lt;em&gt;terms&lt;/em&gt;:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;(6[3]) * 2
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;But this &amp;#8220;solution&amp;#8221; puts the burden of carrying our grammar&amp;#8217;s
limitations upon the users of our little language. The syntactical
price for selections has increased from &lt;em&gt;d&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;] to (&lt;em&gt;d&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;]),
which runs counter to the goal of creating an ideal syntax for people
who might use selections frequently.  More typing means less good.&lt;/p&gt;


	&lt;p&gt;So what should we do?  The solution is to break apart the &lt;em&gt;select&lt;/em&gt;
rule and place the resulting pieces into a new layer of precedence between
&lt;em&gt;times&lt;/em&gt; and &lt;em&gt;term&lt;/em&gt;:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr    = add
add     = times addx
addx    = "+" times addx | NOTHING
times   = select timesx
timesx  = "*" select timesx | NOTHING
select  = term selectx
selectx = "[" expr "]" | NOTHING
term    = NUMBER | "(" expr ")" 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now we can parse &lt;code&gt;6[3] * 2&lt;/code&gt; as expected.  Notice that this
implementation gives selection a precedence.  Where before our syntax
was &lt;em&gt;d&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;], now it is &lt;em&gt;t&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;], where &lt;em&gt;t&lt;/em&gt; is a term.  Thus
&lt;code&gt;6*2[3]&lt;/code&gt; will be parsed as &lt;code&gt;6*(2[3])&lt;/code&gt;.  If we
want a different parsing, we can supply parentheses to make it happen:
&lt;code&gt;(6*2)[3]&lt;/code&gt;.&lt;/p&gt;

	&lt;p&gt;Look at how our grammar handles addition and multiplication and then
look at selection.  There&amp;#8217;s a pattern, definitely, in all three, but
our &lt;em&gt;select&lt;/em&gt; rule differs subtly.  Whereas addition and multiplication
are expressed via infix operators, selection occurs via a suffix
operator (often called a &amp;#8220;postfix&amp;#8221; operator).  To see it more clearly,
let&amp;#8217;s refactor our grammar again:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;expr    = expr1

expr1   = pfx1 infix1 sfx1
pfx1    = NOTHING
sfx1    = NOTHING
infix1  = expr2 infix1x
infix1x = "+" expr2 infix1x | NOTHING

expr2   = pfx2 infix2 sfx2
pfx2    = NOTHING
sfx2    = NOTHING
infix2  = expr3 infix2x
infix2x = "*" expr3 infix2x | NOTHING

expr3   = pfx3 infix3 sfx3
pfx3    = NOTHING
sfx3    = select
infix3  = term  # not truly infix
select  = "[" expr "]" | NOTHING

term    = NUMBER | "(" expr ")" 
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;This refactoring is equivalent to the previous version of our grammar
but puts each level of precedence into a clearly visible layer whose
prefix, infix, and suffix components are made explicit.&lt;/p&gt;


	&lt;p&gt;Each layer is a sub-grammar that follows a common pattern.  First, it
tries to match a prefix.  Then it tries to match an infix
subexpression comprising one or more higher-precedence subexpressions joined
by infix operators.  Finally it tries to match a suffix.  Each layer
&amp;#8220;chains&amp;#8221; to the next, and the entire chain forms the grammar for our
expression language.&lt;/p&gt;


	&lt;p&gt;Once we become familiar with this pattern for linking together a
chain of expression-grammar layers, we can write a small program to do
it for us.  We can give this program a list of layers, each comprising
a set of prefix, infix, and suffix operators, and the program will
emit an expression grammar that chains them together for us.&lt;/p&gt;
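
	&lt;p&gt;To make the idea concrete, here is a minimal Haskell sketch of such a
program.  It is deliberately simplified: each layer lists only infix
operators (prefix and suffix operators are left out), the &lt;em&gt;term&lt;/em&gt; rule
is assumed to be written by hand, and the names &lt;em&gt;Layer&lt;/em&gt;,
&lt;em&gt;grammar&lt;/em&gt;, and &lt;em&gt;ruleName&lt;/em&gt; are made up for this illustration.&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;import Data.List (intercalate)

-- A layer is just its infix operator symbols (a simplifying assumption).
type Layer = [String]

-- Emit pseudo-BNF rules for the given layers, lowest precedence first.
grammar :: [Layer] -&amp;gt; [String]
grammar layers = concat (zipWith layerRules [1 ..] layers)
  where
    n = length layers

    -- Layers are named expr1, expr2, ...; past the last layer comes term.
    ruleName :: Int -&amp;gt; String
    ruleName i
      | i &amp;gt; n     = "term"
      | otherwise = "expr" ++ show i

    -- Each layer becomes a rule plus its recursive "x" helper rule.
    layerRules i ops =
      [ ruleName i ++ "  = " ++ next ++ " " ++ helper
      , helper ++ " = " ++ intercalate " | " (map alt ops ++ ["NOTHING"])
      ]
      where
        next   = ruleName (i + 1)
        helper = ruleName i ++ "x"
        alt op = show op ++ " " ++ next ++ " " ++ helper

main :: IO ()
main = mapM_ putStrLn (grammar [["+"], ["*"]])
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Run on the two layers shown, it prints an addition layer and a
multiplication layer whose rules have the same shape as the
&lt;em&gt;add&lt;/em&gt;/&lt;em&gt;addx&lt;/em&gt; and &lt;em&gt;times&lt;/em&gt;/&lt;em&gt;timesx&lt;/em&gt; pairs above.&lt;/p&gt;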


	&lt;p&gt;The &lt;a href="http://www.cs.uu.nl/~daan/parsec.html"&gt;Parsec&lt;/a&gt;
monadic-parser-combinator library for the
&lt;a href="http://www.haskell.org"&gt;Haskell&lt;/a&gt; programming language has a function
to do just this: &lt;a href="http://www.cs.uu.nl/~daan/download/parsec/parsec.html#buildExpressionParser"&gt;buildExpressionParser&lt;/a&gt;.  (In fact, this function even lets us define
the associativity of our operators.) With it, we could build a parser
for our grammar &amp;#8211; and make it evaluate our arithmetic
expressions on the fly! &amp;#8211; like this:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varid'&gt;expr&lt;/span&gt;    &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;buildExpressionParser&lt;/span&gt; &lt;span class='varid'&gt;optable&lt;/span&gt; &lt;span class='varid'&gt;term&lt;/span&gt;
&lt;span class='varid'&gt;optable&lt;/span&gt; &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt; &lt;span class='conid'&gt;Postfix&lt;/span&gt; &lt;span class='varid'&gt;select&lt;/span&gt; &lt;span class='keyglyph'&gt;]&lt;/span&gt;  &lt;span class='comment'&gt;-- highest precedence first&lt;/span&gt;
          &lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt; &lt;span class='conid'&gt;Infix&lt;/span&gt; &lt;span class='varid'&gt;times&lt;/span&gt; &lt;span class='conid'&gt;AssocLeft&lt;/span&gt; &lt;span class='keyglyph'&gt;]&lt;/span&gt;
          &lt;span class='layout'&gt;,&lt;/span&gt; &lt;span class='keyglyph'&gt;[&lt;/span&gt; &lt;span class='conid'&gt;Infix&lt;/span&gt; &lt;span class='varid'&gt;add&lt;/span&gt; &lt;span class='conid'&gt;AssocLeft&lt;/span&gt; &lt;span class='keyglyph'&gt;]&lt;/span&gt; 
          &lt;span class='keyglyph'&gt;]&lt;/span&gt;
&lt;span class='varid'&gt;term&lt;/span&gt;    &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;natural&lt;/span&gt; &lt;span class='varop'&gt;&amp;lt;|&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;parens&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt;
&lt;span class='varid'&gt;select&lt;/span&gt;  &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;brackets&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt; &lt;span class='varop'&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span class='keyglyph'&gt;\&lt;/span&gt;&lt;span class='varid'&gt;e&lt;/span&gt; &lt;span class='keyglyph'&gt;-&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;return&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varop'&gt;`mod`&lt;/span&gt; &lt;span class='varid'&gt;e&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='varid'&gt;times&lt;/span&gt;   &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;string&lt;/span&gt; &lt;span class='str'&gt;"*"&lt;/span&gt; &lt;span class='varop'&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;return&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varop'&gt;*&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;span class='varid'&gt;add&lt;/span&gt;     &lt;span class='keyglyph'&gt;=&lt;/span&gt; &lt;span class='varid'&gt;string&lt;/span&gt; &lt;span class='str'&gt;"+"&lt;/span&gt; &lt;span class='varop'&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;return&lt;/span&gt; &lt;span class='layout'&gt;(&lt;/span&gt;&lt;span class='varop'&gt;+&lt;/span&gt;&lt;span class='layout'&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;I should point out that the above is real, executable Haskell code.
That it looks like a grammar in &lt;span class="caps"&gt;BNF&lt;/span&gt; is a testament to Haskell&amp;#8217;s
excellent support for embedding domain-specific languages.&lt;/p&gt;


	&lt;p&gt;Okay, back to specifics.  Since selection doesn&amp;#8217;t really have a place
in our limited arithmetic grammar, I made &lt;em&gt;d&lt;/em&gt;&amp;#91;&lt;em&gt;e&lt;/em&gt;] compute (&lt;em&gt;d&lt;/em&gt;
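mod &lt;em&gt;e&lt;/em&gt;).  The &lt;em&gt;natural&lt;/em&gt;, &lt;em&gt;parens&lt;/em&gt;, and &lt;em&gt;brackets&lt;/em&gt; rules come from
Parsec&amp;#8217;s token-parser module (they are fields of the parser that
&lt;em&gt;makeTokenParser&lt;/em&gt; builds).  The first matches natural numbers.  The second and third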
take a parser &lt;em&gt;p&lt;/em&gt; and convert it into another parser that matches what
&lt;em&gt;p&lt;/em&gt; does but inside of parentheses and brackets, respectively.&lt;/p&gt;
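
	&lt;p&gt;For anyone who wants to run the parser, the sketch below wraps the
definitions above in a complete module.  It assumes a recent Parsec release
(the &lt;code&gt;Text.Parsec&lt;/code&gt; module names) and builds the token parser from
&lt;em&gt;emptyDef&lt;/em&gt;; the &lt;em&gt;lexer&lt;/em&gt; name and the type signatures are
additions for illustration, not part of the original listing.&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;import Text.Parsec
import Text.Parsec.Expr
import Text.Parsec.Language (emptyDef)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

-- A token parser with default settings; natural, parens, and brackets
-- are fields of this record.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

natural :: Parser Integer
natural = Tok.natural lexer

parens, brackets :: Parser a -&amp;gt; Parser a
parens   = Tok.parens lexer
brackets = Tok.brackets lexer

expr :: Parser Integer
expr = buildExpressionParser optable term

-- Operator table, highest precedence first.
optable = [ [ Postfix select ]
          , [ Infix times AssocLeft ]
          , [ Infix add AssocLeft ]
          ]

term :: Parser Integer
term = natural &amp;lt;|&amp;gt; parens expr

-- d[e] computes d mod e, as in the article.
select :: Parser (Integer -&amp;gt; Integer)
select = brackets expr &amp;gt;&amp;gt;= \e -&amp;gt; return (`mod` e)

times, add :: Parser (Integer -&amp;gt; Integer -&amp;gt; Integer)
times = string "*" &amp;gt;&amp;gt; return (*)
add   = string "+" &amp;gt;&amp;gt; return (+)

main :: IO ()
main = do
  parseTest expr "1+2*2[3]"
  parseTest expr "(1+2*2)[3]"
&lt;/code&gt;&lt;/pre&gt;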


	&lt;p&gt;Trying out our &lt;em&gt;expr&lt;/em&gt; parser using &lt;em&gt;parseTest&lt;/em&gt;, a
parse-and-print-the-result wrapper provided as part of Parsec, shows
that it does indeed do what we want:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_haskell "&gt;&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;parseTest&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt; &lt;span class='str'&gt;"1+2"&lt;/span&gt;
&lt;span class='num'&gt;3&lt;/span&gt;
&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;parseTest&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt; &lt;span class='str'&gt;"1+2*2"&lt;/span&gt;
&lt;span class='num'&gt;5&lt;/span&gt;
&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;parseTest&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt; &lt;span class='str'&gt;"1+2*2[3]"&lt;/span&gt;
&lt;span class='num'&gt;5&lt;/span&gt;
&lt;span class='varop'&gt;&amp;gt;&lt;/span&gt; &lt;span class='varid'&gt;parseTest&lt;/span&gt; &lt;span class='varid'&gt;expr&lt;/span&gt; &lt;span class='str'&gt;"(1+2*2)[3]"&lt;/span&gt;
&lt;span class='num'&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;The point of all this is to show just how much power and convenience
Haskell and Parsec give you for free.  What once was a tedious job
requiring careful derivation and refactoring is now easy: just fill
out an operator table and give it to &lt;em&gt;buildExpressionParser&lt;/em&gt;.
Countless opportunities for subtle errors have been eliminated.  The
meaning of the parser is clear; the operators are listed in order of
decreasing precedence, and their associativity is plain as day.&lt;/p&gt;


	&lt;p&gt;If you are writing a parser, you ought to check out Parsec.
I wouldn&amp;#8217;t do it any other way.&lt;/p&gt;</description>
      <pubDate>Sat, 27 Aug 2005 17:12:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:98638100aee77308b3edfefdd8bca490</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2005/08/27/power-parsing-with-haskell-and-parsec</link>
      <category>functional programming</category>
      <category>haskell</category>
      <category>parsec</category>
      <category>parsing</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/2</trackback:ping>
    </item>
  </channel>
</rss>
