PXSL Tools 1.0: Your ticket out of XML Hell

Posted by Tom Moertel Tue, 18 Dec 2007 03:33:00 GMT

XML is fine for representing document-like things, but when it’s twisted to represent build recipes, configuration files, and little programming languages, it opens the gates to XML Hell. Once the gates are opened, the demons of cargo-cult thinking are loosed upon the world, where they are free to trick innocent programmers into working with grotesquely twisted XML documents – something no human mind was designed to comprehend. Ensnared, these programmers are slowly drawn into the depths of XML Hell, from which their lamentations echo across the universe.

When the demons of cargo-cult thinking come for you, don’t be ensnared! Instead, be prepared – with PXSL – the Parsimonious XML Shorthand Language (pronounced “pixel”).

What’s PXSL? It’s a luxurious, thermonuclear smoking jacket that you can slip on using a convenient preprocessor. Use it whenever you see grotesque XML on the horizon. Within PXSL’s plush (and stylish) protection, you can create all the nasty, twisted XML that may be demanded of you, but you need not descend into XML Hell to do it. Instead, you can work from the comfort of a well-stocked lounge, where clarity and conciseness are always on tap.

For example, here’s a snippet from an XSLT stylesheet, in the original XML:

<xsl:template match="/">
  <xsl:for-each select="//*/@src|//*/@href">
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

And here’s the same snippet, written in PXSL:

template /
  for-each //*/@src|//*/@href
    value-of .
    text <<&#10;>>

Isn’t that refreshing?

Why PXSL?

There are lots of XML shorthands available. (The PXSL FAQ lists about ten of them.) So why choose PXSL? Here’s why:

Also, PXSL is battle tested. It was first released in 2003 and has been saving people from XML Hell since. People who try it seem to like it:

  • I think PXSL could do wonders for soothing my irrational hatred for all things XML.kowey
  • Impressive… I converted some of my files from XML to PXSL and the readability was much improved.chris
  • Quite aside from the fact that XSLT is finally somewhat readable, the fact that you’ve added a serious macro system means that some serious scripting of XML can occur. I’m very impressed.invisible

The next time you’re headed for XML Hell, why not give the venerable PXSL a try? You might just find that you like it, too.


This public service announcement was brought to you in celebration of the 1.0 release of the pxsl-tools package. The PXSL-to-XML compiler pxslcc is written in Haskell and uses the cross-platform Haskell Cabal build/package system to let you use PXSL just about anywhere.

Posted in
Tags , , ,
6 comments
no trackbacks
Reddit Delicious

How I stopped missing Darcs and started loving Git

Posted by Tom Moertel Mon, 10 Dec 2007 21:52:00 GMT

About three years ago, I switched to Darcs as my primary source-code management system. It was simple, intuitive, and powerful, and it made managing my projects more fun and less frustrating than any centralized VCS ever had. That it was written in Haskell, one of my favorite programming languages, made it even better. I was hooked.

Since then, the distributed SCM landscape has changed. Darcs hasn’t improved much, but its competitors have made long strides, especially Git and Mercurial. Both are crazy fast, vigorously developed, and widely used on large, highly active real-world projects, such as the Linux kernel and Mozilla 2. In comparison, Darcs has stagnated.

When I started working for a new company recently, I had to consider whether to advocate Darcs or something else. In the end, I decided that Darcs would be a hard sell. Nobody else at the company uses Haskell, and having to explain how to avoid the occasional corner case seemed liked a losing proposition.

After researching and playing around with Git and Mercurial, I settled on Git. I like Git’s underlying hashed-blobs model better than Mercurial’s revlogs, and Git seems to have slightly more development momentum. Still, it was a close call. Either choice would have been completely reasonable.

Missing Darcs

When I started using Git on real projects, the one thing I really missed was the ability to easily amend earlier patches, something Darcs made trivial. Let me explain. The typical development workflow goes something like this:

  1. Checkout copy of upstream code base.
  2. Implement feature X.
  3. Commit.
  4. Implement independent feature Y.
  5. Commit.
  6. Implement independent feature Z.
  7. Commit.
  8. Push new features back upstream.

Now, what really happens is that when I’m implementing Y or Z, I’ll realize that I made a mistake in X. The trick is then fixing X so that my fix is part of the changeset/patch for X that ultimately gets pushed upstream in the last step. That way, the upstream folks will see only a single, clean patch for feature X – not a mishmash of patches that together represent X.

In Darcs, amending the original patch is easy because its patch theory lets me tweak the patch for X independently of the other patches. Darcs will simply ask me which patch I want to amend, and I’ll select the orignal patch for X:

$ emacs               # fix X
$ darcs amend-record  # amend original patch for X

Mon Dec 10 14:43:13 EST 2007  Tom Moertel <tom@moertel.com>
  * Implemented Z
Shall I amend this patch? [yNvpq], or ? for help: n

Mon Dec 10 14:42:12 EST 2007  Tom Moertel <tom@moertel.com>
  * Implemented Y
Shall I amend this patch? [yNvpq], or ? for help: n

Mon Dec 10 14:41:46 EST 2007  Tom Moertel <tom@moertel.com>
  * Implemented X
Shall I amend this patch? [yNvpq], or ? for help: y
hunk ./x 1
-X1
+X2
Shall I add this change? (1/?)  [ynWsfqadjkc], or ? for help: y
Finished amending patch:
Mon Dec 10 14:43:25 EST 2007  Tom Moertel <tom@moertel.com>
  * Implemented X

That’s it. The exact same process will work regardless of when I realize I need to fix X: before I start Y, while I’m implementing Y, after I’ve committed Y, while I’m working on Z, or after I’ve committed Z.

Learning to love Git

With Git, however, I can amend a commit only if I haven’t committed anything else before making my fix. In Git’s mind, Y depends on X, and Z depends on Y, even if they really are independent of one another.

So if I commit the original patch for X and then immediately realize I need to make a fix, before I start working on Y or Z, it’s easy:

$ emacs               # implement X
$ git commit -m 'Implemented X'

# discover problem in X

$ emacs               # fix X
$ git commit --amend  # amend original patch

More typically, it’s only while I’m working on Y that I’ll realize I need to fix X. Then it’s more complicated to amend the original commit:

$ emacs               # implement X
$ git commit -m 'Implemented X'
$ emacs               # start working on Y

# discover problem in X

$ git stash           # stash away half-completed work on Y
$ emacs               # fix X
$ git commit --amend  # amend original patch for X
$ git stash apply     # restore work on Y
$ emacs               # continue working on Y

While not as convenient as Darcs’s workflow, it’s perfectly workable.

Now let’s consider another fairly typical case: I commit X and Y and then start working on Z before I notice the problem in X. I used to think that Git couldn’t handle this case, but it can, thanks to git rebase --interactive:
$ emacs               # implement X
$ git commit -m 'Implemented X'
$ emacs               # implement Y
$ git commit -m 'Implemented Y'
$ emacs               # start working on Z

# discover problem in X

$ git stash           # stash away half-completed work on Z
$ emacs               # fix X
$ git commit -m 'Fixed X'
$ git rebase --interactive HEAD~3  # see comments below
$ git stash apply     # restore work on Z
$ emacs               # continue working on Z
The git rebase --interactive command is powerful. What the command does, as called in the snippet above, is invoke my editor of choice on a text file describing the last 3 commits (that’s the HEAD~3 part):
# Rebasing 3ad99a7..b9a8405 onto 3ad99a7
#
# Commands:
#  pick = use commit
#  edit = use commit, but stop for amending
#  squash = use commit, but meld into previous commit
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
pick 0885540 Implemented X
pick 320b115 Implemented Y
pick b9a8405 Fixed X

I can then edit the file to reorder, merge (squash), and/or remove the commits. In this example, I want to merge the fix for X into the original commit that implemented X. So I edit the file like so:

pick 0885540 Implemented X
squash b9a8405 Fixed X
pick 320b115 Implemented Y

Then I save the file, at which point Git takes over and makes the requested changes, merging the fix for X into the original commit for X. Now the log shows the original implementation and fix as one commit:

$ git log
commit f387d650976246c0854d028b040cca40e542be56
Author: Tom Moertel <tom@moertel.com>
Date:   Mon Dec 10 15:11:26 2007 -0500

    Implemented Y

commit 82a1c849ffd1bd688d5bc9d99be0e63548a89c4c
Author: Tom Moertel <tom@moertel.com>
Date:   Mon Dec 10 15:13:03 2007 -0500

    Implemented X

    Fixed X

commit 3ad99a7ef537b7ae99e435e0d2b4b0d03de92c65
Author: Tom Moertel <tom@moertel.com>
Date:   Mon Dec 10 15:11:14 2007 -0500

    Initial checkin

Once I figured out how to use git rebase --interactive, I stopped missing Darcs and started loving Git.

Posted in
Tags , , , ,
19 comments
no trackbacks
Reddit Delicious

Practical differences between Darcs and Git/Mercurial

Posted by Tom Moertel Thu, 25 Oct 2007 20:26:00 GMT

On the Darcs Users mailing list, I ran across an interesting thread: practical differences between darcs’ patch model and git/mercurial’s?

Among the interesting points of discussion:

  • Do the mechanics that give rise to Darcs’s strong cherry-picking abilities also make it susceptible to naughty time-complexity behavior?
  • When you merge non-conflicting changes in Git or Mercurial, you must record a merge patch, which binds the two in the development timeline, but in Darcs the respective patches are free to commute. Which behavior is better for real-world development?

If you’re interested in distributed source-code management, it’s an interesting thread to follow.

Posted in
Tags , , , ,
no comments
no trackbacks
Reddit Delicious

ClusterBy: a handy little function for the toolbox

Posted by Tom Moertel Sat, 01 Sep 2007 19:39:00 GMT

Via Reddit I found Mark Nelson’s post about a recent word puzzle from NPR’s Weekend Edition:

Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?

The puzzle is fairly straightforward to solve by hand (think about it), but let’s write a program to solve it. That will give us a convenient excuse to discuss a super-handy function I use all the time: clusterBy. In Haskell, it looks like this:

import Control.Arrow ((&&&))
import qualified Data.Map as M

clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++)
            . map (f &&& return)

What clusterBy does is group a list of values by their signatures, as computed by a given signature function f, and returns the groups in order of ascending signature. For example, we can cluster the words “the tan ant gets some fat” by length, by first letter, or by last letter just by changing the signature function we give to clusterBy:

*Main> let antwords = words "the tan ant gets some fat"

*Main> clusterBy length antwords
[["the","tan","ant","fat"],["gets","some"]]

*Main> clusterBy head antwords
[["ant"],["fat"],["gets"],["some"],["the","tan"]]

*Main> clusterBy last antwords
[["the","some"],["tan"],["gets"],["ant","fat"]]

If we use sort as the signature function, we can find anagrams:

*Main> clusterBy sort antwords
[["fat"],["tan","ant"],["gets"],["the"],["some"]]

And that brings us back to the original puzzle. To find the solution, we must consider each unique pair of state names to form a “word” and find the anagrams among a list of such “words.”

Assuming we are given a list of state names on standard input, one state per line, we can write the shell of our solution as follows:

main = mapM_ print . solve . lines =<< getContents

The shell delegates the real work to solve. It’s job is to compute the unique, 2-state combinations from the original list of states, and then find the anagrams among these combinations. As before, finding the anagrams is simply a matter of calling clusterBy with the right signature function. We also filter out the trivial results, which are not valid solutions:

solve = filter ((>1) . length) . clusterBy signature . ucombos
ucombos xs = [[x,y] | x <- xs, y <- xs, x < y]
signature = sort . filter isAlpha . concat   -- sort letters

That’s it. Now we can solve the puzzle by feeding our program a list of states:

$ runhaskell anagrams2.hs < states.txt
[["NORTH CAROLINA","SOUTH DAKOTA"],
 ["NORTH DAKOTA","SOUTH CAROLINA"]]

What a handy little function, that clusterBy.

Update: made clear that clusterBy returns clusters in order of ascending signature.

Update 2007-10-31: For more interesting discussion of clusterBy and the original puzzle from NPR, see Anders Pearson’s blog: A Simple Programming Puzzle Seen Through Three Different Lenses.

Posted in
Tags , , , ,
11 comments
no trackbacks
Reddit Delicious

Seven signs YOU may have created a Gratuitous Domain Specific Language

Posted by Tom Moertel Sat, 18 Aug 2007 17:01:00 GMT

Like chromatic, I have watched the recent irrational exuberance for domain-specific languages (DSLs) with bewilderment. In certain quarters of the programming universe, it seems that creating DSLs is nearly a rite of passage. The problem is, more and more of these DSLs appear to have been created mainly because, well, DSLs are cool these days, even if less “novel” solutions probably would have been more sensible.

Whereas chromatic unhesitatingly confronted the madness head-on, I have so far managed to avoid the fray. Sure, I’ve asked the occasional probing question of the DSL enthusiast, but mostly my reaction has been limited to standing back and staring in mute amazement at the runaway Domain-Specific Fun-Time Language Train, screaming down the tracks, destined for its inevitable high-speed derailment into what I can only expect will be a bridge abutment. But I’m starting to get the feeling that some of the train’s passengers are aboard because they think it’s the Right Thing To Do Train, so maybe it’s time to throw in my two cents.

To set the record straight, I don’t have anything against DSLs, embedded or otherwise. (I have created my fair share, some of which are actually useful.) No, my concern is limited strictly to the rise of the Gratuitous DSL. So let’s talk about it.

The reason – the right reason – for creating a DSL is because it ultimately lowers the cost of solving problems. If, then, you create a DSL and the cost of solving your problems does not go down, why did you create it? Think about it. Creating a DSL is an expensive proposition. Making people learn your DSL’s syntax, semantics, and underlying domain is a lot to ask – it’s costly. If you do ask, if you do make the imposition, you had better be sure your DSL pays its bills.

But what if your DSL turns out to be a deadbeat? What if using your DSL doesn’t lower the cost of solving problems? Well, guess what? You have created a Gratuitous Domain Specific Language.

Still unsure of whether you’re on the DSL Train for the wrong reason? No problem. Just take this simple, seven-step test:

Seven signs you may have created a Gratuitous Domain Specific Language (GDSL)

  1. You can’t actually explain what a DSL is.
  2. For your DSL, you can’t explain what the domain is.
  3. You have a hard time explaining the DSL’s syntax and semantics.
  4. You have a hard time explaining how the DSL interacts with the language it is embedded in. (For embedded DSLs only.)
  5. A vanilla library API would have captured the domain’s semantics without awkwardness.
  6. It’s easier to express complex domain concepts in general-purpose code than in your DSL.
  7. Your colleagues have a hard time writing things in your DSL.

Did more than a few of the statements ring true? If so, take a bow. You are the proud creator of a Gratuitous DSL!1

Even so, it’s not too late. You can always hop off the DSL Train at the next stop.

(Note for the humor-impaired: This post is meant to be interpreted in tongue-in-cheek fashion.)


1. Rationale for the Seven Signs. Signs 1–4 suggest that your DSL may not even be a DSL, so calling it a DSL is gratuitous. Signs 4–7 suggest that, though your DSL may be real, it may not be paying the bills; thus, creating it and foisting it upon the world was likely gratuitous.

Update: minor edit for clarity.

Update 2008-03-22: edits for clarity.

Posted in ,
Tags , , , , ,
9 comments
no trackbacks
Reddit Delicious

A bright future: security and modern type systems

Posted by Tom Moertel Wed, 15 Aug 2007 20:07:00 GMT

The recent defacement of the United Nations web site is a prime example of why we programmers shouldn’t trust ourselves to write secure code – at least not without our computers’ help. The U.N. web site, according to Slashdot’s coverage of the incident, was defaced by way of a common, well-known attack: SQL injection. What’s interesting is that programmers can render this attack harmless by employing simple, readily available programming tools such as placeholders and prepared statements. Why, then, are so many web sites, including the UN site apparently, still vulnerable?

Some say it’s because the programmers of these sites are incompetent, but that argument ignores that programmers are human, while the security tools we give them offer meaningful protection only if wielded with inhuman perfection. Having the tools to plug security holes, even if the tools are simple to use and readily available, is not enough to ensure that every single security hole will be identified, let alone plugged. Even the most experienced programmer can be expected to overlook a hole now and then. Unfortunately, one hole is all it takes.

That’s because security is not like other software-quality challenges: its costs are fundamentally asymmetric. For the attacker, the bad guy, the challenge is to find just a single exploitable hole. For us, the good guys, the challenge is to achieve perfection: to plug all of the holes in our code, every single one. That’s because attackers, unlike regular users, can be expected to probe our code until they find a hole to exploit.

How then do we ensure that we have plugged every single hole in our code? Testing isn’t sufficient: we can easily overlook holes when writing tests – a perfectly human error. We could supplement testing with code reviews, painstakingly searching for remaining holes while enforcing the use of hole-preventing best practices, but reviews are expensive and, again, subject to human error. A better approach, both less costly and more reliable, is to delegate this burden to our computers, which can do the job correctly, every single time.

This kind of delegation is possible today with modern static type systems. For example, in A Type-Based Solution to the Strings Problem, I offered a tiny “safe strings” library for the Haskell programming language. The library takes advantage of Haskell’s powerful type system to detect unsafe string interactions at compile time. If we faithfully build our code on top of the library, and our code compiles without error, we can be assured that our code is free – completely free – of SQL-injection (and XSS) holes.

While this result is indeed quite beautiful, it certainly isn’t novel. Researchers have been proving interesting properties via type systems for a long time. As Oleg Kiselyov and Chung-chieh Shan pointed out in a comment on my earlier article, the foundational idea is over three decades old.

More recently, Kiselyov and Shan have extended the idea to guarantee more-interesting properties using a trusted kernel and types that represent lightweight static capabilities. The kernel, which is small enough to be reasoned about and formally verified, carefully hands out capabilities to untrusted application code. The untrusted code, in turn, presents the capabilities back to the kernel to invoke operations, which, thanks to the kernel’s trustworthiness, are guaranteed to be safe. (My safe-string library can be seen as a trivial implementation of this programming style.)

When static type systems are used in this way, they don’t merely catch typos and bugs that good testing would have caught as a matter of course, but offer programmers guarantees that would have been impractical to obtain any other way.1 If you consider security important, you might bear this fact in mind when choosing languages for your next project.

Going further, the security benefits of rich static type systems are only now starting to trickle into mainstream industry. As libraries like “safe strings” and idioms like static capabilities become more familiar and get woven into future generations of development frameworks, we can expect marked improvements in the security and robustness of our applications.

In the not-too-distant future, perhaps, we might look back in amazement at the days when important security properties were neither free nor guaranteed but expensive and uncertain, underwritten only by the heroic efforts of individual programmers, struggling against impossible odds to achieve inhuman perfection.

Then again, it sure took garbage collection a long time to catch on.


1. How, for example, could you eliminate the possibility of SQL-injection and XSS holes via testing?

I suppose you could do it if you worked at it hard enough. You could augment your string data structures with run-time information about what they represent: this string represents SQL, this string represents plain-old text, and so on. Then you could redefine your string operations and template interpolation systems to assert that their string inputs were compatible. Of course, if these assertions ever failed, they would do so only at run time, when it would be too late to do anything but die rather ungracefully. So you would be forced to augment your code-coverage tools to ensure that every string-path was covered during testing. That way you could catch all potential run-time string failures – indicating holes – during testing and eliminate the holes (and the subsequent need to fail at run time) before you deployed your application for real.

So, yes, you could do it. But to do so would require you, in effect, to write a crude, single-purpose type system that checks types at test time. That says something, doesn’t it?

Posted in , ,
Tags , , , , ,
9 comments
no trackbacks
Reddit Delicious

A simple directory-tree printer in Haskell

Posted by Tom Moertel Thu, 22 Feb 2007 21:30:00 GMT

At a recent gathering, my friend Casey mentioned that he was learning a new programming language and, as a learning exercise, had written a directory-tree printer. That’s a program that recursively scans directory hierarchies and prints out a tree representation for each:

$ tlist cheating-hangman

cheating-hangman
|-- CVS
|   |-- Entries
|   |-- Repository
|   `-- Root
|-- Makefile
|-- cheating-hangman.lhs
`-- cheating-hangman.pl

I thought the problem sounded like fun, and so I wrote a small solution in Haskell. Let’s take a look.

Read more...

Posted in , ,
Tags , , ,
7 comments
no trackbacks
Reddit Delicious

Introductory Haskell: Solving the Sorting-It-Out Kata

Posted by Tom Moertel Tue, 31 Oct 2006 19:44:00 GMT

Last Tuesday, my friend Casey and I were hanging out at Aldo Coffee. We planned on enjoying some espresso, doing some work, and then heading over to the Pittsburgh Coding Dojo, where we could hang out with other geekly folks. We ended up not having enough time to go to the meeting, but we decided to hack on the challenge problem anyway, using Aldo’s ever-handy free wireless to access the Internet.

The Dojo problem was PragDave’s Kata Eleven – Sorting it Out. (It’s short; read it now.) We decided to use Haskell for our implementation language.

In this post, I’ll walk through our coding session and explain how our solution evolved. To better fit the session into a blog post, I have removed a lot of back-and-forth micro iterations, and I have edited some of the code for clarity.

The first part of the problem

The first part of the problem was “Sorting Balls.” The story: You need to implement a “rack” to hold the balls drawn at random (without replacement) from a bin containing sixty balls, numbered 0 to 59. Regardless of the order in which the balls are added to the rack, you need to present them in sorted order whenever you’re asked for them.

Upon reading this part of the challenge, a couple of thoughts sprung to mind:

  • Because the range of balls is so small, the problem was begging for a solution based on a counting sort.
  • Because the balls are uniquely numbered and drawn without replacement, we could even use a bit vector to represent counts.

Nevertheless, we decided to ignore these thoughts and implement a more-general solution that would work for any (orderable) values, not just small ranges of integers.

Sketching the interface

The first step, then, was to sketch out an interface. Our interface mirrored the one from the problem statement but was tweaked for Haskell:

mkRack :: Rack a
add    :: Ord a => a -> Rack a -> Rack a
balls  :: Rack a -> [a]

The function mkRack makes a new rack to hold values (“balls”) of type a. It’s equivalent to Rack.new in Ruby.

The add function adds a ball to a rack. You give it a ball and a rack, and it returns a new rack that is the same as the original rack but also contains the ball. (If you’re accustomed to stateful programming, this may seem weird. Why return a new rack instead of modifying the original rack? Because, in Haskell, you can’t change values: you can only create new values. At first, this constraint may seem limiting, but after you get used to it, you’ll find it empowering.)

Note: the Ord a qualification on the type signature of add says that it will work for any type a whose values can be ordered. The qualification is necessary because values of some types, like IO actions, cannot be compared to see which are less than the others.

The balls function is an “observer”: it lets you observe the balls in a rack by returning them as an ordered list.

And that’s the interface.

With the interface sketched, we gave it meaning by defining its properties.

Giving our interface meaning: defining properties using QuickCheck

QuickCheck is a powerful, easy-to-use testing tool. Instead of checking test cases, it checks properties – statements about what your code ought to do in general.

The great thing about QuickCheck properties is that they are testable documentation. They tell the world what your code is supposed to do, and they do so in a concise, formal language that just happens to be easily readable by humans and automatically testable by computers.

To specify the desired properties of our Rack interface, we first had to import QuickCheck:

import Test.QuickCheck

Then, we defined our first property. It said, simply, that a new rack must be empty when observed:

prop_New =
    balls mkRack =~ []

Our second property said that, when you add a ball x to a rack, the resulting rack must contain the same balls as the original rack plus x:

prop_AddAddsElement rack x =
    balls (add x rack) =~ (x : balls rack)

Both of the properties above rely upon a special, order-insensitive equality test that we defined for lists of Int values:

(=~) :: [Int] -> [Int] -> Bool
xs =~ ys = sort xs == sort ys

Note that under this test, [1,2] “equals” both [1,2] and [2,1], but it does not “equal” any other values.

The reason we defined this operator was to help us specify the two essential properties of add separately: (1) it must insert a ball into a rack, and (2) the new ball’s position, when observed, must preserve the rack’s ordering invariant. The previous property definition used the =~ operator to specify the first of these two properties. The next property we defined specified the second:

prop_AddPreservesOrdering rack x =
    isOrdered (balls rack) ==> isOrdered (balls (add x rack))

This definition specifies that, for all racks rack and all balls x, if the balls in rack are ordered, the balls in the rack that results from adding x to rack must also be ordered. If you are familiar with proof by induction, you’ll know why we went this route. In short, if we can prove that this property holds (and, trivially, that an empty rack is ordered), we can prove that add preserves the ordering invariant.

To round out the property definition, we needed to define the isOrdered test:

isOrdered :: [Int] -> Bool
isOrdered xs = xs == sort xs

And those are the properties we needed to check the correctness of our implementation. Of course, we still needed to write our implementation, and we turned to that task next.

A simple, list-based Rack implementation

For our first implementation, we decided upon a drop-dead-simple list-based representation. We would keep the elements of the list in sorted order by inserting them into the correct positions when add was called.

Here, then, was our code:

-- Our list-based implementation of a Rack

type Rack a = [a]

mkRack   = []
add x xs = insertList x xs
balls    = id

insertList :: Ord a => a -> [a] -> [a]
insertList x []     = [x]
insertList x (y:ys)
    | x < y         = x : y : ys
    | otherwise     = y : insertList x ys

That’s it.

We took our new implementation for a spin in GHCi:

*Rack> balls mkRack
[]

*Rack> balls (add 3 mkRack)
[3]

*Rack> balls (add 4 (add 3 mkRack))
[3,4]

*Rack> balls (add 1 (add 4 (add 3 mkRack)))
[1,3,4]

*Rack> balls (foldr add mkRack [4,2,6,3,-9,0,33,9])
[-9,0,2,3,4,6,9,33]

To really test our implementation, we asked QuickCheck to check its properties:

*Rack> quickCheck prop_New
OK, passed 100 tests.

*Rack> quickCheck prop_AddAddsElement
OK, passed 100 tests.

*Rack> quickCheck prop_AddPreservesOrdering
OK, passed 100 tests.

I should point out that QuickCheck did not prove that our properties held. Rather, it gathered evidence that we could use to argue that our properties held. The evidence was that each of our properties’ claims was subjected to 100 randomly generated tests, and none of the tests was able to disprove a claim.

Was this evidence sufficient for us to rest satisfied that our implementation was correct? Given how simple our implementation was, I felt that the evidence was sufficient. Casey agreed, and we moved on.

With the first implementation done, we decided to try a more-sophisticated implementation.

Generalizing the interface

Since we were about to have multiple implementations, it made sense for us to define a generalized interface that any “Rack-like” implementation could use. For that, Haskell’s type classes were perfect:

-- Our interface for "Rack-like" data types

class Racklike a ra | ra -> a where
    mkRack :: ra
    add    :: Ord a => a -> ra -> ra
    balls  :: ra -> [a]

The interface was essentially the same as before, except that the data type behind the rack implementation was not given by a specific type Rack a but rather by the type variable ra, which represents some type of rack container for balls of type a.

Note that ra determines a. If, for example, you know that the container type ra equals “a list of Int values,” you know that a must equal Int. (To represent this relationship, we used functional dependencies, a popular extension to the Haskell 98 standard.)

With the Racklike type class in place, we moved our list-based implementation inside of the interface:

-- Our list-based implementation of a Rack

type ListRack a = [a]

instance Racklike a (ListRack a) where
    mkRack = []
    add    = insertList
    balls  = id

Next, we modified our QuickCheck property definitions. Where before it was fine to assume that we would be testing our single, list-based implementation, now we needed to allow for testing other implementation types. We did this by adding a rackType parameter to our property definitions. We used the type, not the value, of this parameter to determine the type of rack to test:

prop_New rackType =
    balls (mkRack `asTypeOf` rackType) =~ []

prop_AddAddsElement rackType ballList x =
    balls (add x rack) =~ (x : balls rack)
  where
    rack = rackFromList ballList `asTypeOf` rackType

prop_AddPreservesOrdering rackType ballList x =
    isOrdered (balls rack) ==> isOrdered (balls (add x rack))
  where
    rack = rackFromList ballList `asTypeOf` rackType

Because we could no longer assume the rack would be represented as a list of integers, we wrote rackFromList to convert such a list into a rack:

rackFromList xs = foldr add mkRack xs

With these modifications in place, we re-ran our tests, specifying (via type annotations) that we wanted to run them for the ListRack implementation:

*Rack> quickCheck $ prop_New (undefined :: ListRack Int)
OK, passed 100 tests.

*Rack> quickCheck $ prop_AddAddsElement (undefined :: ListRack Int)
OK, passed 100 tests.

*Rack> quickCheck $ prop_AddPreservesOrdering (undefined :: ListRack Int)
OK, passed 100 tests.

A tree-based Rack implementation

Now that we were free to add additional implementation types, we created one based on binary trees. We started by defining the tree data type:

data Tree a
    = Empty
    | Root (Tree a) a (Tree a)
    deriving (Ord, Eq, Show)

This definition says that a tree can be either empty or a root node. A root node has a single value and left and right sub-trees.

Further, root nodes must satisfy an ordering invariant: if a root node’s value is x, all of the values in its left subtree must be less than x, and all of the values in its right subtree must be greater than or equal to x. The data type doesn’t enforce this invariant, so we would need to enforce it in our implementation.

Next, we wrote the basic functions for creating, adding elements to, and observing our trees.

We needed to be able to create empty trees:

emptyTree =
    Empty

Inserting an element into a tree requires us to walk the tree and append the element as a new leaf node in the correct location, being mindful of our ordering invariant. Because our data structure is inherently recursive, a recursive implementation was straightforward to code:

insertTree x Empty  = Root Empty x Empty
insertTree x (Root left y right)
    | x < y         = Root (insertTree x left) y right
    | otherwise     = Root left y (insertTree x right)

Note that we don’t try to ensure that the tree is balanced. The problem statement says that the balls are randomly selected, and thus we can expect our trees, on average, to be balanced naturally.

Next, we wrote the code to observe the elements of a tree. We used a functional-programming idiom for efficiently flattening a tree into a list:

elemsTree rx =
    elemsTree' rx []

elemsTree' Empty               = id
elemsTree' (Root left x right) =
    elemsTree' left . (x :) . elemsTree' right

Finally, we defined a new tree-based rack type and declared it to be an instance of the Racklike type class:

type TreeRack a = Tree a

instance Racklike a (TreeRack a) where
    mkRack = emptyTree
    add    = insertTree
    balls  = elemsTree

With the implementation done, we took it for a test drive:

*Rack> add 1 mkRack :: TreeRack Int
Root Empty 1 Empty

*Rack> add 3 (add 1 mkRack) :: TreeRack Int
Root Empty 1 (Root Empty 3 Empty)

*Rack> balls (add 3 (add 1 mkRack) :: TreeRack Int)
[1,3]

Then, for the real test, we checked that our properties held for TreeRacks:

*Rack> quickCheck $ prop_New (undefined :: TreeRack Int)
OK, passed 100 tests.

*Rack> quickCheck $ prop_AddAddsElement (undefined :: TreeRack Int)
OK, passed 100 tests.

quickCheck $ prop_AddPreservesOrdering (undefined :: TreeRack Int)
OK, passed 100 tests.

Satisfied with these results, we moved on to part two of the problem.

The second part of the problem

The second part of the problem was about sorting the letters within a block of text, ignoring white space and punctuation, and converting upper case letters into lower case: “Are there any ways to perform this sort cheaply, and without using built-in libraries?”

Again, a counting sort seemed like an obvious ideal solution, but we decided to recycle our existing code since we had to leave soon. Because our Rack implementations were generic, they would work on letters just as well as on numbers or other kinds of balls:

*Rack> balls (rackFromList "this is a test" :: TreeRack Char)
"   aehiisssttt" 

With our existing code already doing the hard work for us, it was trivial to code up the letter-sorting function:

sortLetters xs =
    balls (rackFromList letters :: TreeRack Char)
  where
    letters = [toLower x | x <- xs, isAlpha x]

(Note: Because of the nature of the problem, I interpreted the question’s “without using built-in libraries” to mean “without built-in sorting libraries.”)

We took the new function for a test drive, and it worked as expected:

*Rack> sortLetters "This is a test, pal." 
"aaehiilpsssttt" 

And that ended our coding session.

Update: Tweaked the revised definition of the AddAddsElement property for greater parallelism with the original.

Update 2007-03-03: Minor edits for clarity.

Posted in , , ,
Tags , , , ,
7 comments
no trackbacks
Reddit Delicious

A type-based solution to the "strings problem": a fitting end to XSS and SQL-injection holes?

Posted by Tom Moertel Thu, 19 Oct 2006 01:40:00 GMT

Even skilled programmers have a hard time keeping their web applications free of XSS and SQL-injection vulnerabilities. And it shows: a sobering portion of web sites are open to some scary security threats.

Why are so many sites vulnerable to these well-known holes? Probably because it’s insanely hard for programmers to solve the fundamental “strings problem” at the heart of these vulnerabilities. The problem itself is easy to understand, but we humans aren’t equipped to carry out the solution. Simply put, we just plain suck at keeping a bazillion different strings straight in our heads, let alone consistently and reliably rendering their interactions safe whenever they cross paths in a modern web application. It’s easy to say, “just escape the little buggers,” but it’s hard to get it right, every single time.

Computers, on the other hand, are pretty good at keeping track of details by the bucket-full. Wouldn’t it be nice, then, if our programming languages gave us the power to delegate this nasty “strings problem” to our computers, which could then devote their unwavering mechanical precision to grinding the problem out of existence? Isn’t that the kind of thing modern programming languages are supposed to be good at?

I’d like to think the answer to that question is a big, you betcha.

So let’s grab a modern programming language and solve the strings problem.

Let’s solve the strings problem in Haskell

In this article, we will look at one way (among many) to solve the strings problem: by adding Ruby-style string templates to Haskell. These templates support “interpolation” via the usual, convenient #{var} syntax, but here interpolation is type safe. Haskell’s type system will prevent us from inadvertently mixing incompatible string types, and it will detect mistakes at compile time, before they can become live XSS or SQL-injection holes. Further, our solution will offer us these benefits without making us jump through hoops or pay some onerous syntax penalty.

To be more specific, the system offers the following benefits:

  • It provides a string-management kernel that lets you create “safe strings” by certifying a regular string as representing either text or a fragment of a known language.
  • It allows you to conveniently define new language types for any string-based language that you can provide an escaping rule for (e.g., XML, URLs, SQL, untrusted user input).
  • It provides compile-time syntactic sugar (via Template Haskell) that makes working with safe strings as convenient as working with string interpolation in languages like Ruby and Perl.
  • It catches and reports (at compile time) the following commonly made programming errors:
    • failing to escape a plain-old-text string before mixing it into a string that represents a language fragment
    • mixing strings that represent fragments of incompatible languages
    • mixing strings that represent fragments of compatible languages in an ambiguous way (the system will force you to disambiguate)

(This is a long one, so grab an espresso, lean back, and read on in style. Also, if you have a smoking jacket, you might want to get it now.)

Read more...

Posted in , , , , , ,
Tags , , , ,
37 comments
no trackbacks
Reddit Delicious

If unit testing can't keep Rails safe from string-escaping problems, what makes you think it will keep your projects safe?

Posted by Tom Moertel Thu, 12 Oct 2006 20:06:00 GMT

Recently I wrote about unit testing being a tool, not a goal in itself. I argued that unit testing was not a reliable way to fight certain kinds of common coding errors and, therefore, that unit testing ought to be supplemented with other tools.

To support my argument, I gave an example of a common, important coding error that unit testing does a bad job of helping programmers control. That error is failing to manage and escape strings properly: the “strings problem.” It is the mother of XSS and SQL-injection security vulnerabilities, not to mention the cause of legions of broken links and bad HTML on the web.

If you think I’m overstating the problem, or if you think that unit testing is a good way of solving it, let me show you how easy it is for even smart developers to get it wrong.

Consider Ruby on Rails, a great framework for developing web applications. Rails has an extensive suite of unit tests, and the Rails development guidelines require that changes to Rails be accompanied by unit tests that “prove [the] change works.”

Now consider that one of Rails’s most-used and most-scrutinized methods – the venerable link_to helper – contains a fundamental string-escaping error:

require 'rubygems'
require_gem 'rails'
include ActionView::Helpers::UrlHelper

url = "http://example.com?ohms_law?volt=1&amp=3" 
puts link_to("TEST", url)

The code, when executed, prints the following HTML snippet:

<a href="http://example.com?ohms_law?volt=1&amp=3">TEST</a>

The HTML snippet represents a hypertext link. The link should point to the URL given in the code, but because the URL was not properly escaped when it was converted into HTML by the link_to helper, the link is broken:

CORRECT:  http://example.com?ohms_law?volt=1&amp=3
LINK_TO:  http://example.com?ohms_law?volt=1&=3
                                             ^ oops

Here’s what’s going on. Because the URL was not escaped, web browsers misinterpret its “amp” parameter as a character-entity reference, which gets gobbled up when the link’s href attribute is parsed. (To see this for yourself, save the output of the Ruby code into an HTML file, open the file with your favorite web browser, and see where the link points.)

Now, how come the unit tests didn’t catch this problem? It turns out, the tests got it wrong, too, by expecting broken output:

# in url_helper_test.rb

def test_link_tag_with_query
  assert_dom_equal \
    "<a href=\"http://www.example.com?q1=v1&amp;q2=v2\">Hello</a>",
    link_to("Hello", "http://www.example.com?q1=v1&amp;q2=v2")
end

The point isn’t that the Rails developers are dumb. The point is that the Rails developers are smart. If they can’t get the strings problem right, even with all their brains and all their unit testing, what reason does any programmer have to think that unit testing is going to solve this problem reliably?

If, then, you want to solve the strings problem – and you really, seriously ought to want to solve the strings problem – you should consider options beyond unit testing.

Update 2007-09-04: I just noticed that the documentation for link_to has been revised to state that if you pass a string as its options parameter, the string will be interpreted not as a URL but as an HTML href attribute value, that is, an HTML-encoded URL. The old documentation:

def link_to(name, options = {}, html_options = nil, *parms)
Creates a link tag of the given name using an URL created by the set of options.... It’s also possible to pass a string instead of an options hash to get a link tag that just points without consideration.

The relevant part of the revised documentation:

It’s also possible to pass a string instead of an options hash to get a link tag that uses the value of the string as the href for the link.

So, according to the updated documentation, the test I described in my article is actually correct. Does this mean that string-handling code is Rails is worry free? The existence of helper methods like fix_double_escape suggests the answer is no.

Posted in , ,
Tags , , , ,
5 comments
no trackbacks
Reddit Delicious

Older posts: 1 2 3