Directory-tree printing in Haskell, part two: refactoring
Posted by Tom Moertel Wed, 07 Mar 2007 21:04:00 GMT
In my previous article on writing a simple directory-tree printer in Haskell, I wrote a small program to recursively scan file-system directories and print their contents as ASCII-art trees. The program made for an approachable example of how to use Haskell for “imperative” tasks, but it has a few problems.
First, the directory-scanning logic and tree-printing logic are intertwined. Neither is reusable. Second, both bits of logic are rigid, specialized for this particular task. Even if you could reuse them, you wouldn’t want to.
In this article, the second in a series, we will explore ways to make our original code more reusable. We will separate the directory scanning from the tree printing, harness the power of some old friends from Haskell’s libraries, and think about the costs and benefits of our changes.
The plan
Recall our original directory tree–listing solution, the core of which I will reprint below:
tlist path =
visit (if "/" `isPrefixOf` path then "" else ".") "" "" "" path
visit path leader tie arm node = do
putStrLn (leader ++ arm ++ tie ++ node)
visitChildren (path ++ "/" ++ node) (leader ++ extension)
where
extension = case arm of "" -> ""; "`" -> " "; _ -> "| "
visitChildren path leader =
whenM (doesDirectoryExist path) $ do
contents <- getDirectoryContents path
`catch` (\e -> return [show e])
let visibles = sort . filter (`notElem` [".", ".."]) $ contents
arms = replicate (length visibles - 1) "|" ++ ["`"]
zipWithM_ (visit path leader "-- ") arms visibles
The tlist function kicks off the process for a particular file-system path, handing off to visit which recursively descends the directory tree from the root node. The visit function calls visitChildren to expand the subtree, if any, for each node visited. The visitChildren function, in turn, calls back to visit to repeat the process for each child in the subtree. In effect, we are traversing the tree rooted at path, printing each node in passing.
To separate the traversal part from the printing part, we will introduce a tree data structure. The file system–traversal code will emit a tree, and the tree-showing code will consume a tree. We will rewrite our old tlist function, which we might as well rename to the more descriptive traverseAndPrint, to glue the two pieces together with the tree serving as glue:
traverseAndPrint :: Path -> IO ()
traverseAndPrint path = do
tree <- fsTraverse root path
putStrLn (showTree tree)
where
root = if "/" `isPrefixOf` path then "" else "."
That’s the plan. Now let’s carry it out.
Step 1: Introduce the intermediate tree data structure
To make our desired separation possible, let’s introduce the tree data
structure. Conveniently enough, the Data.Tree library
provides the structure for us:
data Tree a = Node {rootLabel :: a, subForest :: Forest a}
Let’s also define some descriptive type synonyms to help us think about our data:
type Path = String -- path
type DentName = String -- directory-entry name
type DirNode = (Path, DentName) -- directory-path/dentname pair
type DirTree = Tree DentName -- file-system tree
(Unlike last time, in this article I’ll place type annotations on all top-level declarations to help you see how the data flows between the pieces of our code.)
To give you an idea of what the tree structure looks like, here’s an example DirTree value:
exampleTree :: DirTree
exampleTree = Node "root" [
Node "subdir1" [
Node "file1.1" []
, Node "file1.2" []
, Node "subdir1.3" [
Node "file1.3.1" [] ] ]
, Node "file1" [] ]
Now let’s revise our old code to show a tree in this structure.
Step 2: Write the tree-showing code
This step is easy. We just copy and paste our old code, remove the I/O related bits, and tweak for clarity:
-- Purely functional tree-to-string formatting
showTree :: Tree String -> String
showTree t = unlines (showNode "" "" "" t)
showNode :: String -> String -> String -> Tree String -> [String]
showNode leader tie arm node = do
nodeRep : showChildren node (leader ++ extension)
where
nodeRep = leader ++ arm ++ tie ++ rootLabel node
extension = case arm of "" -> ""; "`" -> " "; _ -> "| "
showChildren :: Tree String -> String -> [String]
showChildren node leader =
let children = subForest node
arms = replicate (length children - 1) "|" ++ ["`"]
in concat (zipWith (showNode leader "-- ") arms children)
There are two main differences between this code and the old code. First, instead of processing directory nodes in the file system, we are processing tree nodes of the Tree data type. Second, because the code is pure, that is, free of side effects, its result is a String, not an IO action.
Lets take this code for a spin in GHCi. We’ll ask it to show our example tree from earlier in the article.
Ok, modules loaded: Main.
*Main> putStr (showTree exampleTree)
root
|-- subdir1
| |-- file1.1
| |-- file1.2
| `-- subdir1.3
| `-- file1.3.1
`-- file1
Great. Let’s move on.
Step 3: Write the directory-traversal code
In this step, we want to write the code that recursively scans (traverses) a directory hierarchy and returns a DirTree that represents the hierarchy. The following code, which does just that, is a straightforward extraction of our old code’s visit and visitChildren functions:
-- Effectful directory-traversal code that returns a tree
-- representing the directory hierarchy rooted at 'path/node'.
fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse path node =
Node node `liftM` fsTraverseChildren (path ++ "/" ++ node)
fsTraverseChildren :: Path -> IO (Forest DentName)
fsTraverseChildren path =
mapM (uncurry fsTraverse) =<< fsGetChildren path
The fsTraverse function takes advantage of liftM to “lift” the pure function (Node node) into the IO monad, where it is applied to the result of the effectful fsTraverseChildren function. It is shorter and more communicative of our intent than the following equivalent code:
do childForest <- fsTraverseChildren (path ++ "/" ++ node)
return (Node node childForest)
While extracting these functions from our original code, I took the liberty of factoring out the code that grabs the contents of a directory. This code now lives in its own helper function:
-- helper to get traversable directory entries
fsGetChildren :: Path -> IO [DirNode]
fsGetChildren path = do
contents <- getDirectoryContents path `catch` const (return [])
let visibles = sort . filter (`notElem` [".", ".."]) $ contents
return (map ((,) path) visibles)
This bit of code differs from the code in the previous article in that it treats any error encountered when scanning a directory as if the directory were empty. That’s what the
... `catch` const (return [])
addendum does. It says that any error caught while scanning the directory will be handled by a constant policy, that of returning an empty list.
This policy tends to be what we want in practice because the two most common errors we expect to encounter here will be “oops, that’s a file, so it can’t have children” and “oops, you’re not allowed to scan that directory.” This error-handling policy will, however, throw away potentially useful information, say, in the rare case that the error is something like “oops, your hard drive caught on fire and is about to explode: start running now.” (Maybe we’ll revisit our policy later.)
Step 4: Connect the traversal code to the printing code
We can traverse file-system hierarchies to produce trees. We can print trees. To traverse and print a hierarchy, then, requires nothing but a little glue. Conveniently, we have already written the needed glue, traverseAndPrint, when we formulated our plan. Here is a slightly more concise version:
traverseAndPrint :: Path -> IO ()
traverseAndPrint path =
putStr . showTree =<< fsTraverse root path
where
root = if "/" `isPrefixOf` path then "" else "."
Now we have a complete solution. Let’s give it a spin in GHCi. I’ll traverse and print the Darcs repository that holds the code for this article.
*Main> traverseAndPrint "_darcs"
_darcs
|-- inventories
|-- inventory
|-- patches
| |-- 20070226210515-8fbd9-42e4...74102.gz
| |-- 20070226210634-8fbd9-86cd...490ef.gz
| |-- 20070227053004-8fbd9-122b...5023d.gz
| |-- [ entries omitted for brevity ]
| |-- 20070227233827-8fbd9-94ab...4852f.gz
| |-- 20070228181747-8fbd9-db9b...e8ce6.gz
| |-- 20070303192612-8fbd9-ff2a...3d7bb.gz
| `-- pending
|-- prefs
| |-- author
| |-- binaries
| |-- boring
| `-- motd
`-- pristine
|-- Makefile
`-- tlist.hs
(I shortened some of the longer file names to prevent them from escaping the narrow text column of my blog.)
Very well. Now let’s see if we can tighten our code.
Step 5: Simplify
One benefit of breaking monolithic code into composable pieces is that the pieces are often amenable to further simplification. Our fsTraverse function, for example, was extracted from our old code, where it combined directory-scanning and tree-printing knowledge. Now, the tree-printing knowledge factored out, it combines directory-scanning and tree-building knowledge. But the Tree data type, as you might expect, already contains that second piece of knowledge. So we can simplify fsTraverse further by removing this duplicated know-how.
The function that makes the simplification possible is unfoldTreeM:
*Main> :t unfoldTreeM
unfoldTreeM :: (Monad m) => (b -> m (a, [b])) -> b -> m (Tree a)
This function, which is parametrized by a step function, grows a tree from a single seed value. The step function is what we provide to guide each step of the process. It consumes a seed value and produces a tree node and a list of seeds from which the node’s children can be grown. The unfoldTreeM function calls our step function recursively on the seeds until the entire tree is grown. (If our step function were not effectful, requiring I/O, we could use the pure function unfoldTree instead of unfoldTreeM to build our trees.)
Our directory-scanning step function parallels our earlier fsTraverse function, which I will reprint below for comparison, but eliminates knowledge of the Tree data structure.
fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse path node =
Node node `liftM` fsTraverseChildren (path ++ "/" ++ node)
fsTraverseStep :: DirNode -> IO (DentName, [DirNode])
fsTraverseStep (path, node) =
(,) node `liftM` fsGetChildren (path ++ "/" ++ node)
Note that a DirNode is simply a (Path, DentName) pair. Thus the first argument of fsTraverseStep is equivalent to the first two arguments of fsTraverse. This pairing is convenient because the seed values we produce are also DirNodes, and this parity is required by the type of unfoldTreeM. (Had our step function retained the two-argument form, we could have used the uncurry combinator to transform it into the form required for unfoldTreeM. Later, we’ll use curry to perform the opposite transformation.)
Now we can use unfoldTreeM to build file-system hierarchies. Let’s try it using GHCi:
*Main> unfoldTreeM fsTraverseStep (".", "_darcs")
Node {rootLabel = "_darcs", subForest = [ {- more nodes here -} ]}
The final step of our simplification is to rewrite fsTraverse in terms of unfoldTreeM:
fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse = curry (unfoldTreeM fsTraverseStep)
Note the use of curry to adapt unfoldTreeM’s arguments into the form implied by fsTraverse’s type signature.
The final code (for now)
Our simplification done, here is our complete, refactored program:
module Main (main) where
import Control.Monad
import Data.List
import Data.Tree
import System.Directory
import System.Environment
-- Some convenient type synonyms
type Path = String -- path
type DentName = String -- directory-entry name
type DirNode = (Path, DentName) -- directory-path/dentname pair
type DirTree = Tree DentName -- file-system tree
-- High-level program logic: get args and print a tree for each
main :: IO ()
main = do
args <- getArgs
mapM_ traverseAndPrint (if null args then ["."] else args)
traverseAndPrint :: Path -> IO ()
traverseAndPrint path =
putStr . showTree =<< fsTraverse root path
where
root = if "/" `isPrefixOf` path then "" else "."
-- Effectful tree-builder for file-system hierarchies
fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse = curry (unfoldTreeM fsTraverseStep)
fsTraverseStep :: DirNode -> IO (DentName, [DirNode])
fsTraverseStep (path, node) =
(,) node `liftM` fsGetChildren (path ++ "/" ++ node)
-- Helper to get traversable directory entries
fsGetChildren :: Path -> IO [DirNode]
fsGetChildren path = do
contents <- getDirectoryContents path `catch` const (return [])
let visibles = sort . filter (`notElem` [".", ".."]) $ contents
return (map ((,) path) visibles)
-- Purely functional tree-to-string formatting
showTree :: Tree String -> String
showTree t = unlines (showNode "" "" "" t)
showNode :: String -> String -> String -> Tree String -> [String]
showNode leader tie arm node =
nodeRep : showChildren node (leader ++ extension)
where
nodeRep = leader ++ arm ++ tie ++ rootLabel node
extension = case arm of "" -> ""; "`" -> " "; _ -> "| "
showChildren :: Tree String -> String -> [String]
showChildren node leader =
let children = subForest node
arms = replicate (length children - 1) "|" ++ ["`"]
in concat (zipWith (showNode leader "-- ") arms children)
Let’s review
We have refactored our original, small, monolithic solution into a collection of more reusable code snippets. The total “cost” of our new solution is certainly higher than before – the code is twice as long – but we can reasonably expect that by reusing this code we can reduce the cost of future solutions that might require similar functionality.
Say, for example, we need to write some code to count the files and directories within a hierarchy. Building upon our new code, we can write it easily, as a one-liner in GHCi even:
*Main> print . length . flatten =<< fsTraverse "." "_darcs"
27
Still, there is much about our code that could be better. First, unlike our old solution, our new solution builds the entire tree before it begins to emit output. If the tree is large, its memory footprint could be daunting. Second, our error-handling policy is mildly embarrassing. Third, our directory-traversal code isn’t flexible. What if we want to get the size of each file we visit during the traversal? Right now, there’s no easy way to do it.
Fortunately, we aren’t done yet. Haskell offers us many opportunities for abstraction and parametrization that we are not taking advantage of. In the next article, we’ll exploit a few of these opportunities to make our code a little more flexible and a little more idiomatic.
See you then.
readers


Hi Tom,
I don’t understand why this version has to read the whole tree before it begins to emit output. Doesn’t Haskell’s laziness help here? showTree looks like it builds a list incrementally, so I would have thought that the first output line could be emitted before the entire tree is produced. In fact, as soon as the first Node of the tree is created, couldn’t this be consumed by showTree?
As a newcomer to Haskell, laziness is one of the more difficult subjects to understand (next to Monads). I’d appreciate any insight into the matter.
Thanks, Pete
Pete, that’s a great question. For the full answer, wait for the next article in this ever-growing series. Until then, here’s the short answer.
The reason why this version emits the whole tree at once is because of the ordering guarantees provided by the IO monad. If you look at the type of fsTraverse,
you’ll see that it produces an I/O action that, when performed, returns a DirTree. If you sequence that action with another I/O action, say one that prints the tree, the first action’s effects (in this case I/O) are guaranteed to have been completed before the second’s begin to take place. Thus all of the directory scanning must be finished before the first output from the formatted tree can be emitted.
Some of the standard library functions, however, use an interleaving trick to produce I/O actions that immediately return values which can be consumed lazily. The ever-popular getContents and readFile functions are good examples.
In the next article in this series, I’ll show how we can use the same trick to restore the lean memory characteristics of our original implementation.
Cheers,
Tom
Thank you, that makes sense. It also explains some things I’ve seen in my own code.
One followup question though, does this also apply to non-IO monads as well? For example, let’s pretend that fsTraverse worked outside of the IO monad, but for other reasons we wanted to have access to some configuration that we access from inside the Reader monad:
Would this also prevent the tree from being lazily consumed also assuming that showTree was in the Reader monad? I ask only because it seems when I read something about monads, it varies depending on whether or not it refers to the IO monad.
Thanks, Pete
Pete,
That’s another great question. Before I answer, let me say that despite how complicated this explanation may seem, with a little time, all of this monad and I/O business becomes intuitive.
Now, the answer.
Different monads provide different rules that govern how their actions are interpreted. The IO monad’s rules say that if you combine two actions
the side effects produced by action1 will be finished before the side effects of action2 begin. This ordering rule applies to the side effects, not necessarily to the Haskell code and data that interact with those effects. The interactions, however, often force our code and data to be evaluated and constructed in parallel with the effects, thus imposing an ordering on them, too.
In our tree-building code, for example, a tree is constructed by a single IO action created by fsTraverse. When that action is sequenced with the action of printing the tree, all of the tree-building side effects are forced to occur before the first printing side effect. But because the tree-building side effects are intertwined with the code for building the tree data structure, forcing the all of the side effects to occur also forces the all of the tree to be constructed.
Other monads offer different rules, and most do not have the profound ordering effects on intertwined code that the IO monad does. As an example, let me write some code to build an infinite tree as a single monadic action:
We will run this code in different monads to see whether we can print the top of the tree before the entire tree is constructed, which will take forever since the tree is infinite. That is, if we can print the first five lines of the tree, we know the tree was produced lazily.
I’ll write some code to print the top of a tree.
As you might expect, when we build our infinite tree as an Identity monad action, which is equivalent to running a “pure” version of our code, the tree is constructed lazily, with no unexpected sequencing:
To answer your question, What about the Reader monad?
So the Reader monad does not interfere with the lazy construction of our tree.
But when we force the tree-building to occur as an IO action, the monad’s sequencing guarantees prevent any printing from occurring before the entire infinite tree is constructed:
In the end, I had to interrupt the computation to regain control of my GHCi session.
In sum, the IO monad often forces our code to be evaluated in a certain sequence by forcing the side effects that intertwine with our code to be sequenced. Most other monads do not have such drastic effects on evaluation order.
I hope this answers your question.
Cheers, Tom
Tom,
Thank you once again for the lucid explanation. It’s starting to make sense to me. With that said, I have one final question along this line of thought. The small program below prints the top 5 lines of all files specified as arguments. It uses readFile which as you state uses the interleaving trick to allow the output of it to be consumed lazily.
module Main where import System.Environment main = mapM_ doFile =<< getArgs where doFile f = do contents <- readFile f putStr . unlines . take 5 . lines $ contentsIf I invoke this program with two arguments and assuming the first of which is a very, very large file, is this interleaving trick smart enough so that readFile does not have to read the entire file? Just to be clear, I’m not asking whether or not putStr can start consuming before reading the entire file, but rather does the current invocation of readFile have to complete in its entirety before the next invocation occurs?
Thanks, Pete
Pete,
No, you don’t have to read the entire contents of the string returned by readFile. Once you are done reading from the string, the underlying handle that was used to feed you data lazily can be reclaimed, and its finalizer will close the file.
One good way to see this behavior is to run your program on two files, the first of which is of “infinite” length. Here I’ll use the Unix
yesprogram to generate an infinite-length stream of “y” lines. Let’s see if we can “take 5” from this stream and then move on to another file without getting hung up:As you can see, the program did not need to read the first file to exhaustion, which would have taken forever. Instead, it read as much as it wanted (five lines) and moved on to the next file (which happened to be its own source code).
Cheers,
Tom
Great! Thanks again Tom! This has been extremely helpful. Sorry for hijacking the comments.
Now if you could only decipher Monads for me. Don Stewart’s recent blog entry on using Haskell for shell scripting has my head spinning … all those type signatures.