Directory-tree printing in Haskell, part two: refactoring

By
Posted on
Tags: haskell, refactoring, trees, io, directory-tree-series

In my previous article on writing a simple directory-tree printer in Haskell, I wrote a small program to recursively scan file-system directories and print their contents as ASCII-art trees. The program made for an approachable example of how to use Haskell for “imperative” tasks, but it has a few problems.

First, the directory-scanning logic and tree-printing logic are intertwined. Neither is reusable. Second, both bits of logic are rigid, specialized for this particular task. Even if you could reuse them, you wouldn’t want to.

In this article, the second in a series, we will explore ways to make our original code more reusable. We will separate the directory scanning from the tree printing, harness the power of some old friends from Haskell’s libraries, and think about the costs and benefits of our changes.

The plan

Recall our original directory tree–listing solution, the core of which I will reprint below:

tlist path =
    visit (if "/" `isPrefixOf` path then "" else ".") "" "" "" path

visit path leader tie arm node = do
    putStrLn (leader ++ arm ++ tie ++ node)
    visitChildren (path ++ "/" ++ node) (leader ++ extension)
  where
    extension = case arm of ""  -> ""; "`" -> "    "; _   -> "|   "

visitChildren path leader =
    whenM (doesDirectoryExist path) $ do
        contents <- getDirectoryContents path
            `catch` (\e -> return [show e])
        let visibles = sort . filter (`notElem` [".", ".."]) $ contents
            arms = replicate (length visibles - 1) "|" ++ ["`"]
        zipWithM_ (visit path leader "-- ") arms visibles

The tlist function kicks off the process for a particular file-system path, handing off to visit which recursively descends the directory tree from the root node. The visit function calls visitChildren to expand the subtree, if any, for each node visited. The visitChildren function, in turn, calls back to visit to repeat the process for each child in the subtree. In effect, we are traversing the tree rooted at path, printing each node in passing.

To separate the traversal part from the printing part, we will introduce a tree data structure. The file system–traversal code will emit a tree, and the tree-showing code will consume a tree. We will rewrite our old tlist function, which we might as well rename to the more descriptive traverseAndPrint, to glue the two pieces together with the tree serving as glue:

traverseAndPrint :: Path -> IO ()
traverseAndPrint path = do
    tree <- fsTraverse root path
    putStrLn (showTree tree)
  where
    root = if "/" `isPrefixOf` path then "" else "."

That’s the plan. Now let’s carry it out.

Step 1: Introduce the intermediate tree data structure

To make our desired separation possible, let’s introduce the tree data structure. Conveniently enough, the Data.Tree library provides the structure for us:

data Tree a = Node {rootLabel :: a, subForest :: Forest a}

Let’s also define some descriptive type synonyms to help us think about our data:

type Path       = String           -- path
type DentName   = String           -- directory-entry name
type DirNode    = (Path, DentName) -- directory-path/dentname pair
type DirTree    = Tree DentName    -- file-system tree

(Unlike last time, in this article I’ll place type annotations on all top-level declarations to help you see how the data flows between the pieces of our code.)

To give you an idea of what the tree structure looks like, here’s an example DirTree value:

exampleTree :: DirTree
exampleTree = Node "root" [
                Node "subdir1" [
                  Node "file1.1" []
                , Node "file1.2" []
                , Node "subdir1.3" [
                    Node "file1.3.1" [] ] ]
              , Node "file1" [] ]

Now let’s revise our old code to show a tree in this structure.

Step 2: Write the tree-showing code

This step is easy. We just copy and paste our old code, remove the I/O related bits, and tweak for clarity:

-- Purely functional tree-to-string formatting

showTree :: Tree String -> String
showTree t = unlines (showNode "" "" "" t)

showNode :: String -> String -> String -> Tree String -> [String]
showNode leader tie arm node = do
    nodeRep : showChildren node (leader ++ extension)
  where
    nodeRep   = leader ++ arm ++ tie ++ rootLabel node
    extension = case arm of ""  -> ""; "`" -> "    "; _   -> "|   "

showChildren :: Tree String -> String -> [String]
showChildren node leader =
    let children = subForest node
        arms     = replicate (length children - 1) "|" ++ ["`"]
    in  concat (zipWith (showNode leader "-- ") arms children)

There are two main differences between this code and the old code. First, instead of processing directory nodes in the file system, we are processing tree nodes of the Tree data type. Second, because the code is pure, that is, free of side effects, its result is a String, not an IO action.

Lets take this code for a spin in GHCi. We’ll ask it to show our example tree from earlier in the article.

Ok, modules loaded: Main.
*Main> putStr (showTree exampleTree)
root
|-- subdir1
|   |-- file1.1
|   |-- file1.2
|   `-- subdir1.3
|       `-- file1.3.1
`-- file1

Great. Let’s move on.

Step 3: Write the directory-traversal code

In this step, we want to write the code that recursively scans (traverses) a directory hierarchy and returns a DirTree that represents the hierarchy. The following code, which does just that, is a straightforward extraction of our old code’s visit and visitChildren functions:

-- Effectful directory-traversal code that returns a tree
-- representing the directory hierarchy rooted at 'path/node'.

fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse path node =
    Node node `liftM` fsTraverseChildren (path ++ "/" ++ node)

fsTraverseChildren :: Path -> IO (Forest DentName)
fsTraverseChildren path =
    mapM (uncurry fsTraverse) =<< fsGetChildren path

The fsTraverse function takes advantage of liftM to “lift” the pure function (Node node) into the IO monad, where it is applied to the result of the effectful fsTraverseChildren function. It is shorter and more communicative of our intent than the following equivalent code:

do childForest <- fsTraverseChildren (path ++ "/" ++ node)
   return (Node node childForest)

While extracting these functions from our original code, I took the liberty of factoring out the code that grabs the contents of a directory. This code now lives in its own helper function:

-- helper to get traversable directory entries

fsGetChildren :: Path -> IO [DirNode]
fsGetChildren path = do
    contents <- getDirectoryContents path `catch` const (return [])
    let visibles = sort . filter (`notElem` [".", ".."]) $ contents
    return (map ((,) path) visibles)

This bit of code differs from the code in the previous article in that it treats any error encountered when scanning a directory as if the directory were empty. That’s what the

... `catch` const (return [])

addendum does. It says that any error caught while scanning the directory will be handled by a constant policy, that of returning an empty list.

This policy tends to be what we want in practice because the two most common errors we expect to encounter here will be “oops, that’s a file, so it can’t have children” and “oops, you’re not allowed to scan that directory.” This error-handling policy will, however, throw away potentially useful information, say, in the rare case that the error is something like “oops, your hard drive caught on fire and is about to explode: start running now.” (Maybe we’ll revisit our policy later.)

Step 4: Connect the traversal code to the printing code

We can traverse file-system hierarchies to produce trees. We can print trees. To traverse and print a hierarchy, then, requires nothing but a little glue. Conveniently, we have already written the needed glue, traverseAndPrint, when we formulated our plan. Here is a slightly more concise version:

traverseAndPrint :: Path -> IO ()
traverseAndPrint path =
    putStr . showTree =<< fsTraverse root path
  where
    root = if "/" `isPrefixOf` path then "" else "."

Now we have a complete solution. Let’s give it a spin in GHCi. I’ll traverse and print the Darcs repository that holds the code for this article.

*Main> traverseAndPrint "_darcs"
_darcs
|-- inventories
|-- inventory
|-- patches
|   |-- 20070226210515-8fbd9-42e4...74102.gz
|   |-- 20070226210634-8fbd9-86cd...490ef.gz
|   |-- 20070227053004-8fbd9-122b...5023d.gz
|   |--   [ entries omitted for brevity ]
|   |-- 20070227233827-8fbd9-94ab...4852f.gz
|   |-- 20070228181747-8fbd9-db9b...e8ce6.gz
|   |-- 20070303192612-8fbd9-ff2a...3d7bb.gz
|   `-- pending
|-- prefs
|   |-- author
|   |-- binaries
|   |-- boring
|   `-- motd
`-- pristine
    |-- Makefile
    `-- tlist.hs

(I shortened some of the longer file names to prevent them from escaping the narrow text column of my blog.)

Very well. Now let’s see if we can tighten our code.

Step 5: Simplify

One benefit of breaking monolithic code into composable pieces is that the pieces are often amenable to further simplification. Our fsTraverse function, for example, was extracted from our old code, where it combined directory-scanning and tree-printing knowledge. Now, the tree-printing knowledge factored out, it combines directory-scanning and tree-building knowledge. But the Tree data type, as you might expect, already contains that second piece of knowledge. So we can simplify fsTraverse further by removing this duplicated know-how.

The function that makes the simplification possible is unfoldTreeM:

*Main> :t unfoldTreeM
unfoldTreeM :: (Monad m) => (b -> m (a, [b])) -> b -> m (Tree a)

This function, which is parametrized by a step function, grows a tree from a single seed value. The step function is what we provide to guide each step of the process. It consumes a seed value and produces a tree node and a list of seeds from which the node’s children can be grown. The unfoldTreeM function calls our step function recursively on the seeds until the entire tree is grown.

Our directory-scanning step function parallels our earlier fsTraverse function, which I will reprint below for comparison, but eliminates knowledge of the Tree data structure.

fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse path node =
    Node node `liftM` fsTraverseChildren (path ++ "/" ++ node)

fsTraverseStep :: DirNode -> IO (DentName, [DirNode])
fsTraverseStep (path, node) =
    (,) node `liftM` fsGetChildren (path ++ "/" ++ node)

Note that a DirNode is simply a pair. Thus the first argument of fsTraverseStep is equivalent to the first two arguments of fsTraverse. This pairing is convenient because the seed values we produce are also DirNodes, and this parity is required by the type of unfoldTreeM.

Now we can use unfoldTreeM to build file-system hierarchies. Let’s try it using GHCi:

*Main> unfoldTreeM fsTraverseStep (".", "_darcs")
Node {rootLabel = "_darcs", subForest = [ {- more nodes here -} ]}

The final step of our simplification is to rewrite fsTraverse in terms of unfoldTreeM:

fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse = curry (unfoldTreeM fsTraverseStep)

Note the use of curry to adapt unfoldTreeM’s arguments into the form implied by fsTraverse’s type signature.

The final code

Our simplification done, here is our complete, refactored program:

module Main (main) where

import Control.Monad
import Data.List
import Data.Tree
import System.Directory
import System.Environment


-- Some convenient type synonyms

type Path       = String           -- path
type DentName   = String           -- directory-entry name
type DirNode    = (Path, DentName) -- directory-path/dentname pair
type DirTree    = Tree DentName    -- file-system tree


-- High-level program logic:  get args and print a tree for each

main :: IO ()
main = do
    args <- getArgs
    mapM_ traverseAndPrint (if null args then ["."] else args)

traverseAndPrint :: Path -> IO ()
traverseAndPrint path =
    putStr . showTree =<< fsTraverse root path
  where
    root = if "/" `isPrefixOf` path then "" else "."


-- Effectful tree-builder for file-system hierarchies

fsTraverse :: Path -> DentName -> IO DirTree
fsTraverse = curry (unfoldTreeM fsTraverseStep)

fsTraverseStep :: DirNode -> IO (DentName, [DirNode])
fsTraverseStep (path, node) =
    (,) node `liftM` fsGetChildren (path ++ "/" ++ node)


-- Helper to get traversable directory entries

fsGetChildren :: Path -> IO [DirNode]
fsGetChildren path = do
    contents <- getDirectoryContents path `catch` const (return [])
    let visibles = sort . filter (`notElem` [".", ".."]) $ contents
    return (map ((,) path) visibles)


-- Purely functional tree-to-string formatting

showTree :: Tree String -> String
showTree t = unlines (showNode "" "" "" t)

showNode :: String -> String -> String -> Tree String -> [String]
showNode leader tie arm node =
    nodeRep : showChildren node (leader ++ extension)
  where
    nodeRep   = leader ++ arm ++ tie ++ rootLabel node
    extension = case arm of ""  -> ""; "`" -> "    "; _   -> "|   "

showChildren :: Tree String -> String -> [String]
showChildren node leader =
    let children = subForest node
        arms     = replicate (length children - 1) "|" ++ ["`"]
    in  concat (zipWith (showNode leader "-- ") arms children)

Let’s review

We have refactored our original, small, monolithic solution into a collection of more reusable code snippets. The total “cost” of our new solution is certainly higher than before – the code is twice as long – but we can reasonably expect that by reusing this code we can reduce the cost of future solutions that might require similar functionality.

Say, for example, we need to write some code to count the files and directories within a hierarchy. Building upon our new code, we can write it easily, as a one-liner in GHCi even:

*Main> print . length . flatten =<< fsTraverse "." "_darcs"
27

Still, there is much about our code that could be better. First, unlike our old solution, our new solution builds the entire tree before it begins to emit output. If the tree is large, its memory footprint could be daunting. Second, our error-handling policy is mildly embarrassing. Third, our directory-traversal code isn’t flexible. What if we want to get the size of each file we visit during the traversal? Right now, there’s no easy way to do it.

Fortunately, we aren’t done yet. Haskell offers us many opportunities for abstraction and parametrization that we are not taking advantage of. In the next article, we’ll exploit a few of these opportunities to make our code a little more flexible and a little more idiomatic.

See you then.