« Reading and writing consensus trees in HaskellConsensus trees implemented in Haskell »

Consensus trees implemented in Haskell, imperatively

01/03/10

Permalink 11:30:27 am, 1177 words
Categories: Code, Linguistics

Consensus trees implemented in Haskell, imperatively

Last time I showed a functional implementation of consensus trees that ended up having to pass state around. It was kind of ugly. Fortunately, this pattern is common in Haskell, so there is a wrapper for it in the form of the State monad.

If you haven’t used the State monad before, it gives you one (1) variable—it may not sound like much, but the number was zero (0) before. All state-changing functions implicitly operate on this variable. This makes your code look a lot like Perl code (where the one variable is named $_, if you ever bother naming it). I apologise in advance if this gives you fever or convulsions.

The only code that changes is consensus, buildNode and buildTree, so you’ll have to refer to the previous post for the rest of the code. Here are the functional originals:

consensus trees = majority (map root trees) |> buildTree |> fst
buildTree [span] = (Leaf span, [])
buildTree (span:ranks) = let (kids, rest) = buildNode span ranks in
                         first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree (span:rest))
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

Before changing this code to use the State monad, let’s get started with some utility code. I think it’ll help you understand State if you haven’t worked with it before.

isStackEmpty = return . (==[]) =<< get
peek = return . head =<< get
pop = do
  first:rest <- get
  put rest
  return first
push x = put . (x:) =<< get
ifM cond seq alt = do
  b <- cond
  if b then seq else alt

If you remember from last time, we were passing around a tuple. The second element of the tuple always holds the remaining spans that are potential children for the current span. Here, that list of spans is stuffed into the State monad. Then, the State accessor functions like get and put let you mutate this list.

Internally, I believe that State is doing exactly the same tuple-passing trick as the original code. At least, that’s the way it’s always been explained to me. I suppose it could be creating a Real Mutable Cell for me. It would come out to the same thing.

Anyway, isStackEmpty is representative: get the span list, check to see if it’s empty, and return. peek, push and pop are variants on the same thing.

ifM just embeds if in a monad. Not too much to say about it.

OK, let’s port that code to the State monad, starting from the top down. consensus doesn’t change much. It runs evalState instead of fst, and that’s about it. evalState takes two arguments: a function to run in the State monad, and the initial value of the state variable. It returns the value of the function and throws away the state variable. (Conversely, execState throws away the value of the function and returns the state variable.)

consensus trees = evalState (buildTree span) ranks
  where (span:ranks) = majority (map root trees)

Next up is buildTree:

buildTree span =
  ifM isStackEmpty
    (return (Leaf span))
    (do
       kids <- buildNode span
       nodes <- mapM buildTree kids
       return (Node span nodes))

In buildTree, pattern matching goes away but the intent of isStackEmpty is the same. The true branch is also basically the same. However, the false branch is markedly simpler and the order of evaluation is clear: first find the current span’s kids, then build the nodes for each of the kids, then put them together in a new Node. All the shenanigans with first and foldl go away, because mapM gets the State monad to do all the tuple mangling in the background.

Of course, if you like putting everything on one line (like me), you can do that too…after all, do is just sugar for a bunch of higher order functions.

buildTree span =
  ifM isStackEmpty
    (return (Leaf span))
    ((return . Node span <=< mapM buildTree) =<< buildNode span)

Note that, as ugly as the one-liner is, if it weren’t monadic, it would be

(return . Node span . mapM buildTree) $ buildNode span

Haskell just forces us to mark compose and apply differently if they’re monadic or not. ((.) vs (<=<) and ($) vs (=<<)). This means that you have to learn two versions of some functions and know when to use them. On the plus side, you can be very sure where you are accessing state.

Whether this is important is a philosophical/cultural question more than a linguistic one: we’re edging into Sapir-Whorf territory here. For comparison, when writing C, I always know when I’m referring to something by reference versus by value because I have to very carefully mark pointers. There are even two syntaxes for attribute access (x.y vs p->y). Again, whether this is important should be decided mostly by your problem—it’s not possible to say whether marking by-reference is universally more important than marking state changes.

OK, end of digression. Let’s see how buildNode changes:

buildNode span =
  ifM isStackEmpty
    (return [])
    (ifM (return . (`Set.isSubsetOf` span) =<< peek)
         (do
           rank <- pop
           span' <- buildNode (span `Set.difference` rank)
           return (rank:span'))
         (do
           rank <- pop
           span' <- buildNode span
           push rank
           return span'))

Again, the empty check and terminal case are pretty much the same. However, the subset check is more complicated than before because of more monadic composition.

As with buildTree, the order of events is much clearer. Especially in the non-subset case, it’s much clearer that you pop off the current span, look at the rest, then push it back on when the recursive call returns. The difference between the two cases is maybe not quite as clear since they don’t sit on adjacent lines.

And of course you can write more of buildNode on one line too*:

buildNode span =
  ifM isStackEmpty
    (return [])
    (ifM (return . (`Set.isSubsetOf` span) =<< peek)
         (liftM2 (:) peek (buildNode . Set.difference span =<< pop))
         (do
           rank <- pop
           span' <- buildNode span
           push rank
           return span'))

So that’s the imperative version. It’s easier to read, but I had more trouble thinking about how to write it. It was easier to stop and think about the types when I was manually passing around the state. Also, the type errors while using State seemed pretty hard to understand. Probably I haven’t used it enough to get used to them.

Next time, the stupid script wrapper that goes around the whole thing. Not particularly impressive, but I haven’t blogged any of my small Haskell “scripts” that do actual file-to-file transformations. Usually pretty simple, but just complicated enough that I want to use Haskell instead of Python.

*If you saw cond in there, you’re not alone:

t = return True
condM [] = error "Should have used t"
condM ((test,action):conditions) = ifM test action (condM conditions)
buildNode' span =
  condM [(isStackEmpty, return [])
        ,(return . (`Set.isSubsetOf` span) =<< peek,
          (liftM2 (:) peek (buildNode' . Set.difference span =<< pop)))
        ,(t, do
            rank <- pop
            span' <- buildNode' span
            push rank
            return span')]

No feedback yet

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
multi-blog