« Consensus trees implemented in Haskell, imperativelySyntactic distance in C++ »

Consensus trees implemented in Haskell

26/02/10

Permalink 03:39:32 pm, 1580 words
Categories: Code, Linguistics

Consensus trees implemented in Haskell

An implementation of consensus trees in Haskell

I’ve used hierachical clusters for the last few years as a part of my analysis of distance measures. But they are not super reliable—small changes tend to make corpora jump from cluster to cluster. One way to fix this is to introduce noise into the data and build a consensus tree. A majority consensus tree only creates clusters that the majority of hierarchical clusters agree on.

[PDF] Example of a single hierarchical cluster dendrogram

The problem is that this is a well-understood but obscure 30-year-old concept. You know what that means: the only off-the-shelf implementations are either (1) twenty years old (2) hopelessly obfuscated variants focussing on efficiency or (3) part of some custom research code that some scientists just threw online.

Yeah, so you’re going to get (3) in this post. Actually it turns out to be pretty easy to build consensus trees, so I guess that’s why there aren’t any good implementations out there.

data Tree a = Leaf a | Node a [Tree a] deriving (Show)
root tree = Node "ROOT" [Leaf "s0", tree]
leaf (Leaf _) = True
leaf (Node _ []) = True -- irregularities in buildNode, oh well
leaf _ = False

The shenanigans start early with two representations of leaf nodes sneaking in. I should probably just standardise on the (Node _ []) representation but having a Leaf constructor around makes literals easier to type. Fighting for the side of consistency, root provides a consistent root in case NONE of the trees can agreement on a root.

consensus trees = buildTree span rest |> fst
  where (span:rest) = majority (map root trees)
majority trees = trees
               |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
               |> Map.unionsWith (+)
               |> Map.filter (>m)
               |> Map.keys
               |> sortBy (comparing (negate . Set.size))
 where m = floor (fromIntegral (length trees) / 2)
spans (Leaf a) = Set.singleton $ Set.singleton a
spans (Node a kids) = Set.insert (Set.unions $ Set.toList kidspans) kidspans
  where kidspans = Set.unions $ map spans kids

This is the first half of the code: consensus finds the majority spans and rebuilds them into a tree. Spans are fairly simple: it’s just the leaves that a particular node dominates. For example, the green node in the first tree has the span {a b c} and the green node in the second tree has the span {b c}.

spans represents spans with sets so that it can glom all the subspans together with a big Set.unions and be sure that there aren’t any dupes. The recursive case figures the spans for all kids, then unions them together, calls that the parent’s span and adds it to the kids’ spans.

Since Haskell doesn’t have list literals, the output is kind of messy, but if you convert to lists, the output from the example tree is:

[["a"]
, ["a", "b", "c"]
, ["a", "b", "c", "d"]
, ["b"]
, ["b", "c"]
, ["c"]
, ["d"]]

Now if you run that on the following three trees, you’ll get three slightly different sets.

Once you combine those sets, the question is which spans are present in the majority of trees. For example, {a b c d} is present in all three trees. But {a b} is only present in tree 1, while {b c} is present in both tree 2 and tree 3. So {b c} will be part of the consensus tree (2/3 > 50%), but {a b} won’t (1/3 < 50%).

After calling spans, majority combines everything into one map via some gymnastics:

... |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
    |> Map.unionsWith (+)

In other words, map each span in spans to 1. Then combine the spans into one map using (+) to combine identical entries.

Next we filter for all the spans that occur in the majority of trees and then throw away the counts (we never really cared about them anyway).

... |> Map.filter (>m)
    |> Map.keys

Finally, we sort the sets by size. For the trees above, this gives us:

{a b c d} {b c d} {c d} {a} {b} {c} {d}

Now we can rebuild the tree recursively: for each span, look at the rest of the list for spans that are subsets of the current span. For example, with {a b c d}, the children should be {b c d} and {a}. Here’s the code to do that.

buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

What we have here is a cleverly masked, hand-rolled implementation of the State monad. Err, well actually I suppose that I’m getting ahead of myself. Let me just explain the code as is and later I’ll show the “imperative” version. Let’s start with buildNode since it does the actual matching of {a b c d} with {b c d} {a}.

So: buildNode receives two things:

  1. The span (here called span) to find children for.
  2. A list of other spans. The spans are sorted by size, so if the first other span is not smaller than span, then buildNode can just continue on down the list.

And: buildNode returns two things:

  1. The children of span.
  2. The same list of other spans, but with span’s children removed.

Number 2 is important because we need to keep this list around. Here’s why: start off finding the children for {a b c d} in the list

{b c d} {c d} {a} {b} {c} {d}

That’s easy:

{b c d} {a}

OK, now find the children for {b c d}. In what list? Well, whatever’s left from the previous call:

{c d} {b} {c} {d}

This gives a new (children,rest) pair:

({c d} {b}, {c} {d})

And the remaining spans {c} {d} will be needed to find the children of {c d}. All right, back to the actual code in buildNode. The (children,rest) pair is (nil,nil) after buildNode runs out of possible child spans to search for.

buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

Otherwise it checks whether the next span is a child of span; that is, if it’s a subset of span. If so, then it removes that subset (with Set.difference) and conses it onto the rest of the children that buildNode has found recursively. On the other hand, if the next span isn’t a child, then buildNode recurs on the rest of the spans and conses it back on afterward.

An aside about first and second. They’re defined in Control.Arrow, which is a module I really don’t understand, but for this use, you can imagine that they are defined like this:

first f (a,b) = (f a, b)
second f (a,b) = (a, f b)

In other words, if we call first, we’re doing something to the children of span. If we call second, we’re doing something to the spans that are left to be searched. For example, the second clause of buildNode picks the head off those spans. If the head is not a subset of span, then it sticks it back on with second (next:) after the recursive call returns in the last line.

OK, now we can look at buildTree:

buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)

buildTree has the same inputs as buildNode: a span and the list of spans to search for. It delegates finding the child spans to buildNode and concentrates on building up the tree structure properly. Like buildNode, it returns a pair: (tree, rest).

The non-terminal case operates in two steps:(1) it finds the kids for span by calling buildNode and (2) it recursively builds the tree structure for the kids by calling buildTree.

(1) is simple enough:

let (kids, rest) = buildNode span ranks in

But (2) is pretty hairy, because every time buildTree is called, we have to capture the rest and pass it to the next call of buildTree. That means instead of map buildTree kids we have to use a fold.

                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)

This particular fold iterates over kids (just like map would have done), and its fold function is basically (buildTree kid : alreadyBuiltNodes) (just like map would have been). Everything has been wrapped in a tuple: the start state is not [] but ([],rest), and the fold function has to wrap the buildTree call in a call to first. It’s just kind of ugly.

But! It works! If you load the preceding code, plus the following examples, you’ll have working consensus trees.

t1 = Node "a" [Node "b" [Node "c" [Leaf "s1", Leaf "s2"], Leaf "s3"], 
               Leaf "s4"]
t2 = Node "a" [Node "b" [Leaf "s1", Node "c" [Leaf "s2", Leaf "s3"]], 
               Leaf "s4"]
t3 = Node "a" [Leaf "s1", Node "b" [Node "c" [Leaf "s2", Leaf "s3"], 
               Leaf "s4"]]
ts = [t1,t2,t3]
-- Now try:
--> consensus ts

Of course this is Haskell, A Large Language, so there is always more than one way to do it. Next time I’ll switch buildTree and buildNode to explicitly use the State monad. You can decide if you like the short lines and frequent mutation of imperative programming.

2 comments

Comment from: Learn Spanish Blog [Visitor] · http://dreaminespanol.com
Congrats on the program. Nice work!
02/07/10 @ 14:26
Comment from: Herve Leger [Visitor] · http://www.hervelegershow.com
Valuable information
30/07/10 @ 21:44

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by b2evolution free blog software