| « Consensus trees implemented in Haskell, imperatively | Syntactic distance in C++ » |
An implementation of consensus trees in Haskell
I’ve used hierachical clusters for the last few years as a part of my analysis of distance measures. But they are not super reliable—small changes tend to make corpora jump from cluster to cluster. One way to fix this is to introduce noise into the data and build a consensus tree. A majority consensus tree only creates clusters that the majority of hierarchical clusters agree on.

[PDF] Example of a single hierarchical cluster dendrogram
The problem is that this is a well-understood but obscure 30-year-old concept. You know what that means: the only off-the-shelf implementations are either (1) twenty years old (2) hopelessly obfuscated variants focussing on efficiency or (3) part of some custom research code that some scientists just threw online.
Yeah, so you’re going to get (3) in this post. Actually it turns out to be pretty easy to build consensus trees, so I guess that’s why there aren’t any good implementations out there.
data Tree a = Leaf a | Node a [Tree a] deriving (Show) root tree = Node "ROOT" [Leaf "s0", tree] leaf (Leaf _) = True leaf (Node _ []) = True -- irregularities in buildNode, oh well leaf _ = False
The shenanigans start early with two representations of leaf nodes sneaking in. I should probably just standardise on the (Node _ []) representation but having a Leaf constructor around makes literals easier to type. Fighting for the side of consistency, root provides a consistent root in case NONE of the trees can agreement on a root.
consensus trees = buildTree span rest |> fst
where (span:rest) = majority (map root trees)
majority trees = trees
|> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
|> Map.unionsWith (+)
|> Map.filter (>m)
|> Map.keys
|> sortBy (comparing (negate . Set.size))
where m = floor (fromIntegral (length trees) / 2)
spans (Leaf a) = Set.singleton $ Set.singleton a
spans (Node a kids) = Set.insert (Set.unions $ Set.toList kidspans) kidspans
where kidspans = Set.unions $ map spans kids
This is the first half of the code: consensus finds the majority spans and rebuilds them into a tree. Spans are fairly simple: it’s just the leaves that a particular node dominates. For example, the green node in the first tree has the span {a b c} and the green node in the second tree has the span {b c}.


spans represents spans with sets so that it can glom all the subspans together with a big Set.unions and be sure that there aren’t any dupes. The recursive case figures the spans for all kids, then unions them together, calls that the parent’s span and adds it to the kids’ spans.
Since Haskell doesn’t have list literals, the output is kind of messy, but if you convert to lists, the output from the example tree is:
[["a"] , ["a", "b", "c"] , ["a", "b", "c", "d"] , ["b"] , ["b", "c"] , ["c"] , ["d"]]
Now if you run that on the following three trees, you’ll get three slightly different sets.



Once you combine those sets, the question is which spans are present in the majority of trees. For example, {a b c d} is present in all three trees. But {a b} is only present in tree 1, while {b c} is present in both tree 2 and tree 3. So {b c} will be part of the consensus tree (2/3 > 50%), but {a b} won’t (1/3 < 50%).
After calling spans, majority combines everything into one map via some gymnastics:
... |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
|> Map.unionsWith (+)
In other words, map each span in spans to 1. Then combine the spans into one map using (+) to combine identical entries.
Next we filter for all the spans that occur in the majority of trees and then throw away the counts (we never really cared about them anyway).
... |> Map.filter (>m)
|> Map.keys
Finally, we sort the sets by size. For the trees above, this gives us:
{a b c d} {b c d} {c d} {a} {b} {c} {d}
Now we can rebuild the tree recursively: for each span, look at the rest of the list for spans that are subsets of the current span. For example, with {a b c d}, the children should be {b c d} and {a}. Here’s the code to do that.
buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
then first (next:) (buildNode (span `Set.difference` next) rest)
else second (next:) (buildNode span rest)
What we have here is a cleverly masked, hand-rolled implementation of the State monad. Err, well actually I suppose that I’m getting ahead of myself. Let me just explain the code as is and later I’ll show the “imperative” version. Let’s start with buildNode since it does the actual matching of {a b c d} with {b c d} {a}.
So: buildNode receives two things:
And: buildNode returns two things:
Number 2 is important because we need to keep this list around. Here’s why: start off finding the children for {a b c d} in the list
{b c d} {c d} {a} {b} {c} {d}
That’s easy:
{b c d} {a}
OK, now find the children for {b c d}. In what list? Well, whatever’s left from the previous call:
{c d} {b} {c} {d}
This gives a new (children,rest) pair:
({c d} {b}, {c} {d})
And the remaining spans {c} {d} will be needed to find the children of {c d}. All right, back to the actual code in buildNode. The (children,rest) pair is (nil,nil) after buildNode runs out of possible child spans to search for.
buildNode _ [] = ([],[]) buildNode span (next:rest) = if Set.isSubsetOf next span then first (next:) (buildNode (span `Set.difference` next) rest) else second (next:) (buildNode span rest)
Otherwise it checks whether the next span is a child of span; that is, if it’s a subset of span. If so, then it removes that subset (with Set.difference) and conses it onto the rest of the children that buildNode has found recursively. On the other hand, if the next span isn’t a child, then buildNode recurs on the rest of the spans and conses it back on afterward.
An aside about first and second. They’re defined in Control.Arrow, which is a module I really don’t understand, but for this use, you can imagine that they are defined like this:
first f (a,b) = (f a, b) second f (a,b) = (a, f b)
In other words, if we call first, we’re doing something to the children of span. If we call second, we’re doing something to the spans that are left to be searched. For example, the second clause of buildNode picks the head off those spans. If the head is not a subset of span, then it sticks it back on with second (next:) after the recursive call returns in the last line.
OK, now we can look at buildTree:
buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildTree has the same inputs as buildNode: a span and the list of spans to search for. It delegates finding the child spans to buildNode and concentrates on building up the tree structure properly. Like buildNode, it returns a pair: (tree, rest).
The non-terminal case operates in two steps:(1) it finds the kids for span by calling buildNode and (2) it recursively builds the tree structure for the kids by calling buildTree.
(1) is simple enough:
let (kids, rest) = buildNode span ranks in
But (2) is pretty hairy, because every time buildTree is called, we have to capture the rest and pass it to the next call of buildTree. That means instead of map buildTree kids we have to use a fold.
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
This particular fold iterates over kids (just like map would have done), and its fold function is basically (buildTree kid : alreadyBuiltNodes) (just like map would have been). Everything has been wrapped in a tuple: the start state is not [] but ([],rest), and the fold function has to wrap the buildTree call in a call to first. It’s just kind of ugly.
But! It works! If you load the preceding code, plus the following examples, you’ll have working consensus trees.
t1 = Node "a" [Node "b" [Node "c" [Leaf "s1", Leaf "s2"], Leaf "s3"],
Leaf "s4"]
t2 = Node "a" [Node "b" [Leaf "s1", Node "c" [Leaf "s2", Leaf "s3"]],
Leaf "s4"]
t3 = Node "a" [Leaf "s1", Node "b" [Node "c" [Leaf "s2", Leaf "s3"],
Leaf "s4"]]
ts = [t1,t2,t3]
-- Now try:
--> consensus ts
Of course this is Haskell, A Large Language, so there is always more than one way to do it. Next time I’ll switch buildTree and buildNode to explicitly use the State monad. You can decide if you like the short lines and frequent mutation of imperative programming.