« Bing Maps APIConsensus trees implemented in Haskell, imperatively »

Reading and writing consensus trees in Haskell

03/03/10

Permalink 02:43:15 pm, 1007 words
Categories: Code, Linguistics

Reading and writing consensus trees in Haskell

I use the consensus tree code I wrote about the last two times in my dissertation to find agreement between clusters of Swedish dialect interviews. The input is cluster trees from R (the open source S) and the output is formatted for use by the qtree Latex package. I also need to filter out cluster trees that contain non-significant distances—all the input distances should be statistically significant.

main = do
  sigs <- withFileLines findSigs "sig-10-1000-interview.csv"
  interactFiles (withFileLines (makeConsensusTree sigs & list)) qtree

makeConsensusTree uses the list of significant files and qtree converts a Tree to a String, suitable for output to stdout. But first, findSigs:

findSigs =
  tail
  & map (replace ',' ' ' & words & tail
         & zip ["path", "trigram", "dep", "psg", "grand", "unigram", "all"]
         & filter (snd & (=="0"))
         & map fst)
  & zip ["r", "r_sq", "kl", "js", "cos"]
  & concatMap (uncurry (zip . repeat))
  & Set.fromList

(Reminder: (&) == flip (.)) findSigs is a hoky CSV parser. It drops the column heads (tail), then chops each line into fields (map (replace ',' ' ') & words & tail), assuming that they don't contain spaces so that it can use words to do the chopping.

Then it zips the hard-coded column heads back on, and throws away fields that aren't (=="0")—that is, feature sets that have more than 0 non-significant distances.

For each line, the significant feature sets are zipped with their with distance measure, and a hilarious and incomprehensible concatMap call turns them into a list of pairs like [("path","r"), ("trigram", "r"), ("path", "js") ...].

I know this is statically typed, but I don't think anybody would hesitate to call this ad-hoc mess a script. It was written in about 5 minutes for a specific file and will break as soon as that file changes.

Moving on.

makeConsensusTree sigs =
  filter (isPrefixOf "Cluster: ")
  & map rReader
  & filter (fst & (`Set.member` sigs))
  & map (snd & buildRTree)
  & consensus

makeConsensusTree is the top-level function for reading in the different dendrograms and writing out a consensus tree. Its first task is to read Yet Another Tree Format, this time R's binary dendrogram format. Each tree is on a line that starts with "Cluster: " (hence, filter (isPrefixOf "Cluster: "). Then the measure and feature set follows, and finally a list of numbers. If you divide those numbers in half (which rReader does), you get two columns.

rReader line = 
  ((measure,features),
   splitAt (floor (fromIntegral (length ns) / 2)) ns |> uncurry zip)
  where (_:measure:features:ss) = words line
        ns = map read ss

For example, if you have the line

Cluster: r all -2 1 -4 2 -4 -9 -8 3

The numbers are interpreted as the following two columns:

1.-2-3
2.1-9
3.-4-8
4.23

buildRTree further interprets these numbers as a binary tree: negative numbers are leaf nodes, (negative) indices into the interview sites or whatever the things are that you're clustering. Positive numbers are internal nodes, indices that refer to a previously created node. The index it refers to is the step at which the node was created. So, in the above example, internal node 1 has two leaf children: values 2 and 3. Internal node 2 has two children. One is node 1 and the other is a leaf (with value 9). The last node, 4, is the root. For the tiny example above, the Lisp representation of the tree is:

'(((2 3)
   9)
  (4 8))
label (Leaf a) = a
label (Node a _) = a
buildRTree tuples = last nodes
  where nodes = map buildNode tuples
        buildNode (i,j) = Node ('$':label inode++label jnode) [inode,jnode]
          where (inode,jnode) = (buildChild i, buildChild j)
        buildChild i | i < 0 = Leaf $ swediaSites !! (-i-1)
                     | otherwise = nodes !! (i-1)

The way that buildRTree implements this is kind of clever; it has to refer to previous parts of the node list as the list is being built. In Haskell, laziness is the most natural way to do this. So when buildChild implements the negative/positive → leaf/non-terminal decision, it refers to nodes for the non-terminal case. But nodes is just map buildNode and buildNode first calls buildChild and oh-no-circular-definition!

But it turns out all right, because, assuming that R has properly dumped the node indices, buildChild's i will always be less than the length of nodes at the time that it's required.

(Of course, you should always avoid (!!) and this particular use is egregiously O(n2) blah blah, but remember once again that this is a script and that I happen to know that I'll never run this on trees with more than 100 terminals. And that I'm willing to wait for up to 5 minutes.)

qtree (Leaf a) = Set.findMax a
qtree (Node a []) = Set.findMax a
qtree (Node a kids) = "[. {" ++ left ++ "} " ++ right ++ " ]"
  where (leaves,nodes) = partition leaf kids
        left = intercalate "\\\\" (map qtree leaves)
        right = unwords (map qtree nodes)

Finally, qtree converts a tree into Yet Another Tree Format, this time the qtree Latex package. I can then paste the string into my dissertation. It's pretty straightforward except for the abuse of Set.findMax—it's not finding anything since there's only one item left in the set anyway. I just don't have a way to destructure sets.

Well, now you've seen how messy script code can be in Haskell. I think it's a pretty good argument that static typing doesn't buy you much up front—I can be as messy and ugly as I want in either Python or Haskell. All you can hope for is that the type system gets out of your way while you're doing it. The win comes in six months when I decide I need to make the code more robust or (more likely) add a tiny feature. Haskell is more discoverable out of the box—no type documentation needed: the compiler figures it out for you—and more reassuring to make changes—less need for tests: many breaking changes will cause the compiler to complain.

In case you notice a tacit unwillingness to write documentation and tests for my script code, you got me. Yup. I wrote tests for some of it, but after the code was done. And documentation? I'm writing a dissertation! It's pages and pages of high-level documentation! (Just don't expect any commentary on actual code.)

1 comment

Comment from: Articles for Submit [Visitor] · http://ambravallo.com
Thanks for reminder to use hoky CSV for chopping.
26/06/10 @ 15:41

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by b2evolution