Pages: 1 2 3 4 5 6 7 8 9 10 11 ... 33 >>
Recently my advisor suggested that I compare my results to travel distance between towns instead of straight-line geographical distance. That seems like a good idea—if there’s a huge lake and a mountain in between two towns, the languages are likely to be very different. The problem is taking the time to gather all the distances; it’s not a big deal to identify 30 or so sites on Google Maps, but when you need to find the 435 distances between all 30 sites, you probably want a script to do it.
The obvious choice for this is an API to the mapping service you were going to use in the first place. For me, that led to a couple of choices:
Google Maps requires me to set up a web page (publicly accessible) and make a visible embedded map, then manipulate it with Javascript. This is really intended for mashups I think. I was disappointed because I prefer to let my knowledge of HTML and Javascript rot and moulder. (I found some discussion that indicates there is no SOAP API, although it’s still possible that I just missed it.)
In contrast, the 3rd Google result for “google maps soap api” is for Bing Maps, admittedly under the previous branding “Virtual Earth". The SOAP API that allows you to pretend (given a Proper Library) that you are calling a function that magically gives distances between two points. The Proper Library manages all the XML conversion/parsing and sending over the Internet. So Bing Maps won this one, because I just want to import a library and call a function from a script instead of embedding the whole thing in a web page. This led to a second choice:
I momentarily considered Haskell, because it does have a lot of libraries listed in Hackage. But it’s pretty weak on the XML front and its lone SOAP library is sort of just a text wrapper around XML that you have to build yourself. No XML conversion or parsing, just the sending-over-Internet part.
From there I went to Python. I quickly found that Dive Into Python has a chapter on SOAP, but it’s 6 or 7 years out of date, and SOAPpy, the library it uses, to has been replaced by ZSI, “Zolera Soap Infrastructure". I distrust these companies that start with Z (*cof* Zope *cof* Zend) and the source was package in a .egg (or maybe a .muffin? no it was definitely .egg) which I don’t even know how to install and it was all just Fear-Uncertainty-Doubt.
The real reason that I wasn’t enthusiastic about installing a Python library is that (1) I probably will never use SOAP again and (2) in the Bing Maps documentation, it mentions that Visual Studio has SOAP libraries installed with it by default. And all the examples are in C#, so it becomes a matter of cut-and-paste. And Visual Studio has a wizard for generating the proxy objects you need in a static language. (Imagine that, Visual Studio having a wizard!) Well, the wizard was the last straw—I mean who can resist all that code the wizards generate for you? Plus it was getting late and I just wanted to get something working. I even resisted the temptation to write the code in F#, because F# is extremely short on wizards right now.
(The code would have been a lot better in F#, though. It has tuples and pattern matching and it’s hard to write a program these days that doesn’t use one or both of those features. You’ll see when I show you the code.)
So let’s see. Here’s the gist of my workflow. I went to Bing Maps and identified all 30 of the interview sites. Then I dumped to some format called KML. Now, Bing Maps doesn’t make it obvious how to give somebody a link to these sites, but the option to dump to KML is front and centre. I suspect that it started life as a desktop product. Anyway, it’s annoying for directions but it’s perfect for my purposes. (Hint: click the icon of a letter in the bottom-left, even if you just want a link to the map.)
I did grep coordinates swedia.kml >pbcopy and pasted that in to an emacs buffer. From there I recorded some macros to produce a Python literal of a list of pairs. With the list loaded in the REPL, I added on the location names to each location, something like this:
[(site,lat,long) for (site,(lat,long)) in zip(consts.swediaSites, coords)]
At this point I had a list in Python whose elements looked like this:
("Ankarsrum", 16.33, 57.70)
Then I pasted the list back into the emacs buffer and added some noise around each line to make it valid C#:
sites.Add(new Loc("Ankarsrum", 16.33, 57.70));
I should have just written a string format comprehension in Python but I was really tired by this point.
This, of course assumes a struct named Loc and a List<Loc> called sites. But that’s easy enough to do in C#. Pardon my outmoded accent—the machines I coded for at Bing had not yet updated to .NET 3.5 so I didn’t really learn a lot of 3.5 conveniences fluently.
struct Loc
{
private string _site;
private double _lat;
private double _long_;
public String Site { get { return _site; } }
public double Lat { get { return _lat; } }
public double Long { get { return _long_; } }
public Loc(String site, double lat, double long_)
{
_site = site;
_lat = lat;
_long_ = long_;
}
}
class Program
{
private static List<Loc> sites = new List<Loc>(30);
private static void init()
{
sites.Add(new Loc("Ankarsrum", 16.331280825605035, 57.700431990646344));
// ... 29 more of these ...
}
OK, so that’s the boring setup. Then I needed a pairwise iteration over a list. This is a lot easier with linked lists and type inference—managing indices is fiddly and the type annotations get unwieldy so I said Forget It and wrote some C-like code instead. I didn’t want to fiddle with it to get it right. (But it is pretty ugly).
private static IEnumerable<int[]> pairwise(int n)
{
for (int i = 0; i < n; i++)
{
for (int j = i + 1; j < n; j++)
{
yield return new int[] { i, j };
}
}
}
Finally, the code in Main is not all that long as badly written C# goes:
static void Main(string[] args)
{
init();
var req = new BingMaps.RouteRequest();
req.Credentials = new BingMaps.Credentials();
req.Credentials.ApplicationId =
"Very-Long-String-You-Get-By-Registering-With-Bing";
BingMaps.Waypoint[] route = new BingMaps.Waypoint[2];
req.Waypoints = route;
var client = new BingMaps.RouteServiceClient(
"BasicHttpBinding_IRouteService");
foreach (int[] indices in pairwise(30))
{
var from = sites[indices[0]];
var to = sites[indices[1]];
route[0] = new BingMaps.Waypoint()
{
Description = from.Site,
Location = new BingMaps.Location() {
Latitude = from.Long, Longitude = from.Lat
}
};
route[1] = new BingMaps.Waypoint()
{
Description = to.Site,
Location = new BingMaps.Location() {
Latitude = to.Long, Longitude = to.Lat
}
};
var d = client.CalculateRoute(req).Result.Summary.Distance;
Console.WriteLine(string.Format("(\"{0}\", \"{1}\"): {2},",
from.Site, to.Site, d));
}
}
I added all the type inference it could handle and shortened some of the ridiculous example names. The API is still pretty verbose but I think that’s because it’s SOAP. Let’s see, you have to
If you specify two points that do not have a connecting route, then you get a hilarious null pointer exception somewhere along the chain CalculateRoute().Result.Summary.Distance. It seems to me that this should be an exception, but no, just null.
Really, though, the only reason that I got null pointer exceptions is that KML stores its points in Longitude, Latitude order, while the rest of the world uses Latitude, Longitude order. So instead of two Swedish villages, I requested the distance between two points in the Arab Sea. The alert reader will notice the switcheroo I have to pull to make this work. I should have changed the Loc constructor but I didn’t think of that until just now.
The code takes about 100 seconds to get all 435 distances. It’s faster than I expected, so I guess most of the lag in the web interface is the browser drawing the route.
In conclusion, this probably won’t be useful to anybody else, but at least maybe you’ll think it’s cool to see how you can easily get driving distances between all unique pairs of sites.
I use the consensus tree code I wrote about the last two times in my dissertation to find agreement between clusters of Swedish dialect interviews. The input is cluster trees from R (the open source S) and the output is formatted for use by the qtree Latex package. I also need to filter out cluster trees that contain non-significant distances—all the input distances should be statistically significant.
main = do sigs <- withFileLines findSigs "sig-10-1000-interview.csv" interactFiles (withFileLines (makeConsensusTree sigs & list)) qtree
makeConsensusTree uses the list of significant files and qtree converts a Tree to a String, suitable for output to stdout. But first, findSigs:
findSigs =
tail
& map (replace ',' ' ' & words & tail
& zip ["path", "trigram", "dep", "psg", "grand", "unigram", "all"]
& filter (snd & (=="0"))
& map fst)
& zip ["r", "r_sq", "kl", "js", "cos"]
& concatMap (uncurry (zip . repeat))
& Set.fromList
(Reminder: (&) == flip (.)) findSigs is a hoky CSV parser. It drops the column heads (tail), then chops each line into fields (map (replace ',' ' ') & words & tail), assuming that they don't contain spaces so that it can use words to do the chopping.
Then it zips the hard-coded column heads back on, and throws away fields that aren't (=="0")—that is, feature sets that have more than 0 non-significant distances.
For each line, the significant feature sets are zipped with their with distance measure, and a hilarious and incomprehensible concatMap call turns them into a list of pairs like [("path","r"), ("trigram", "r"), ("path", "js") ...].
I know this is statically typed, but I don't think anybody would hesitate to call this ad-hoc mess a script. It was written in about 5 minutes for a specific file and will break as soon as that file changes.
Moving on.
makeConsensusTree sigs = filter (isPrefixOf "Cluster: ") & map rReader & filter (fst & (`Set.member` sigs)) & map (snd & buildRTree) & consensus
makeConsensusTree is the top-level function for reading in the different dendrograms and writing out a consensus tree. Its first task is to read Yet Another Tree Format, this time R's binary dendrogram format. Each tree is on a line that starts with "Cluster: " (hence, filter (isPrefixOf "Cluster: "). Then the measure and feature set follows, and finally a list of numbers. If you divide those numbers in half (which rReader does), you get two columns.
rReader line =
((measure,features),
splitAt (floor (fromIntegral (length ns) / 2)) ns |> uncurry zip)
where (_:measure:features:ss) = words line
ns = map read ss
For example, if you have the line
Cluster: r all -2 1 -4 2 -4 -9 -8 3
The numbers are interpreted as the following two columns:
| 1. | -2 | -3 |
| 2. | 1 | -9 |
| 3. | -4 | -8 |
| 4. | 2 | 3 |
buildRTree further interprets these numbers as a binary tree: negative numbers are leaf nodes, (negative) indices into the interview sites or whatever the things are that you're clustering. Positive numbers are internal nodes, indices that refer to a previously created node. The index it refers to is the step at which the node was created. So, in the above example, internal node 1 has two leaf children: values 2 and 3. Internal node 2 has two children. One is node 1 and the other is a leaf (with value 9). The last node, 4, is the root. For the tiny example above, the Lisp representation of the tree is:
'(((2 3) 9) (4 8))
label (Leaf a) = a
label (Node a _) = a
buildRTree tuples = last nodes
where nodes = map buildNode tuples
buildNode (i,j) = Node ('$':label inode++label jnode) [inode,jnode]
where (inode,jnode) = (buildChild i, buildChild j)
buildChild i | i < 0 = Leaf $ swediaSites !! (-i-1)
| otherwise = nodes !! (i-1)
The way that buildRTree implements this is kind of clever; it has to refer to previous parts of the node list as the list is being built. In Haskell, laziness is the most natural way to do this. So when buildChild implements the negative/positive → leaf/non-terminal decision, it refers to nodes for the non-terminal case. But nodes is just map buildNode and buildNode first calls buildChild and oh-no-circular-definition!
But it turns out all right, because, assuming that R has properly dumped the node indices, buildChild's i will always be less than the length of nodes at the time that it's required.
(Of course, you should always avoid (!!) and this particular use is egregiously O(n2) blah blah, but remember once again that this is a script and that I happen to know that I'll never run this on trees with more than 100 terminals. And that I'm willing to wait for up to 5 minutes.)
qtree (Leaf a) = Set.findMax a
qtree (Node a []) = Set.findMax a
qtree (Node a kids) = "[. {" ++ left ++ "} " ++ right ++ " ]"
where (leaves,nodes) = partition leaf kids
left = intercalate "\\\\" (map qtree leaves)
right = unwords (map qtree nodes)
Finally, qtree converts a tree into Yet Another Tree Format, this time the qtree Latex package. I can then paste the string into my dissertation. It's pretty straightforward except for the abuse of Set.findMax—it's not finding anything since there's only one item left in the set anyway. I just don't have a way to destructure sets.
Well, now you've seen how messy script code can be in Haskell. I think it's a pretty good argument that static typing doesn't buy you much up front—I can be as messy and ugly as I want in either Python or Haskell. All you can hope for is that the type system gets out of your way while you're doing it. The win comes in six months when I decide I need to make the code more robust or (more likely) add a tiny feature. Haskell is more discoverable out of the box—no type documentation needed: the compiler figures it out for you—and more reassuring to make changes—less need for tests: many breaking changes will cause the compiler to complain.
In case you notice a tacit unwillingness to write documentation and tests for my script code, you got me. Yup. I wrote tests for some of it, but after the code was done. And documentation? I'm writing a dissertation! It's pages and pages of high-level documentation! (Just don't expect any commentary on actual code.)
Last time I showed a functional implementation of consensus trees that ended up having to pass state around. It was kind of ugly. Fortunately, this pattern is common in Haskell, so there is a wrapper for it in the form of the State monad.
If you haven’t used the State monad before, it gives you one (1) variable—it may not sound like much, but the number was zero (0) before. All state-changing functions implicitly operate on this variable. This makes your code look a lot like Perl code (where the one variable is named $_, if you ever bother naming it). I apologise in advance if this gives you fever or convulsions.
The only code that changes is consensus, buildNode and buildTree, so you’ll have to refer to the previous post for the rest of the code. Here are the functional originals:
consensus trees = majority (map root trees) |> buildTree |> fst
buildTree [span] = (Leaf span, [])
buildTree (span:ranks) = let (kids, rest) = buildNode span ranks in
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree (span:rest))
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
then first (next:) (buildNode (span `Set.difference` next) rest)
else second (next:) (buildNode span rest)
Before changing this code to use the State monad, let’s get started with some utility code. I think it’ll help you understand State if you haven’t worked with it before.
isStackEmpty = return . (==[]) =<< get peek = return . head =<< get pop = do first:rest <- get put rest return first push x = put . (x:) =<< get ifM cond seq alt = do b <- cond if b then seq else alt
If you remember from last time, we were passing around a tuple. The second element of the tuple always holds the remaining spans that are potential children for the current span. Here, that list of spans is stuffed into the State monad. Then, the State accessor functions like get and put let you mutate this list.
Internally, I believe that State is doing exactly the same tuple-passing trick as the original code. At least, that’s the way it’s always been explained to me. I suppose it could be creating a Real Mutable Cell for me. It would come out to the same thing.
Anyway, isStackEmpty is representative: get the span list, check to see if it’s empty, and return. peek, push and pop are variants on the same thing.
ifM just embeds if in a monad. Not too much to say about it.
OK, let’s port that code to the State monad, starting from the top down. consensus doesn’t change much. It runs evalState instead of fst, and that’s about it. evalState takes two arguments: a function to run in the State monad, and the initial value of the state variable. It returns the value of the function and throws away the state variable. (Conversely, execState throws away the value of the function and returns the state variable.)
consensus trees = evalState (buildTree span) ranks where (span:ranks) = majority (map root trees)
Next up is buildTree:
buildTree span =
ifM isStackEmpty
(return (Leaf span))
(do
kids <- buildNode span
nodes <- mapM buildTree kids
return (Node span nodes))
In buildTree, pattern matching goes away but the intent of isStackEmpty is the same. The true branch is also basically the same. However, the false branch is markedly simpler and the order of evaluation is clear: first find the current span’s kids, then build the nodes for each of the kids, then put them together in a new Node. All the shenanigans with first and foldl go away, because mapM gets the State monad to do all the tuple mangling in the background.
Of course, if you like putting everything on one line (like me), you can do that too…after all, do is just sugar for a bunch of higher order functions.
buildTree span =
ifM isStackEmpty
(return (Leaf span))
((return . Node span <=< mapM buildTree) =<< buildNode span)
Note that, as ugly as the one-liner is, if it weren’t monadic, it would be
(return . Node span . mapM buildTree) $ buildNode span
Haskell just forces us to mark compose and apply differently if they’re monadic or not. ((.) vs (<=<) and ($) vs (=<<)). This means that you have to learn two versions of some functions and know when to use them. On the plus side, you can be very sure where you are accessing state.
Whether this is important is a philosophical/cultural question more than a linguistic one: we’re edging into Sapir-Whorf territory here. For comparison, when writing C, I always know when I’m referring to something by reference versus by value because I have to very carefully mark pointers. There are even two syntaxes for attribute access (x.y vs p->y). Again, whether this is important should be decided mostly by your problem—it’s not possible to say whether marking by-reference is universally more important than marking state changes.
OK, end of digression. Let’s see how buildNode changes:
buildNode span =
ifM isStackEmpty
(return [])
(ifM (return . (`Set.isSubsetOf` span) =<< peek)
(do
rank <- pop
span' <- buildNode (span `Set.difference` rank)
return (rank:span'))
(do
rank <- pop
span' <- buildNode span
push rank
return span'))
Again, the empty check and terminal case are pretty much the same. However, the subset check is more complicated than before because of more monadic composition.
As with buildTree, the order of events is much clearer. Especially in the non-subset case, it’s much clearer that you pop off the current span, look at the rest, then push it back on when the recursive call returns. The difference between the two cases is maybe not quite as clear since they don’t sit on adjacent lines.
And of course you can write more of buildNode on one line too*:
buildNode span =
ifM isStackEmpty
(return [])
(ifM (return . (`Set.isSubsetOf` span) =<< peek)
(liftM2 (:) peek (buildNode . Set.difference span =<< pop))
(do
rank <- pop
span' <- buildNode span
push rank
return span'))
So that’s the imperative version. It’s easier to read, but I had more trouble thinking about how to write it. It was easier to stop and think about the types when I was manually passing around the state. Also, the type errors while using State seemed pretty hard to understand. Probably I haven’t used it enough to get used to them.
Next time, the stupid script wrapper that goes around the whole thing. Not particularly impressive, but I haven’t blogged any of my small Haskell “scripts” that do actual file-to-file transformations. Usually pretty simple, but just complicated enough that I want to use Haskell instead of Python.
*If you saw cond in there, you’re not alone:
t = return True
condM [] = error "Should have used t"
condM ((test,action):conditions) = ifM test action (condM conditions)
buildNode' span =
condM [(isStackEmpty, return [])
,(return . (`Set.isSubsetOf` span) =<< peek,
(liftM2 (:) peek (buildNode' . Set.difference span =<< pop)))
,(t, do
rank <- pop
span' <- buildNode' span
push rank
return span')]
An implementation of consensus trees in Haskell
I’ve used hierachical clusters for the last few years as a part of my analysis of distance measures. But they are not super reliable—small changes tend to make corpora jump from cluster to cluster. One way to fix this is to introduce noise into the data and build a consensus tree. A majority consensus tree only creates clusters that the majority of hierarchical clusters agree on.

[PDF] Example of a single hierarchical cluster dendrogram
The problem is that this is a well-understood but obscure 30-year-old concept. You know what that means: the only off-the-shelf implementations are either (1) twenty years old (2) hopelessly obfuscated variants focussing on efficiency or (3) part of some custom research code that some scientists just threw online.
Yeah, so you’re going to get (3) in this post. Actually it turns out to be pretty easy to build consensus trees, so I guess that’s why there aren’t any good implementations out there.
data Tree a = Leaf a | Node a [Tree a] deriving (Show) root tree = Node "ROOT" [Leaf "s0", tree] leaf (Leaf _) = True leaf (Node _ []) = True -- irregularities in buildNode, oh well leaf _ = False
The shenanigans start early with two representations of leaf nodes sneaking in. I should probably just standardise on the (Node _ []) representation but having a Leaf constructor around makes literals easier to type. Fighting for the side of consistency, root provides a consistent root in case NONE of the trees can agreement on a root.
consensus trees = buildTree span rest |> fst
where (span:rest) = majority (map root trees)
majority trees = trees
|> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
|> Map.unionsWith (+)
|> Map.filter (>m)
|> Map.keys
|> sortBy (comparing (negate . Set.size))
where m = floor (fromIntegral (length trees) / 2)
spans (Leaf a) = Set.singleton $ Set.singleton a
spans (Node a kids) = Set.insert (Set.unions $ Set.toList kidspans) kidspans
where kidspans = Set.unions $ map spans kids
This is the first half of the code: consensus finds the majority spans and rebuilds them into a tree. Spans are fairly simple: it’s just the leaves that a particular node dominates. For example, the green node in the first tree has the span {a b c} and the green node in the second tree has the span {b c}.


spans represents spans with sets so that it can glom all the subspans together with a big Set.unions and be sure that there aren’t any dupes. The recursive case figures the spans for all kids, then unions them together, calls that the parent’s span and adds it to the kids’ spans.
Since Haskell doesn’t have list literals, the output is kind of messy, but if you convert to lists, the output from the example tree is:
[["a"] , ["a", "b", "c"] , ["a", "b", "c", "d"] , ["b"] , ["b", "c"] , ["c"] , ["d"]]
Now if you run that on the following three trees, you’ll get three slightly different sets.



Once you combine those sets, the question is which spans are present in the majority of trees. For example, {a b c d} is present in all three trees. But {a b} is only present in tree 1, while {b c} is present in both tree 2 and tree 3. So {b c} will be part of the consensus tree (2/3 > 50%), but {a b} won’t (1/3 < 50%).
After calling spans, majority combines everything into one map via some gymnastics:
... |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
|> Map.unionsWith (+)
In other words, map each span in spans to 1. Then combine the spans into one map using (+) to combine identical entries.
Next we filter for all the spans that occur in the majority of trees and then throw away the counts (we never really cared about them anyway).
... |> Map.filter (>m)
|> Map.keys
Finally, we sort the sets by size. For the trees above, this gives us:
{a b c d} {b c d} {c d} {a} {b} {c} {d}
Now we can rebuild the tree recursively: for each span, look at the rest of the list for spans that are subsets of the current span. For example, with {a b c d}, the children should be {b c d} and {a}. Here’s the code to do that.
buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
then first (next:) (buildNode (span `Set.difference` next) rest)
else second (next:) (buildNode span rest)
What we have here is a cleverly masked, hand-rolled implementation of the State monad. Err, well actually I suppose that I’m getting ahead of myself. Let me just explain the code as is and later I’ll show the “imperative” version. Let’s start with buildNode since it does the actual matching of {a b c d} with {b c d} {a}.
So: buildNode receives two things:
And: buildNode returns two things:
Number 2 is important because we need to keep this list around. Here’s why: start off finding the children for {a b c d} in the list
{b c d} {c d} {a} {b} {c} {d}
That’s easy:
{b c d} {a}
OK, now find the children for {b c d}. In what list? Well, whatever’s left from the previous call:
{c d} {b} {c} {d}
This gives a new (children,rest) pair:
({c d} {b}, {c} {d})
And the remaining spans {c} {d} will be needed to find the children of {c d}. All right, back to the actual code in buildNode. The (children,rest) pair is (nil,nil) after buildNode runs out of possible child spans to search for.
buildNode _ [] = ([],[]) buildNode span (next:rest) = if Set.isSubsetOf next span then first (next:) (buildNode (span `Set.difference` next) rest) else second (next:) (buildNode span rest)
Otherwise it checks whether the next span is a child of span; that is, if it’s a subset of span. If so, then it removes that subset (with Set.difference) and conses it onto the rest of the children that buildNode has found recursively. On the other hand, if the next span isn’t a child, then buildNode recurs on the rest of the spans and conses it back on afterward.
An aside about first and second. They’re defined in Control.Arrow, which is a module I really don’t understand, but for this use, you can imagine that they are defined like this:
first f (a,b) = (f a, b) second f (a,b) = (a, f b)
In other words, if we call first, we’re doing something to the children of span. If we call second, we’re doing something to the spans that are left to be searched. For example, the second clause of buildNode picks the head off those spans. If the head is not a subset of span, then it sticks it back on with second (next:) after the recursive call returns in the last line.
OK, now we can look at buildTree:
buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildTree has the same inputs as buildNode: a span and the list of spans to search for. It delegates finding the child spans to buildNode and concentrates on building up the tree structure properly. Like buildNode, it returns a pair: (tree, rest).
The non-terminal case operates in two steps:(1) it finds the kids for span by calling buildNode and (2) it recursively builds the tree structure for the kids by calling buildTree.
(1) is simple enough:
let (kids, rest) = buildNode span ranks in
But (2) is pretty hairy, because every time buildTree is called, we have to capture the rest and pass it to the next call of buildTree. That means instead of map buildTree kids we have to use a fold.
first (Node span) (foldl buildKids ([],rest) kids)
where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
This particular fold iterates over kids (just like map would have done), and its fold function is basically (buildTree kid : alreadyBuiltNodes) (just like map would have been). Everything has been wrapped in a tuple: the start state is not [] but ([],rest), and the fold function has to wrap the buildTree call in a call to first. It’s just kind of ugly.
But! It works! If you load the preceding code, plus the following examples, you’ll have working consensus trees.
t1 = Node "a" [Node "b" [Node "c" [Leaf "s1", Leaf "s2"], Leaf "s3"],
Leaf "s4"]
t2 = Node "a" [Node "b" [Leaf "s1", Node "c" [Leaf "s2", Leaf "s3"]],
Leaf "s4"]
t3 = Node "a" [Leaf "s1", Node "b" [Node "c" [Leaf "s2", Leaf "s3"],
Leaf "s4"]]
ts = [t1,t2,t3]
-- Now try:
--> consensus ts
Of course this is Haskell, A Large Language, so there is always more than one way to do it. Next time I’ll switch buildTree and buildNode to explicitly use the State monad. You can decide if you like the short lines and frequent mutation of imperative programming.
I believe in the Golden Rule, which is (for those not current on the first gospel) "do for others what you'd like them to do for you". And I like to read code. But it's not really fair to ask to see other people's code without showing mine too. So here's my dissertation code. Well, the part that's unique to my dissertation at least. And minus all the Haskell code that does all the pre-, post- and interstitial processing, and the Python glue that runs it all.
Some caveats: this is fairly simple code, but it's surrounded by a lot of infrastructure. It has to work fast and elegance is not the first priority. Oh, did I mention that it's in C++? So elegance CAN'T BE the first priority. I'm not a native C++er so the code reads a lot like the Caml original. I kind of wish I'd stuck with Caml, but at the time I was trying to reduce memory use at all costs.
Note: The source files don't mention it, but the code is available under the BSD licence.
The last caveat is that if you want to know the purpose of this code, you really need to read one of my papers or John Nerbonne and Wybo Wiersma's papers. I will try to explain along the way, but no guarantees about the explanation's quality. I apologise ahead of time.
OK, so the core code is in "icecore.h". For each of the three programs, I have three small cpp files that #include "icecore.h" and each do something slightly different. It's not the greatest setup—I suppose I should have the implementations separate from the .h in icecore.cpp—but it's simple and works well with my Python wrapper.
#include <string> #include <list> #include <ext/hash_map> #include <iostream> #include <fstream> #include <cmath> #include "params.h" #define SIGNIFICANCE ITERATIONS / 20
The only interesting part of this code is "params.h". My Python script writes a new version of params.h for each variant of measure/features/iterations/etc. ITERATIONS is one of those parameters, so then SIGNIFICANCE is ITERATIONS / 20. It bothers me for constants to randomly pop up in my code, un-namespaced or listed in an import statement, but this seems to be a natural part of life in C/C++. Experts can correct me if I'm wrong—maybe C++'s practise is to stick constants in their own namespace.
namespace __gnu_cxx { // culled from the internet.
template<> // just wraps hash<char *>, which is entirely logical
struct hash<std::string> { // and should have already been done
hash<char *> h;
size_t operator()(const std::string& s) const {
return h(s.c_str());
}
};
}
Actually, one of those includes hides another tricky thing: including a hash map instead of a tree map. Yeah, I know you were thinking "C++, that's an imperative language. The STL will provide a hash map implementation!". To you I say HA. HA HA. Of course not! The STL provides a TREE map, which has log(n) insertion and lookup and is appropriate for use in functional code. That makes it pretty out of place in C++.
So now, yeah, I can tell you are thinking "GCC, it provides a hash map implementation. I bet it conveniently interoperates with all the STL types." To you I now shake my head with tears. Tears of failed compilations. No, it turns out that you have to manually specialise the hash to work with strings. (you know, those things that you ALWAYS use as hash keys) Strings don't work out of the box. Fortunately there are a lot of people using C++, so it's fairly simple to copy/paste from other blogs.
using namespace std;
using namespace __gnu_cxx;
/* Ara */
double sum(const vector<double>& a) {
double total = 0.0;
for(int i=0; i < a.size(); i++)
total += a[i];
return total;
}
/* Hash */
hash_map<string,double, hash<string> > count (const vector<string>& a) {
hash_map<string,double,hash<string> > h;
for(int i=0; i < a.size(); i++) {
h[a[i]]++;
}
return h;
}
Now for a couple of utilities. sum is straightforward. count is what I now call histogram: count how many times each string happens in a list. It's the core of the program really: at its heart, the statistical classifier basically counts how many times the string X happens in dialect 1 and compares how many times X happens in dialect 2. Everything else is just extra to make sure the counts are actually meaningful.
Those "Ara" and "Hash" comments are remnants from the much larger utility libraries I wrote in Caml. All that stuff went away as I simplified for my limited knowledge of C++ (also known as: "I hate #include").
/* random classes from Stroustrup's 3rd edition */
class Randint {
unsigned long randx;
public:
Randint(long s = 0) { randx=s; }
void seed(long s) { randx=s; }
//magic numbers chosen to use 31 bits of a 32-bit long
static long abs(long x) {return x & 0x7fffffff; }
static double max() { return 2147483648.0; } // note: a double
long draw() { return randx = randx * 1103515245 + 12345; }
double fdraw() { return abs(draw()) / max(); } // in the interval [0,1]
long operator()() { return abs(draw()); } // in the interval [0,2**31]
};
class Urand : public Randint { // uniform distro in the interval [0:n[
public:
Urand(long s = 0) : Randint(s) { }
long next(long n) { long r = long(n*fdraw()); return (r==n) ? n-1: r; }
};
template <class a> a choice(const vector<a>& l) {
static Urand uni((unsigned)time(0));
return l[uni.next(l.size())];
}
So yeah, it turns out that C++ also doesn't have a reliable way to generate pseudo-random integers in a certain range. Fortunately, Bjarne Stroustrup provides a couple of classes to do exactly that in The C++ Programming Language. The code looks a little too simple to be real, but, from what I've read, this does appear to implement one of the basic pseudo-random number generators.
typedef vector<string> strings; // I can't get these two to use const string
typedef vector<vector<string> > dialect;
typedef hash_map<string, double, hash<string> > entry;
typedef hash_map<const string, pair<double,double>, hash<string> > sample; // but this is fine. ??
typedef hash_map<const string, vector<pair<double,double> >, hash<string> >
samples;
typedef vector<pair<double,double> > totals;
template <class T, class U> T fst(pair<T,U> p) { return p.first; }
template <class T, class U> U snd(pair<T,U> p) { return p.second; }
template <class T> vector<T> concat(const vector<vector<T> >& as) {
vector<T> acc;
for(int i=0; i < as.size(); i++) {
acc.insert(acc.end(), as[i].begin(), as[i].end());
}
return acc;
}
Also I wrote fst, snd and concat. Told you this was a Caml port.
// = concat (maptimes (fun () -> choice dialect) 1000)
template <class T> vector<T> permutation(const vector<vector<T> >& dialect) {
vector<T> acc;
for(int i = 0; i < SAMPLES; i++) {
vector<T> tmp = choice(dialect);
acc.insert(acc.end(), tmp.begin(), tmp.end());
}
return acc;
}
Speaking of Caml, permutation's leading comment was the Caml code. Ahh, those were the days. Anyway, permutation is a bad name: it's actually a 1000-sentence sample from a corpus. A corpus is a vector of sentences, which are themselves vectors of strings, hence the type: vector<vector<T> >. I don't know why I didn't just use the typedef names. Wait, that's because I wanted it to be generic, which the typedefs are not. I don't know why I wanted that. No sense in trying to reuse code around here—it's a lost cause.
Anyway, choice(dialect) chooses a single sentence randomly. The words of the sentence are then inserted into the accumulator. (You can see why I didn't call it sample: I had already used it as a typedef: zip_ref below returns a sample.)
sample zip_ref (entry& h, entry& h2) {
// I can't get these to be const entry&; apparently there is some
// trouble with mixing iterator types in the h2[i->first] and h[i->first]
sample h3;
for(entry::const_iterator i = h.begin(); i!=h.end(); i++) {
h3[i->first] = make_pair(i->second, h2[i->first]);
}
for(entry::const_iterator i = h2.begin(); i!=h2.end(); i++) {
entry::const_iterator loc = h.find(i->first);
if(loc==h.end()) {
h3[i->first] = make_pair(h[i->first], i->second);
}
}
return h3;
}
I don't understand C++'s type system. I can't pass const entry& h, but I can get a const_iterator from it.
Basically zip_ref combines two hash maps created by count (remember, that's a map from string to int), returning a map from string to pair of int,int (which is a sample). The first loop fills in all the values that are present in at least h. The second loop fills in all the values that are present only in h2. It checks to make sure that it only fills in values that don't exist in h. That's what comparing h.find(i->first)==h.end() does. But this is not necessary because C++'s hash_map::operator[] uses the int default constructor if the value is not in h. In fact, I rely on that inside the if, so all the if does is save some redundant overwrites.
Hmf. This probably isn't worth it, because I have to copy the whole hash anyway when I return from zip_ref. I should really get rid of the whole check inside the second loop. As I try to remember why I did it this way, I'm pretty sure I can blame this on the Caml original too. C++ has default constructors, which Caml doesn't.
sample normalise(const strings& a, const strings& b,
int iterations=5, size_t total_types=0) {
entry tmp1 = count(a); // Because zip_ref takes entry&, not const entry&,
entry tmp2 = count(b); // I have to provide real l-values to zip_ref
sample ab = zip_ref(tmp1, tmp2); // that's what Bjarne Stroustrup said anyway
double tokens_a = a.size();
double tokens_b = b.size();
double tokens = tokens_a + tokens_b;
double types = tmp1.size() + tmp2.size();
double ci, fa, fb, f;
for(int i = 0; i < iterations; i++) {
for(sample::iterator i = ab.begin(); i!=ab.end(); i++) {
ci = i->second.first + i->second.second;
fa = i->second.first / len_a;
fb = i->second.second / len_b;
f = fa + fb;
i->second.first = (ci * fa / f) * 2 * types / tokens;
i->second.second = (ci * fb / f) * 2 * types / tokens;
}
}
return ab;
}
Normalise tries to undo some common skewing of the data so that the really interesting features are not drowned out by a few uninteresting-but-common features. It normalises for two properties.
The first is for sentence length: when you sample 1000 sentences, some sentences are longer than others. The corpus with longer sentences will have higher features counts just because it has more words per sentence. To get around this, each feature is scaled by its total corpus size:
// 0. Find total count ci = i->second.first + i->second.second; // 1. Convert to frequency fa = i->second.first / tokens_a; fb = i->second.second / tokens_b; // 2. Find total frequency f = fa + fb // 3. Scale by total count and total frequency cia = fa * ci / f cib = fb * ci / f
For example, say you have a feature f that happens 8 times in corpus A and 10 times in corpus B. "Oh ho!" you say (if you are pretending to be overbearingly English), "Oh ho! f is 0.8 times as important in A as it is in B." (Being careful to italicise the f, which is difficult in speech but barely possible if you try hard enough).
But wait! What if you knew that corpus A only had 40 words while corpus B had 100? Now that 8 looks a lot more important. So it's better to use frequencies: 8 / 40 = 0.2 and 10 / 100 = 0.1. It turns out that f is twice as important in A as it is in B. But frequencies will mess up later steps of the normalisation, so let's share the original ci = 10 + 8 occurences between them in proportion to their important. So cia = (0.2 / 0.3) * 18 = 12 while cib = (0.1 / 0.3) * 18 = 6.
The other normalisation is for corpus variety. If a corpus has a lot of variety, it will have a lot of features that only occur a few times. On the other hand, if a corpus is fairly boring, it will have a few features that occur a lot. This normalisation pushes up the rare-but-interesting features from the variety corpus so they don't get drowned out by the common features from the boring corpus.
To do this, it calculates three values:
// 1. The number of features double tokens = tokens_a + tokens_b; // 2. The number of types of features in both A and B double types = tmp1.size() + tmp2.size(); // 3. The total scaled count cia *= 2 * types / tokens; cib *= 2 * types / tokens;
The best way to think about this is that we are finding the average count of tokens per type, then scaling each token to find out how many times average it occurs. This scaling should make distances between sites depend more on the interesting features: the ones that occur more than average, instead of just the ones that occur more.
Going back to the previous example, imagine that corpus A (from the previous example) has only 5 types of features for its 40 total features. A pretty boring corpus—probably somebody talking about the weather for minutes on end. In contrast, let corpus B have 30 types for its 100 total features. This is an articulate interviewee, capable of using all types of crazy grammatical constructions in a short time.
This means together, corpus A and B have a combined variety of 2 tokens per type. (2 = (100 + 40) / ((30+5) * 2) = 140 / 35 * 2 = 140 / 70). The extra division by 2 comes in because 100+40 and 30+5 are the averages for both corpora at once. To get the average for just one, divide by two.
So, since the combined average count is 2 tokens per type, a feature type that happens 12 times is 6 times more important than average. So, in corpus A, f will be 6 = 12 / 2 = 12 / ((100+40) / ((30+5) * 2)) = ... Likewise, f becomes 3 in corpus B.
Anyway, math explanation aside, the structure of normalise is pretty bad—it handles all the conversion of raw features into fully counted maps AND does all the normalisation. The problem is that the normalisation needs lots of information, like the number of features, the number of feature types, etc. So normalise has way to much responsibility because it needs to know too much. (And I don't want to make a separate struct to hold a bunch of info for passing around.)
Now for the specific distance measures. They're pretty simple. Like, so simple that I've copy-pasted new ones from the old ones and got them to work without a single compile error. I know. That never happens, especially in C++. The reason that they're so simple is that once the sample is created and normalised, all that's really left to do is sum in some fancy way. (The simplest, R is just sum . (map (abs . (-))).)
double r(const sample& c) {
double total = 0.0;
for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
total += abs(i->second.first - i->second.second);
}
return total;
}
inline double sqr(double d) { // or std::pow in cmath
return d*d;
}
double r_sq(const sample& c) {
double total = 0.0;
for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
total += sqr(i->second.first - i->second.second);
}
return total;
}
// this is actually KL symmetric divergence because it measures
// KL(P||Q) + KL(Q||P)
double kl(const sample& c) {
double total = 0.0;
for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
if(i->second.first != 0 && i->second.second != 0) {
total += i->second.first * log(i->second.first / i->second.second);
total += i->second.second * log(i->second.second / i->second.first);
}
}
return total;
}
// this is Jensen-Shannon divergence, which is based on KL divergence
double js(const sample& c) {
double total = 0.0;
for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
if(i->second.first != 0 && i->second.second != 0) {
double middle = (i->second.first + i->second.second) / 2;
total += i->second.first * log(i->second.first / middle) / 2;
total += i->second.second * log(i->second.second / middle) / 2;
}
}
return total;
}
double cos(const sample& c) {
double a_sq = 0.0;
double b_sq = 0.0;
double total = 0.0;
for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
a_sq += i->second.first * i->second.first;
b_sq += i->second.second * i->second.second;
total += i->second.first * i->second.second;
}
return 1 - total / (sqrt(a_sq) * sqrt(b_sq));
}
At one point I had Caml-equivalance comments before all the functions, but they made me feel sad so I took them out. If you want to know about Kullbeck-Leibler divergence or Jensen-Shannon divergence, Wikipedia has you covered. Cosine (dis)similarity is also pretty well understood.
bool comparepermutation(const dialect& a, const dialect& b) {
dialect both_ab(a);
both_ab.insert(both_ab.end(), b.begin(), b.end());
int gt = 0;
for(int i=0; i < ITERATIONS; i++) {
sample total = normalise(permutation(a), permutation(b), 5);
double total_r = R_MEASURE(total);
size_t total_types = total.size();
double perm_r =
R_MEASURE(normalise(permutation(both_ab), permutation(both_ab), 5));
if(perm_r > total_r) {
cout << '-' << flush;
gt++;
if(gt > SIGNIFICANCE) return false;
} else
cout << '.' << flush;
}
cout << endl;
return true;
}
OK, one last processing function: a permutation test. A permutation test tests for statistical significance. Here, the null hypothesis is that there the distance produced by r could be produced by ANY two random sites. To test the null hypothesis, I combine corpora A and B to both_ab, then I sample twice from both_ab. If the null hypothesis is false, then this distance between these two will be less than the distance between A and B.
Of course to make sure this is significant for p < 0.05, I have to run it at least 20 times. That's what ITERATIONS is (and SIGNIFICANCE is ITERATIONS / 20). R_MEASURE actually varies over all the distance measures I showed above, so I guess it should be just called MEASURE. Oh well.
Anyway, the loop can break as soon as more than 5% of the random mix distances are larger the original distance—p is no longer less than 0.05.
Yeah, so the last few functions just read in files and kick off the process. Nothing too complicated here, just a nested loop to read in sentences, one word per line, separated by ***.
pair<string, vector<vector<string> > > readfile(const char* filename) {
ifstream f(filename);
if(!f) {
cout << filename << " is not found." << endl;
throw filename;
}
string lang;
getline(f,lang);
vector<vector<string> > sss;
vector<string> ss;
string s;
while(f) {
getline(f,s);
if(s=="***") {
sss.push_back(ss);
ss = vector<string>();
} else {
ss.push_back(s);
}
}
sss.push_back(ss);
return make_pair(lang, sss);
}
Finally, in icesig.cpp:
#include "icecore.h"
int main(int argc, char** argv) {
if(argc==2) {
pair<string, vector<vector<string > > > l1 = readfile(argv[1]);
cout << "Control: " << l1.first << endl;
cout << comparepermutation(l1.second, l1.second) << endl;
} else if (argc==3) {
pair<string, vector<vector<string > > > l1 = readfile(argv[1]);
pair<string, vector<vector<string > > > l2 = readfile(argv[2]);
cout << l1.first << endl;
cout << l2.first << endl;
cout << comparepermutation(l1.second, l2.second) << endl;
} else {
cout << "Not enough arguments: " << argc << endl;
}
return 0;
}
This program calls readfile and comparepermutation to find out whether the distance between the two corpora is significant. Other versions return the distances or the top features. They're very similar. This is the simplest so that's what I went with.
Anyway, if you have comments or questions, I will try to answer them. I am always happy to find bugs in this code.