Pages: 1 2 3 4 5 6 7 8 9 10 11 ... 59 >>

12/03/10

Permalink 10:17:03 am, 116 words
Categories: Games

Half-Life 2 is a nightmare

I finished Half-Life 2 about a month ago and thought about writing a review. But every time I started the first sentence was “Half-Life 2 is a first-person shooter, but I thought it had too much shooting. Also the first-person perspective was over-used.” So I am a uncomfortable writing a review: FPS is not a genre I like and my review might turn into a rant about a genre I don’t like. Not that other reviews turn into rants about genres I *do* like (see: my reviews of FFX and DQ4).

So I’ll just say that Half-Life 2 is the game realisation of every nightmare I’ve had. Except for the one with spiders oh wait they got that too.

07/03/10

Permalink 01:40:33 pm, 1452 words
Categories: Code, Linguistics

Bing Maps API

Recently my advisor suggested that I compare my results to travel distance between towns instead of straight-line geographical distance. That seems like a good idea—if there’s a huge lake and a mountain in between two towns, the languages are likely to be very different. The problem is taking the time to gather all the distances; it’s not a big deal to identify 30 or so sites on Google Maps, but when you need to find the 435 distances between all 30 sites, you probably want a script to do it.

The obvious choice for this is an API to the mapping service you were going to use in the first place. For me, that led to a couple of choices:

Google Maps vs Bing Maps

Google Maps requires me to set up a web page (publicly accessible) and make a visible embedded map, then manipulate it with Javascript. This is really intended for mashups I think. I was disappointed because I prefer to let my knowledge of HTML and Javascript rot and moulder. (I found some discussion that indicates there is no SOAP API, although it’s still possible that I just missed it.)

In contrast, the 3rd Google result for “google maps soap api” is for Bing Maps, admittedly under the previous branding “Virtual Earth". The SOAP API that allows you to pretend (given a Proper Library) that you are calling a function that magically gives distances between two points. The Proper Library manages all the XML conversion/parsing and sending over the Internet. So Bing Maps won this one, because I just want to import a library and call a function from a script instead of embedding the whole thing in a web page. This led to a second choice:

Programming Language

I momentarily considered Haskell, because it does have a lot of libraries listed in Hackage. But it’s pretty weak on the XML front and its lone SOAP library is sort of just a text wrapper around XML that you have to build yourself. No XML conversion or parsing, just the sending-over-Internet part.

From there I went to Python. I quickly found that Dive Into Python has a chapter on SOAP, but it’s 6 or 7 years out of date, and SOAPpy, the library it uses, to has been replaced by ZSI, “Zolera Soap Infrastructure". I distrust these companies that start with Z (*cof* Zope *cof* Zend) and the source was package in a .egg (or maybe a .muffin? no it was definitely .egg) which I don’t even know how to install and it was all just Fear-Uncertainty-Doubt.

The real reason that I wasn’t enthusiastic about installing a Python library is that (1) I probably will never use SOAP again and (2) in the Bing Maps documentation, it mentions that Visual Studio has SOAP libraries installed with it by default. And all the examples are in C#, so it becomes a matter of cut-and-paste. And Visual Studio has a wizard for generating the proxy objects you need in a static language. (Imagine that, Visual Studio having a wizard!) Well, the wizard was the last straw—I mean who can resist all that code the wizards generate for you? Plus it was getting late and I just wanted to get something working. I even resisted the temptation to write the code in F#, because F# is extremely short on wizards right now.

(The code would have been a lot better in F#, though. It has tuples and pattern matching and it’s hard to write a program these days that doesn’t use one or both of those features. You’ll see when I show you the code.)

So let’s see. Here’s the gist of my workflow. I went to Bing Maps and identified all 30 of the interview sites. Then I dumped to some format called KML. Now, Bing Maps doesn’t make it obvious how to give somebody a link to these sites, but the option to dump to KML is front and centre. I suspect that it started life as a desktop product. Anyway, it’s annoying for directions but it’s perfect for my purposes. (Hint: click the icon of a letter in the bottom-left, even if you just want a link to the map.)

I did grep coordinates swedia.kml >pbcopy and pasted that in to an emacs buffer. From there I recorded some macros to produce a Python literal of a list of pairs. With the list loaded in the REPL, I added on the location names to each location, something like this:

[(site,lat,long) for (site,(lat,long)) in zip(consts.swediaSites, coords)]

At this point I had a list in Python whose elements looked like this:

("Ankarsrum", 16.33, 57.70)

Then I pasted the list back into the emacs buffer and added some noise around each line to make it valid C#:

sites.Add(new Loc("Ankarsrum", 16.33, 57.70));

I should have just written a string format comprehension in Python but I was really tired by this point.

This, of course assumes a struct named Loc and a List<Loc> called sites. But that’s easy enough to do in C#. Pardon my outmoded accent—the machines I coded for at Bing had not yet updated to .NET 3.5 so I didn’t really learn a lot of 3.5 conveniences fluently.

    struct Loc
    {
        private string _site;
        private double _lat;
        private double _long_;
        public String Site { get { return _site; } }
        public double Lat { get { return _lat; } }
        public double Long { get { return _long_; } }
        public Loc(String site, double lat, double long_)
        {
            _site = site;
            _lat = lat;
            _long_ = long_;
        }
    }
    class Program
    {
        private static List<Loc> sites = new List<Loc>(30);
        private static void init()
        {
            sites.Add(new Loc("Ankarsrum", 16.331280825605035, 57.700431990646344));
            // ... 29 more of these ...
        }

OK, so that’s the boring setup. Then I needed a pairwise iteration over a list. This is a lot easier with linked lists and type inference—managing indices is fiddly and the type annotations get unwieldy so I said Forget It and wrote some C-like code instead. I didn’t want to fiddle with it to get it right. (But it is pretty ugly).

        private static IEnumerable<int[]> pairwise(int n)
        {
            for (int i = 0; i < n; i++)
            {
                for (int j = i + 1; j < n; j++)
                {
                    yield return new int[] { i, j };
                }
            }
        }

Finally, the code in Main is not all that long as badly written C# goes:

        static void Main(string[] args)
        {
            init();
            var req = new BingMaps.RouteRequest();
            req.Credentials = new BingMaps.Credentials();
            req.Credentials.ApplicationId = 
              "Very-Long-String-You-Get-By-Registering-With-Bing";
            BingMaps.Waypoint[] route = new BingMaps.Waypoint[2];
            req.Waypoints = route;
            var client = new BingMaps.RouteServiceClient(
              "BasicHttpBinding_IRouteService");

            foreach (int[] indices in pairwise(30))
            {
                var from = sites[indices[0]];
                var to = sites[indices[1]];
                route[0] = new BingMaps.Waypoint()
                { 
                    Description = from.Site,
                    Location = new BingMaps.Location() { 
                      Latitude = from.Long, Longitude = from.Lat 
                    }
                };
                route[1] = new BingMaps.Waypoint()
                {
                    Description = to.Site,
                    Location = new BingMaps.Location() { 
                      Latitude = to.Long, Longitude = to.Lat 
                    }
                };
                var d = client.CalculateRoute(req).Result.Summary.Distance;
                Console.WriteLine(string.Format("(\"{0}\", \"{1}\"): {2},", 
                  from.Site, to.Site, d));
            }
        }

I added all the type inference it could handle and shortened some of the ridiculous example names. The API is still pretty verbose but I think that’s because it’s SOAP. Let’s see, you have to

  1. Create a request.
  2. Create new credentials and give it your personal key.
  3. Create a new route and add the route to the request.
  4. Create a new Route Service client. (There are several other services, geocoding for example.)
  5. Inside the loop, set each point in the route.
  6. Also inside the loop, call client.CalculateRoute, passing in the request. This is the actual SOAP call that sends XML across the wire and back.
  7. Print out the distance of the route summary.

If you specify two points that do not have a connecting route, then you get a hilarious null pointer exception somewhere along the chain CalculateRoute().Result.Summary.Distance. It seems to me that this should be an exception, but no, just null.

Really, though, the only reason that I got null pointer exceptions is that KML stores its points in Longitude, Latitude order, while the rest of the world uses Latitude, Longitude order. So instead of two Swedish villages, I requested the distance between two points in the Arab Sea. The alert reader will notice the switcheroo I have to pull to make this work. I should have changed the Loc constructor but I didn’t think of that until just now.

The code takes about 100 seconds to get all 435 distances. It’s faster than I expected, so I guess most of the lag in the web interface is the browser drawing the route.

In conclusion, this probably won’t be useful to anybody else, but at least maybe you’ll think it’s cool to see how you can easily get driving distances between all unique pairs of sites.

03/03/10

Permalink 02:43:15 pm, 1007 words
Categories: Code, Linguistics

Reading and writing consensus trees in Haskell

I use the consensus tree code I wrote about the last two times in my dissertation to find agreement between clusters of Swedish dialect interviews. The input is cluster trees from R (the open source S) and the output is formatted for use by the qtree Latex package. I also need to filter out cluster trees that contain non-significant distances—all the input distances should be statistically significant.

main = do
  sigs <- withFileLines findSigs "sig-10-1000-interview.csv"
  interactFiles (withFileLines (makeConsensusTree sigs & list)) qtree

makeConsensusTree uses the list of significant files and qtree converts a Tree to a String, suitable for output to stdout. But first, findSigs:

findSigs =
  tail
  & map (replace ',' ' ' & words & tail
         & zip ["path", "trigram", "dep", "psg", "grand", "unigram", "all"]
         & filter (snd & (=="0"))
         & map fst)
  & zip ["r", "r_sq", "kl", "js", "cos"]
  & concatMap (uncurry (zip . repeat))
  & Set.fromList

(Reminder: (&) == flip (.)) findSigs is a hoky CSV parser. It drops the column heads (tail), then chops each line into fields (map (replace ',' ' ') & words & tail), assuming that they don't contain spaces so that it can use words to do the chopping.

Then it zips the hard-coded column heads back on, and throws away fields that aren't (=="0")—that is, feature sets that have more than 0 non-significant distances.

For each line, the significant feature sets are zipped with their with distance measure, and a hilarious and incomprehensible concatMap call turns them into a list of pairs like [("path","r"), ("trigram", "r"), ("path", "js") ...].

I know this is statically typed, but I don't think anybody would hesitate to call this ad-hoc mess a script. It was written in about 5 minutes for a specific file and will break as soon as that file changes.

Moving on.

makeConsensusTree sigs =
  filter (isPrefixOf "Cluster: ")
  & map rReader
  & filter (fst & (`Set.member` sigs))
  & map (snd & buildRTree)
  & consensus

makeConsensusTree is the top-level function for reading in the different dendrograms and writing out a consensus tree. Its first task is to read Yet Another Tree Format, this time R's binary dendrogram format. Each tree is on a line that starts with "Cluster: " (hence, filter (isPrefixOf "Cluster: "). Then the measure and feature set follows, and finally a list of numbers. If you divide those numbers in half (which rReader does), you get two columns.

rReader line = 
  ((measure,features),
   splitAt (floor (fromIntegral (length ns) / 2)) ns |> uncurry zip)
  where (_:measure:features:ss) = words line
        ns = map read ss

For example, if you have the line

Cluster: r all -2 1 -4 2 -4 -9 -8 3

The numbers are interpreted as the following two columns:

1.-2-3
2.1-9
3.-4-8
4.23

buildRTree further interprets these numbers as a binary tree: negative numbers are leaf nodes, (negative) indices into the interview sites or whatever the things are that you're clustering. Positive numbers are internal nodes, indices that refer to a previously created node. The index it refers to is the step at which the node was created. So, in the above example, internal node 1 has two leaf children: values 2 and 3. Internal node 2 has two children. One is node 1 and the other is a leaf (with value 9). The last node, 4, is the root. For the tiny example above, the Lisp representation of the tree is:

'(((2 3)
   9)
  (4 8))
label (Leaf a) = a
label (Node a _) = a
buildRTree tuples = last nodes
  where nodes = map buildNode tuples
        buildNode (i,j) = Node ('$':label inode++label jnode) [inode,jnode]
          where (inode,jnode) = (buildChild i, buildChild j)
        buildChild i | i < 0 = Leaf $ swediaSites !! (-i-1)
                     | otherwise = nodes !! (i-1)

The way that buildRTree implements this is kind of clever; it has to refer to previous parts of the node list as the list is being built. In Haskell, laziness is the most natural way to do this. So when buildChild implements the negative/positive → leaf/non-terminal decision, it refers to nodes for the non-terminal case. But nodes is just map buildNode and buildNode first calls buildChild and oh-no-circular-definition!

But it turns out all right, because, assuming that R has properly dumped the node indices, buildChild's i will always be less than the length of nodes at the time that it's required.

(Of course, you should always avoid (!!) and this particular use is egregiously O(n2) blah blah, but remember once again that this is a script and that I happen to know that I'll never run this on trees with more than 100 terminals. And that I'm willing to wait for up to 5 minutes.)

qtree (Leaf a) = Set.findMax a
qtree (Node a []) = Set.findMax a
qtree (Node a kids) = "[. {" ++ left ++ "} " ++ right ++ " ]"
  where (leaves,nodes) = partition leaf kids
        left = intercalate "\\\\" (map qtree leaves)
        right = unwords (map qtree nodes)

Finally, qtree converts a tree into Yet Another Tree Format, this time the qtree Latex package. I can then paste the string into my dissertation. It's pretty straightforward except for the abuse of Set.findMax—it's not finding anything since there's only one item left in the set anyway. I just don't have a way to destructure sets.

Well, now you've seen how messy script code can be in Haskell. I think it's a pretty good argument that static typing doesn't buy you much up front—I can be as messy and ugly as I want in either Python or Haskell. All you can hope for is that the type system gets out of your way while you're doing it. The win comes in six months when I decide I need to make the code more robust or (more likely) add a tiny feature. Haskell is more discoverable out of the box—no type documentation needed: the compiler figures it out for you—and more reassuring to make changes—less need for tests: many breaking changes will cause the compiler to complain.

In case you notice a tacit unwillingness to write documentation and tests for my script code, you got me. Yup. I wrote tests for some of it, but after the code was done. And documentation? I'm writing a dissertation! It's pages and pages of high-level documentation! (Just don't expect any commentary on actual code.)

01/03/10

Permalink 11:30:27 am, 1177 words
Categories: Code, Linguistics

Consensus trees implemented in Haskell, imperatively

Last time I showed a functional implementation of consensus trees that ended up having to pass state around. It was kind of ugly. Fortunately, this pattern is common in Haskell, so there is a wrapper for it in the form of the State monad.

If you haven’t used the State monad before, it gives you one (1) variable—it may not sound like much, but the number was zero (0) before. All state-changing functions implicitly operate on this variable. This makes your code look a lot like Perl code (where the one variable is named $_, if you ever bother naming it). I apologise in advance if this gives you fever or convulsions.

The only code that changes is consensus, buildNode and buildTree, so you’ll have to refer to the previous post for the rest of the code. Here are the functional originals:

consensus trees = majority (map root trees) |> buildTree |> fst
buildTree [span] = (Leaf span, [])
buildTree (span:ranks) = let (kids, rest) = buildNode span ranks in
                         first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree (span:rest))
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

Before changing this code to use the State monad, let’s get started with some utility code. I think it’ll help you understand State if you haven’t worked with it before.

isStackEmpty = return . (==[]) =<< get
peek = return . head =<< get
pop = do
  first:rest <- get
  put rest
  return first
push x = put . (x:) =<< get
ifM cond seq alt = do
  b <- cond
  if b then seq else alt

If you remember from last time, we were passing around a tuple. The second element of the tuple always holds the remaining spans that are potential children for the current span. Here, that list of spans is stuffed into the State monad. Then, the State accessor functions like get and put let you mutate this list.

Internally, I believe that State is doing exactly the same tuple-passing trick as the original code. At least, that’s the way it’s always been explained to me. I suppose it could be creating a Real Mutable Cell for me. It would come out to the same thing.

Anyway, isStackEmpty is representative: get the span list, check to see if it’s empty, and return. peek, push and pop are variants on the same thing.

ifM just embeds if in a monad. Not too much to say about it.

OK, let’s port that code to the State monad, starting from the top down. consensus doesn’t change much. It runs evalState instead of fst, and that’s about it. evalState takes two arguments: a function to run in the State monad, and the initial value of the state variable. It returns the value of the function and throws away the state variable. (Conversely, execState throws away the value of the function and returns the state variable.)

consensus trees = evalState (buildTree span) ranks
  where (span:ranks) = majority (map root trees)

Next up is buildTree:

buildTree span =
  ifM isStackEmpty
    (return (Leaf span))
    (do
       kids <- buildNode span
       nodes <- mapM buildTree kids
       return (Node span nodes))

In buildTree, pattern matching goes away but the intent of isStackEmpty is the same. The true branch is also basically the same. However, the false branch is markedly simpler and the order of evaluation is clear: first find the current span’s kids, then build the nodes for each of the kids, then put them together in a new Node. All the shenanigans with first and foldl go away, because mapM gets the State monad to do all the tuple mangling in the background.

Of course, if you like putting everything on one line (like me), you can do that too…after all, do is just sugar for a bunch of higher order functions.

buildTree span =
  ifM isStackEmpty
    (return (Leaf span))
    ((return . Node span <=< mapM buildTree) =<< buildNode span)

Note that, as ugly as the one-liner is, if it weren’t monadic, it would be

(return . Node span . mapM buildTree) $ buildNode span

Haskell just forces us to mark compose and apply differently if they’re monadic or not. ((.) vs (<=<) and ($) vs (=<<)). This means that you have to learn two versions of some functions and know when to use them. On the plus side, you can be very sure where you are accessing state.

Whether this is important is a philosophical/cultural question more than a linguistic one: we’re edging into Sapir-Whorf territory here. For comparison, when writing C, I always know when I’m referring to something by reference versus by value because I have to very carefully mark pointers. There are even two syntaxes for attribute access (x.y vs p->y). Again, whether this is important should be decided mostly by your problem—it’s not possible to say whether marking by-reference is universally more important than marking state changes.

OK, end of digression. Let’s see how buildNode changes:

buildNode span =
  ifM isStackEmpty
    (return [])
    (ifM (return . (`Set.isSubsetOf` span) =<< peek)
         (do
           rank <- pop
           span' <- buildNode (span `Set.difference` rank)
           return (rank:span'))
         (do
           rank <- pop
           span' <- buildNode span
           push rank
           return span'))

Again, the empty check and terminal case are pretty much the same. However, the subset check is more complicated than before because of more monadic composition.

As with buildTree, the order of events is much clearer. Especially in the non-subset case, it’s much clearer that you pop off the current span, look at the rest, then push it back on when the recursive call returns. The difference between the two cases is maybe not quite as clear since they don’t sit on adjacent lines.

And of course you can write more of buildNode on one line too*:

buildNode span =
  ifM isStackEmpty
    (return [])
    (ifM (return . (`Set.isSubsetOf` span) =<< peek)
         (liftM2 (:) peek (buildNode . Set.difference span =<< pop))
         (do
           rank <- pop
           span' <- buildNode span
           push rank
           return span'))

So that’s the imperative version. It’s easier to read, but I had more trouble thinking about how to write it. It was easier to stop and think about the types when I was manually passing around the state. Also, the type errors while using State seemed pretty hard to understand. Probably I haven’t used it enough to get used to them.

Next time, the stupid script wrapper that goes around the whole thing. Not particularly impressive, but I haven’t blogged any of my small Haskell “scripts” that do actual file-to-file transformations. Usually pretty simple, but just complicated enough that I want to use Haskell instead of Python.

*If you saw cond in there, you’re not alone:

t = return True
condM [] = error "Should have used t"
condM ((test,action):conditions) = ifM test action (condM conditions)
buildNode' span =
  condM [(isStackEmpty, return [])
        ,(return . (`Set.isSubsetOf` span) =<< peek,
          (liftM2 (:) peek (buildNode' . Set.difference span =<< pop)))
        ,(t, do
            rank <- pop
            span' <- buildNode' span
            push rank
            return span')]

26/02/10

Permalink 03:39:32 pm, 1580 words
Categories: Code, Linguistics

Consensus trees implemented in Haskell

An implementation of consensus trees in Haskell

I’ve used hierachical clusters for the last few years as a part of my analysis of distance measures. But they are not super reliable—small changes tend to make corpora jump from cluster to cluster. One way to fix this is to introduce noise into the data and build a consensus tree. A majority consensus tree only creates clusters that the majority of hierarchical clusters agree on.

[PDF] Example of a single hierarchical cluster dendrogram

The problem is that this is a well-understood but obscure 30-year-old concept. You know what that means: the only off-the-shelf implementations are either (1) twenty years old (2) hopelessly obfuscated variants focussing on efficiency or (3) part of some custom research code that some scientists just threw online.

Yeah, so you’re going to get (3) in this post. Actually it turns out to be pretty easy to build consensus trees, so I guess that’s why there aren’t any good implementations out there.

data Tree a = Leaf a | Node a [Tree a] deriving (Show)
root tree = Node "ROOT" [Leaf "s0", tree]
leaf (Leaf _) = True
leaf (Node _ []) = True -- irregularities in buildNode, oh well
leaf _ = False

The shenanigans start early with two representations of leaf nodes sneaking in. I should probably just standardise on the (Node _ []) representation but having a Leaf constructor around makes literals easier to type. Fighting for the side of consistency, root provides a consistent root in case NONE of the trees can agreement on a root.

consensus trees = buildTree span rest |> fst
  where (span:rest) = majority (map root trees)
majority trees = trees
               |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
               |> Map.unionsWith (+)
               |> Map.filter (>m)
               |> Map.keys
               |> sortBy (comparing (negate . Set.size))
 where m = floor (fromIntegral (length trees) / 2)
spans (Leaf a) = Set.singleton $ Set.singleton a
spans (Node a kids) = Set.insert (Set.unions $ Set.toList kidspans) kidspans
  where kidspans = Set.unions $ map spans kids

This is the first half of the code: consensus finds the majority spans and rebuilds them into a tree. Spans are fairly simple: it’s just the leaves that a particular node dominates. For example, the green node in the first tree has the span {a b c} and the green node in the second tree has the span {b c}.

spans represents spans with sets so that it can glom all the subspans together with a big Set.unions and be sure that there aren’t any dupes. The recursive case figures the spans for all kids, then unions them together, calls that the parent’s span and adds it to the kids’ spans.

Since Haskell doesn’t have list literals, the output is kind of messy, but if you convert to lists, the output from the example tree is:

[["a"]
, ["a", "b", "c"]
, ["a", "b", "c", "d"]
, ["b"]
, ["b", "c"]
, ["c"]
, ["d"]]

Now if you run that on the following three trees, you’ll get three slightly different sets.

Once you combine those sets, the question is which spans are present in the majority of trees. For example, {a b c d} is present in all three trees. But {a b} is only present in tree 1, while {b c} is present in both tree 2 and tree 3. So {b c} will be part of the consensus tree (2/3 > 50%), but {a b} won’t (1/3 < 50%).

After calling spans, majority combines everything into one map via some gymnastics:

... |> map (spans & Set.toList & flip zip (repeat 1) & Map.fromList)
    |> Map.unionsWith (+)

In other words, map each span in spans to 1. Then combine the spans into one map using (+) to combine identical entries.

Next we filter for all the spans that occur in the majority of trees and then throw away the counts (we never really cared about them anyway).

... |> Map.filter (>m)
    |> Map.keys

Finally, we sort the sets by size. For the trees above, this gives us:

{a b c d} {b c d} {c d} {a} {b} {c} {d}

Now we can rebuild the tree recursively: for each span, look at the rest of the list for spans that are subsets of the current span. For example, with {a b c d}, the children should be {b c d} and {a}. Here’s the code to do that.

buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)
buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

What we have here is a cleverly masked, hand-rolled implementation of the State monad. Err, well actually I suppose that I’m getting ahead of myself. Let me just explain the code as is and later I’ll show the “imperative” version. Let’s start with buildNode since it does the actual matching of {a b c d} with {b c d} {a}.

So: buildNode receives two things:

  1. The span (here called span) to find children for.
  2. A list of other spans. The spans are sorted by size, so if the first other span is not smaller than span, then buildNode can just continue on down the list.

And: buildNode returns two things:

  1. The children of span.
  2. The same list of other spans, but with span’s children removed.

Number 2 is important because we need to keep this list around. Here’s why: start off finding the children for {a b c d} in the list

{b c d} {c d} {a} {b} {c} {d}

That’s easy:

{b c d} {a}

OK, now find the children for {b c d}. In what list? Well, whatever’s left from the previous call:

{c d} {b} {c} {d}

This gives a new (children,rest) pair:

({c d} {b}, {c} {d})

And the remaining spans {c} {d} will be needed to find the children of {c d}. All right, back to the actual code in buildNode. The (children,rest) pair is (nil,nil) after buildNode runs out of possible child spans to search for.

buildNode _ [] = ([],[])
buildNode span (next:rest) = if Set.isSubsetOf next span
  then first (next:) (buildNode (span `Set.difference` next) rest)
  else second (next:) (buildNode span rest)

Otherwise it checks whether the next span is a child of span; that is, if it’s a subset of span. If so, then it removes that subset (with Set.difference) and conses it onto the rest of the children that buildNode has found recursively. On the other hand, if the next span isn’t a child, then buildNode recurs on the rest of the spans and conses it back on afterward.

An aside about first and second. They’re defined in Control.Arrow, which is a module I really don’t understand, but for this use, you can imagine that they are defined like this:

first f (a,b) = (f a, b)
second f (a,b) = (a, f b)

In other words, if we call first, we’re doing something to the children of span. If we call second, we’re doing something to the spans that are left to be searched. For example, the second clause of buildNode picks the head off those spans. If the head is not a subset of span, then it sticks it back on with second (next:) after the recursive call returns in the last line.

OK, now we can look at buildTree:

buildTree span [] = (Leaf span, [])
buildTree span ranks = let (kids, rest) = buildNode span ranks in
                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)

buildTree has the same inputs as buildNode: a span and the list of spans to search for. It delegates finding the child spans to buildNode and concentrates on building up the tree structure properly. Like buildNode, it returns a pair: (tree, rest).

The non-terminal case operates in two steps:(1) it finds the kids for span by calling buildNode and (2) it recursively builds the tree structure for the kids by calling buildTree.

(1) is simple enough:

let (kids, rest) = buildNode span ranks in

But (2) is pretty hairy, because every time buildTree is called, we have to capture the rest and pass it to the next call of buildTree. That means instead of map buildTree kids we have to use a fold.

                       first (Node span) (foldl buildKids ([],rest) kids)
  where buildKids (nodes,rest) span = first (:nodes) (buildTree span rest)

This particular fold iterates over kids (just like map would have done), and its fold function is basically (buildTree kid : alreadyBuiltNodes) (just like map would have been). Everything has been wrapped in a tuple: the start state is not [] but ([],rest), and the fold function has to wrap the buildTree call in a call to first. It’s just kind of ugly.

But! It works! If you load the preceding code, plus the following examples, you’ll have working consensus trees.

t1 = Node "a" [Node "b" [Node "c" [Leaf "s1", Leaf "s2"], Leaf "s3"], 
               Leaf "s4"]
t2 = Node "a" [Node "b" [Leaf "s1", Node "c" [Leaf "s2", Leaf "s3"]], 
               Leaf "s4"]
t3 = Node "a" [Leaf "s1", Node "b" [Node "c" [Leaf "s2", Leaf "s3"], 
               Leaf "s4"]]
ts = [t1,t2,t3]
-- Now try:
--> consensus ts

Of course this is Haskell, A Large Language, so there is always more than one way to do it. Next time I’ll switch buildTree and buildNode to explicitly use the State monad. You can decide if you like the short lines and frequent mutation of imperative programming.

1 2 3 4 5 6 7 8 9 10 11 ... 59 >>

powered by b2evolution free blog software