Pages: 1 2 3 4 5 6 7 8 9 10 11 ... 58 >>

01/02/10

Permalink 05:38:20 pm, 3026 words
Categories: Code, Linguistics

Implementing the Chu-Liu-Edmonds parser (Python version)

This presentation is taken from discussion from the parsing read group in the Indiana University computational linguistics group, and from the book my advisor Sandra Kuebler co-authored, Dependency Parsing. I haven’t read the rest of the book since it came out while I was in Seattle, but I can only assume that it’s a great place to learn about dependency parsing.

Once upon a time, I believed that if I could implement an algorithm, I would understand it. Grad school made me grow out of the assumption—I’ve implemented several algorithms that I didn’t understand. Still, implementing an algorithm helps me understand it better. Blogging about it improves my understanding even more. So over Christmas I implemented the Chu-Liu-Edmonds parser. It’s a parser co-developed by one of the other authors of Dependency Parsing (who, surprise surprise, now works at Google). The algorithm is very cool, and not terribly difficult to implement. However, I don’t completely understand it, and I’ve been sitting on this post for several weeks. So I think I’m just going to post what I have and if you are curious, you can read the original paper for a good explanation.

Note: if my layout and your browser conspire to make the viewable code width less than 80 characters, try viewing the complete file in a separate window: cle.py.

Second note: This is nearly the same discussion as before, but with Python code instead of Haskell code. This changes only a few things because this is a translation. In exchange, the Python is non-idiomatic at times. If this bothers you, please suggest replacements.


The Chu-Liu-Edmonds parser is a dependency parser that works in a very clever way. It frames the parsing problem as a graph problem: start with a fully connected graph and recursively remove the “right” edges until you are left with a tree. This approach allows you to take advantage of efficient graph algorithms already developed by computer scientists. No need to roll your own code. Actually, the same thing is happening in phonology as it switches from Optimality Theory (OT) to the more powerful Harmonic Grammar (HG). OT doesn’t map well to existing algorithms, so the custom ones linguists have made are not any faster than the HG ones, because HG is basically just a linear programming problem.

Anyway, the real question is what are the “right” edges to be removed. The starting edges are estimates of the likelihood of a dependency between two words. So the edge between the subject and the predicate should be much more likely than the edge between the subject and a preposition much later in the sentence. The best edges are the ones that contribute overall to the highest-likelihood tree.

So that gives us our terminal case: if the graph is acyclic, then it’s a tree, and if we have recursively removed only the least likely edges, then it’s the most likely parse. For the recursive case, where the graph has cycles, then the cycles have to be broken in the optimal way. Note: “the graph” refers to a particular greedy transformation of the graph, specified below.

The recursive case is where things get tricky: one cycle is broken at each recursive step. I understand how one cycle is broken, but I’m not sure I understand how it recursively breaks more than one. I do know that the basic idea is to:

  1. Find a cycle, any cycle.
  2. Remove all nodes of the cycle and stick in a placeholder node that says “There is a cycle here; I’ll break it once all the other cycles are broken.”
  3. Recur: find a cycle, any cycle…
  4. If you can’t find a cycle, you hit the terminal case. Return the current graph (well, the greedy transformation thereof).
  5. Now it’s time to break that cycle. Remove the least likely edge.
  6. Remove the placeholder node from the graph and put back in all the nodes from the now-broken cycle. Return.

Actually, at a high level, I think I get it. It’s the low-level that is so confusing. Let me start presenting the code and see if that helps.

I’ll start with the graph data structure. This is a naive graph implementation that closely apes the pseudocode given in the book. There is a footnote that mentions that a O(n2) bound is possible, given clever graph data structures and algorithms. This implementation is not clever. Unfortunately, since this is Python, what were type synonyms have become comments:

# type vertices = set(str)
# type edges = set((str,str))
# type weights = {(str,str):float}

This is not the typical adjacency list implementation of a graph; instead, the edges are stored as lists of pairs (fromEdge, toEdge). This makes it easier to get all edges entering a vertex as well as leaving it. But not faster, both are equally slow this way. weights stores the edges redundantly because of the way the graph is copied during the recursion; weights may contain weights for edges that are not in the current graph, but were used in an earlier version before some cycles had edges removed. (I think it’s possible to avoid this redundancy, but I haven’t convinced myself of it.)

Then there are some methods that operate on Graphs. Since I stuck everything in the same module, I’m pretty sure this is not a water-tight Graph interface, but it’s close. Fortunately, since this is Python, everything is in a self-contained class. To find the holes, grep ‘.vertices|.edges|.weights’.

class Graph():
    def __init__(self, vertices, edges, weights):
        self.vertices = vertices
        self.edges = edges
        self.weights = weights
    def __match__(self):
        return (self.vertices,self.edges,self.weights)
    def __eq__(self, other):
        return (self.vertices==other.vertices and
                self.edges==other.edges and
                self.weights==other.weights)
    def __ne__(self, other):
        return (self.vertices!=other.vertices or
                self.edges!=other.edges or
                self.weights!=other.weights)
    def __str__(self):
        return ("Graph: " + str(self.vertices) + "\n" +
                "\n".join("\t%s -> %s (%s)" % (fro,to,self.weights[fro,to])
                          for (fro,to) in self.edges))
    __repr__ = __str__
    def copy(self, vertices=None, edges=None, weights=None):
        # structure sharing doesn't work in general in Python, but
        # I never update variables passed as parameters, so I don't need
        # to copy here. I think.
        return Graph(vertices or self.vertices,
                     edges or self.edges,
                     weights or self.weights)
    def weightTo(self, wj):
        return lambda wi: self.weights.get((wi,wj), -1)
    def weightFrom(self, wi):
        return lambda wj: self.weights.get((wi,wj), -1)
    def edgeTo(self, wj):
        return lambda wi: (wi,wj) in self.edges
    def edgeFrom(self, wi):
        return lambda wj: (wi,wj) in self.edges
    @property
    def verticesW(self):
        wordvertices = set(self.vertices)
        wordvertices.remove("ROOT")
        return wordvertices
    def kids(self, w):
        return set(filter(self.edgeFrom(w), self.vertices))
    def parents(self, w):
        return set(filter(self.edgeTo(w), self.vertices))

weightTo/From and edgeTo/From are intended to be used curried, usually like you see in kids:

return set(filter(self.edgeFrom(w), self.vertices))

This gives you all of g’s vertices that the vertex w points to. (dict.get and verticesW are needed because of irregularities resulting from the way I included the ROOT in the graph.)

After the data structure, we start with the actual parser. First is a convenience function cryptically called h in the pseudocode. I think “h” stands for “parse". Yeah. Maybe it’s a German abbreviation.

def h(s, lambdas):
    return cle(Graph(set(s), set((wi,wj) for wi in s for wj in s), lambdas))

h really doesn’t tell you much except (1) how to construct a Graph given a sentence and (2) cle is the actual parser.

def cle(g):
    return decycle(g, cycle(g.copy(edges=greedyArc(g))))
def greedyArc(g):
    return set((argmax(g.weightTo(wj), g.vertices),wj) for wj in g.verticesW)

cle operates in three steps. First, it runs greedyArc, which throws out all but the most likely edge that enters each vertex, which is the definition of “most likely overall sentence", I guess because it most closely approximates a tree originating at ROOT, since ROOT has no edges entering it. cycle checks this altered graph for cycles. Then it hands the graph off to decycle, which checks for the base case—if no cycles, then the altered graph is a tree, and hence the answer. Otherwise decycle removes a cycle from the original and recurs on cle. In fact, let’s look at decycle right now. (Unlike the Haskell version, you have to read cle right-to-left to see the three steps in order.)

def decycle(g, cyc):
    if cyc is None: return g.copy(edges=greedyArc(g))
    gc,wc,ep = contract(g, list(cyc.vertices))
    g2 = cle(gc)
    oldIns = set((wi,ep[wi,wc]) for wi in g2.parents(wc))
    oldOuts = set((ep[wc,wj],wj) for wj in g2.kids(wc))
    cycleBreakers = concat(set((wk,wj) for (wk,x) in cyc.edges if x==wj)
                           for (_,wj) in oldIns)
    cycleBreaker = argmax(lambda edge: -g.weights[edge], cycleBreakers)
    edges2 = g2.edges | cyc.edges | oldOuts | oldIns
    edges2.remove(cycleBreaker)
    return g2.copy(edges=edges2)

The first parameter to decycle is the graph and the second is the cycle, if any. Like I said, if there is no cycle, then the parser is done; the answer is the most likely tree I described in the previous paragraph. Otherwise, the answer is the graph obtained by a recursive call to cle, except for one edge: the least likely edge in the set cycleBreaker.

The tricky part is in the where-clause leading up to the definition of cycleBreaker. Three things happen here:

  1. We call contract to swap out the cycle for a substitute node. This new graph is called gc and the substitute node is called wc. ep holds all the original edges to and from all the nodes we just swapped out.
  2. We call cle recursively with gc to break the rest of the cycles. cle is magic*, so the g’ that it returns is perfect except that we have to break the current cycle and put all its nodes back in.
  3. To do that, we restore edges to g’; the nodes never went anywhere; they were just unreachable from the rest of the graph. We get oldIns and oldOuts by looking them up in ep. Then, cycleBreaker is, for each cycle node with an outside edge pointing to it, all the edges that enter that node. Only these cycle edges are candidates for removal. I don’t understand why, but there it is. Maybe because removing edges from a cycle-internal node would leave some node in the cycle unreachable?

Anyway, with all this backwork done, decycle returns the cleaned-up g’, plus restoring all the edges of the cycle (edges cyc) and all the edges entering (oldIns) and leaving (oldOuts) the cycle. Except for one of the edges in cycleBreakers. The edge is the minimum edge, as specified by the edge’s weight.

OK, only one more core function to go: contract. All contract does is substitute a temporary node for a cycle and record all the information needed to put the cycle back in later. However, it’s surprisingly complicated, partly (mostly?) because the weights of the new edges have to be calculated.

def contract(g, c):
    ancestors = dict([(c[0],c[-1])] + zip(c[1:], c))
    def max2(wi):
        return lambda wj: g.weightFrom(wi)(wj) - g.weightFrom(ancestors[wj])(wj)
    outside = g.vertices - set(c)
    outsideEdges = [(x,y) for (x,y) in g.edges if x not in c and y not in c]
    wc = gensym(c)
    outsEp = [((wc,wj),argmax(g.weightTo(wj), c))
              for wj in outside if any(g.edgeTo(wj)(wi) for wi in c)]
    insEp = [((wi,wc),argmax(max2(wi), c))
             for wi in outside if any(g.edgeFrom(wi)(wj) for wj in c)]
    ep = dict(outsEp + insEp)
    score = sum(g.weights[ancestors[w],w] for w in c)
    lambdas = dict(g.weights)
    for ((wc,wj),wi) in outsEp:
        lambdas[wc,wj] = g.weights[wi,wj]
    for ((wi,wc),wj) in insEp:
        lambdas[wi,wc] = max2(wi)(wj) + score
    outside.add(wc)
    gc = Graph(outside, set(ep.keys() + outsideEdges), lambdas)
    return (gc,wc,ep)

All right, this is kind of confusing, and I’m not sure I’ve organised it in the best way either. The main return value is gc, but it takes three important pieces of information: the vertices, the edges and the edge weights. The vertices, outside, are all the vertices outside the cycle. That’s not too complicated, just g.vertices - set©. Next, outsideEdges is all edges that don’t have an endpoint that is in c. (Plus the edges in ep, more on that in a second). Finally, lambdas is complicated because of the weights of the new edges.

So the most confusing part of the function is calculation of ep and lambdas. ep contains two type of edges: outsEp, the best edge from any node in the cycle to some outside edge, and insEp is the best edge to any node in the cycle. lambdas (outsW and insW) are the weights for these edges, which, even more confusingly, are not the same as the original weights!

Let’s step back a little. The purpose of contract is to substitute the cycle with a single node so that recurring calls to cle can eliminate the rest of the cycles. So all the edges that went in and out of the cycles’ nodes now have to go in and out of that new, single node. However, what happens if we have two edges (cycle-A, outside-C) and (cycle-B, outside-C)? We definitely want an edge (substitute-AB, outside-C), but which one should we use, cycle-A’s or cycle-B’s? And what should the weight be? Note: we can’t just keep both and call them BOTH (substitute-AB, outside-C) (substitute-AB, outside-C), at least not on a normal graph.

The way that contract solves this is to choose the edge that has the highest weight. So between (cycle-A, outside-C): 20 and (cycle-B, outside-C): 30, (cycle-B, outside-C) wins. outsEp records this as the pair ((substitute-AB,outside-C), cycle-B) so that decycle can later restore the edge (cycle-B, outside-C).

Now what about the weight? Well, for edges leaving the cycle, it’s pretty simple, it’s just the original weight—(cycle-B, outside-C)’s weight in our example. For edges entering the cycle, it’s more complicated. In fact, the weight is the weight of the original edge, plus the weight of reaching every other node in the cycle. That’s where cycleWeight comes in—for insW, the final weight is the weight of the edge in, plus the weight of going all around the cycle (cycleWeight), minus the last edge’s weight. The last edge’s weight gets left out because you just need to touch every node in the cycle, which means you stop before returning to the entry point.

One last thing: because you need to subtract that last edge’s weight in insW, you have to construct a list of the predecessor to each node in the cycle. That way, given a starting point in the cycle, you can remember what the other end is without having to walk all the way around the cycle each time. That’s what ancestors does: it walks around the cycle once, storing the endpoint for every starting point in the cycle.


I’ll finish with some utilities. I won’t describe them as closely, partly because their names should tell you almost everything about them.

## specialised utilities ##
def gensym(strings):
    return "$" + ''.join(strings) # HA HA
def cycle(g):
    total = set(["ROOT"])
    seen = []
    def dfs(w):
        if w in seen:
            return g.copy(vertices=set(seen[seen.index(w):]))
        else:
            for kid in g.kids(w):
                total.add(kid)
                seen.append(w)
                kidCycle = dfs(kid)
                if kidCycle is not None:
                    return kidCycle
            return None
    cycle = dfs("ROOT")
    while cycle is None and g.vertices != total:
        seen = []
        cycle = dfs((g.vertices - total).pop())
    return cycle
## general utilities ##
def maximumBy(f, l):
    if l==[]: raise Exception("empty list")
    it = iter(l)
    bestVal = it.next()
    best = f(bestVal)
    for x in it:
        next = f(x)
        if next > best:
            best = next
            bestVal = x
    return bestVal
argmax = maximumBy
def concat(ls): # old school concat: short and eager
    return [x for l in ls for x in l]

Well, if you’re a Lisp hacker. gensym is a name mangler. I believe that it won’t cause name collisions in the presence of recursion, but I haven’t tested it. It might need a trailing symbol too.

cycle’s function is straightforward—detect and return cycles in a graph—but I wrote the algorithm myself for this funny graph representation, so it’s probably not optimal. It’s based on Cormen, Lieserson and Rivest’s depth-first-search. I don’t need Python 3’s nonlocal here because I don’t rebind either of the captured names (total and seen). Also also, I don’t use sets as widely as you might expect because it turns out that I need the cycle to be ordered in order to construct the ancestors list in contract.

maximumBy is a Haskell-esque name for argmax, but I used argmax for this code because I kinda-mostly stuck with the original names for the code. (Also it’s shorter.) It would be more modern to write concat to return an iterator, but given my experience with iterators, it would fail in at least two mysterious ways. So I used a list, like the old days. Back in 2002. Back in my day. Yup. Python 3.0 is worthless, except for the nonlocal keyword, which I didn’t even use here. You’ll see!


Please report errors and improvements in this version. This is such a long post that I’m sure I messed up something.

I don’t want to leave anything out, so here are the tests for the Python code. They just use assert since I can barely hold one xUnit in my head at once, and it’s HUnit right now.

*Recursive.

Permalink 02:50:31 pm, 1750 words
Categories: Code, Linguistics

Chu-Liu-Edmonds parser implementation : a worked example

I’ll use the example from Dependency Parsing because it’s so short that it only requires one recursion. Anything more and this explanation would go on forever. My example is “heavily based on” Dependency Parsing, which is to say, totally ripped off, except that I redrew the figures myself using Omnigraffle. (Omnigraffle is the Mac equivalent of Visio, except that it’s a pleasure to use, and kind of overpriced.) There are actually more figures than in the book because these two posts together are about four times the space that they could give Chu-Liu-Edmonds.

As before, let’s start with the data. (Of course you will probably want to download the code to follow along. It’s updated slightly so you will want to update if you got it last time.) The example sentence is “John saw Mary". This is represented as a fully connected graph, plus a fourth node ROOT that has initial weights for each word. Here’s the Haskell code and a figure:

examplelambdas = Map.fromList $ [(("John","saw"), 20),
                                 (("John","Mary"), 3),
                                 (("saw","John"), 30),
                                 (("saw","Mary"), 30),
                                 (("Mary","John"), 11),
                                 (("Mary","saw"), 0),
                                 (("ROOT","John"), 9),
                                 (("ROOT","saw"), 10),
                                 (("ROOT","Mary"), 9)]
example = Graph { vertices = Set.fromList ["ROOT", "John", "saw", "Mary"]
                , edges = Map.keysSet examplelambdas
                , weights = examplelambdas }

The desired output is a single edge from ROOT to saw, and two edges from saw to John and Mary:

outputWeights = Map.fromList [(("ROOT","saw"),10)
                             ,(("saw","John"),30)
                             ,(("saw","Mary"),30)]


All right, let’s get started. The initial call looks like this:

cle example

which leads to:

cle g = g { edges = greedyArc g } |> ChuLiuEdmonds.cycle |> decycle g

So first we find greedyArc example. Remember, for each node, this is the best incoming edge:

Unfortunately, as you can see, this leaves us with a cycle. That means that cycle returns a non-Nothing value, which means that we fail decycle’s first case, the terminal case. Instead we continue on to the recursive case.

decycle g (Just cyc) =
  g' { edges = (minimum cycleBreaker) `Set.delete`
               Set.unions [edges g',edges cyc,oldOuts,oldIns] }
  where (gc,wc,ep) = contract g (Set.toList $ vertices cyc)
        g' = cle gc
        oldIns = g' `parents` wc |> Set.map (\ wi -> (wi,ep ! (wi,wc)))
        oldOuts = g' `kids` wc |> Set.map (\ wj -> (ep ! (wc,wj), wj))
        cycleBreaker = map (snd & \ wj -> Set.filter (snd & (==wj)) (edges cyc))
                           (Set.toList oldIns) |> Set.unions |> Set.toList

Start reading with the where clause. The first thing we do is run contract to swap out the John-saw cycle for a single node. In fact, I’m going to skip contract for a second to what happens just after contract with g’ = cle gc. Just take it on faith for now that when contract returns gc, it looks like this:

I hope that the vertices of that graph are not too surprising—John and saw are gone, replaced by $Johnsaw—I’ll explain the edge weights when we work through contract. In any case, this graph is passed to cle, starting the second recursion.

The first thing that cle does is to find greedyArc g (remember, the best incoming edge for each node) and see if it has any cycles:

Oh happy day! No cycle! We’re done! We found the answer! We can return! We’re done!

Oh, wait. Except that $Johnsaw is not a real node. We have to undo the substitution and put John and saw back in, plus eliminate the cycle that they form. But besides that we’re done.

OK, now I have to go back and explain contract, because it’s difficult to understand how oldIns and oldOuts help re-substitute John and saw without understanding the mapping ep that contract returns.

So. contract. Let’s take it one line at a time. And there are a lot of lines to take on.

contract g c = (gc,wc,ep)
  where gc = Graph (wc `Set.insert` outside)
                   (outsideEdges `Set.union` Map.keysSet ep)
                   lambdas
        wc = gensym c
        outside = vertices g `Set.difference` Set.fromList c
        outsideEdges = Set.filter (\ (x,y) -> not (x `elem` c || y `elem` c))
                                  (edges g)
        -- ep => endpoints. It's a mapping from arcs to
        -- the new virtual arc back to the node that it originally pointed at
        ep = Map.fromList (insEp ++ outsEp)
        lambdas = Map.unions [weights g, Map.fromList outsW, Map.fromList insW]
        outsEp = [((wc,wj),argmax (g `weightTo` wj) c)
                 | wj <- Set.toList outside, any (g `edgeTo` wj) c]
       insEp = [((wi,wc),argmax (maxCycle wi) c)
                | wi <- Set.toList outside, any (g `edgeFrom` wi) c]
        outsW = [((wc,wj), weights g ! (wi,wj)) | ((wc,wj),wi) <- outsEp]
        insW = [((wi,wc), maxCycle wi wj + cycleWeight) | ((wi,wc),wj) <- insEp]
          where cycleWeight = sum [weights g ! (ancestors ! w, w) | w <- c]
        maxCycle wi wj = weightFrom g wi wj - weightFrom g (ancestors ! wj) wj
        ancestors = Map.fromList $ zip (tail $ Prelude.cycle c) c

gc is the eventual return value, along with wc and ep. But gc is a data structure of 3 other things which have to be looked at separately. (wc is pretty simple; just remember that gensym = concat & (’$':), so concat ["John", “saw"] = “Johnsaw” and then (’$':) “Johnsaw” = “$Johnsaw”.) outside and outsideEdges produce the following graph:

However, this doesn’t include the new node or the edges into it. Before we can find those, though, we need to store the old edges. For the outward edges, outsEp, we have competition only for the edges to Mary: (John → Mary, 3) and (saw → Mary, 30). So (saw → Mary, 30) wins and is stored in a outsEp as ($Johnsaw → Mary, saw). outsW then copies the old edge weight so that the new edge will be ($Johnsaw → Mary, 30). Later this new edge will be incorporated into gc in the line

lambdas = Map.unions [weights g, Map.fromList outsW, Map.fromList insW]

insEp is tricker, as is insW. Both ROOT and Mary have a choice of incoming edge, one each to John and saw. Let’s look at ROOT first. The weight of the incoming edge to the new substitute node should be the weight of the original incoming edge, plus the weight to touch the rest of the nodes in the cycle. The complete cycle weight is 30+20=50. So: (ROOT → John, 9)’s insW mapping would be 9 + 50 - (saw → John): 9 + 50 - 30 = 29. Or you could look at it as (ROOT → John, 9) + (John → saw, 20). But that would require us to walk all-but-one-node around the cycle each time; it’s faster to walk around it once, record the total, and after that just subtract the weight of not walking that last edge. Compare this to (ROOT → saw,10), which gives 10+50-20=40. The winner is (ROOT → saw), which insEp copies as (ROOT → $Johnsaw, saw) and insW records as (ROOT → $Johnsaw, 40).

cycleWeight, maxCycle and ancestors are helper functions to allow the pre-walk of the cycle and storage of the ancestor edges to be subtracted. For this example, ancestors is pretty small: [(John → saw), (saw → John)]. Then maxCycle returns the weight of entering the cycle minus the previous edge’s weight. So for the (ROOT → John, 9), maxCycle returns 9 - 30 = -27.

The reason that maxCycle doesn’t include cycleWeight (50 in this example) is that cycleWeight is constant in contract, and maxCycle is used in argMax where a constant does nothing but slow you down. (Although a smart compiler would remove it, I suppose.)

OK. I’ve forgotten completely what’s happening outside contract, so let me remind you of what contract returns to decycle. The graph is

Plus, it returns wc = “$Johnsaw” and ep =

  1. incoming: [(ROOT → $Johnsaw, saw), (Mary $rarr; $Johnsaw, John)]
  2. outgoing: [($Johnsaw → Mary, saw)]

(Remember that ROOT has no incoming edges.)

So. Done with contract, we are now back on the line

(gc,wc,ep) = contract g (Set.toList $ vertices cyc)

Now we continue recur with g’ = cle gc, find that we’re done, etc etc. We already covered this. We’ve got this:

and we need to restore $Johnsaw before we are done. Here is where oldIns, oldOuts and cycleBreaker come in. To figure oldIns, we look at all edges coming in to $Johnsaw. There’s only one: ROOT. So we look up (ROOT → $Johnsaw) in ep and find that it maps to saw. So we should restore the edge (ROOT → saw).

Next we look at outgoing edges. Again, there’s only one, to Mary. We look up ($Johnsaw → Mary) and find that we should restore (saw → Mary).

Now we put back in the original nodes and here’s what we have:

This is great. Just great. Except the cycle is there! Look: (John → saw) and (saw → John). So we need to fix that before we’re done. That’s where cycleBreaker comes in:

cycleBreaker = map (snd & \ wj -> Set.filter (snd & (==wj)) (edges cyc))
                   (Set.toList oldIns) |> Set.unions |> Set.toList

In cycleBreaker, we construct edges in the cycle that could be removed. First of all, you can’t remove an edge leaving a cycle node that isn’t pointed to from the outside. Look at (saw → John): if you removed that, there’d be no way to get from ROOT down to John. So you have to look only at the nodes pointed to from the outside (the endpoints of oldIns) and then find the cycle edge entering that node.

For our example, there’s only one item in oldIns: (ROOT → saw). So mapping snd over it calls the lambda on only saw. Then we filter the edges of the cycle and find which one points to saw (ie the one whose (snd & (=="saw")). That produces (John → saw).

Finally, the lowest-weight edge from Set.unions cycleBreaker is removed. In our case, cycleBreaker is the Haskell value [Set.fromList [(John, saw)]]. So you can see that minimum is not right since that finds the minimum edge as tuple, not the minimum-weight edge. In fact, we should do something like argmax (negate . (weights g !)).

Minutes later: Yes, that works. You should probably change your code accordingly.

And we’re done! This is the desired output and the algorithm has finished.


Post-script on cycle

I was going to run an example through cycle but I don’t feel like it. I leave it as an exercise for the reader. Or better yet, replace my cycle with a better one.

Instead, here is my test code. It requires HUnit, but even if you don’t have that installed, there is lots of code dedicated to testing cycle because I wasn’t sure if it would work.


Next time: Python version with identical commentary ripped straight from the Haskell version! I will see which post gets more hits!

29/01/10

Permalink 01:55:25 pm, 2799 words
Categories: Code, Linguistics

Implementing the Chu-Liu-Edmonds parser

This presentation is taken from discussion from the parsing read group in the Indiana University computational linguistics group, and from the book my advisor Sandra Kuebler co-authored, Dependency Parsing. I haven’t read the rest of the book since it came out while I was in Seattle, but I can only assume that it’s a great place to learn about dependency parsing.

Once upon a time, I believed that if I could implement an algorithm, I would understand it. Grad school made me grow out of the assumption—I’ve implemented several algorithms that I didn’t understand. Still, implementing an algorithm helps me understand it better. Blogging about it improves my understanding even more. So over Christmas I implemented the Chu-Liu-Edmonds parser. It’s a parser co-developed by one of the other authors of Dependency Parsing (who, surprise surprise, now works at Google). The algorithm is very cool, and not terribly difficult to implement. However, I don’t completely understand it, and I’ve been sitting on this post for several weeks. So I think I’m just going to post what I have and if you are curious, you can read the original paper for a good explanation.

Note: if my layout and your browser conspire to make the viewable code width less than 80 characters, try viewing the complete file in a separate window: ChuLiuEdmonds.hs.


The Chu-Liu-Edmonds parser is a dependency parser that works in a very clever way. It frames the parsing problem as a graph problem: start with a fully connected graph and recursively remove the “right” edges until you are left with a tree. This approach allows you to take advantage of efficient graph algorithms already developed by computer scientists. No need to roll your own code. Actually, the same thing is happening in phonology as it switches from Optimality Theory (OT) to the more powerful Harmonic Grammar (HG). OT doesn’t map well to existing algorithms, so the custom ones linguists have made are not any faster than the HG ones, because HG is basically just a linear programming problem.

Anyway, the real question is what are the “right” edges to be removed. The starting edges are estimates of the likelihood of a dependency between two words. So the edge between the subject and the predicate should be much more likely than the edge between the subject and a preposition much later in the sentence. The best edges are the ones that contribute overall to the highest-likelihood tree.

So that gives us our terminal case: if the graph is acyclic, then it’s a tree, and if we have recursively removed only the least likely edges, then it’s the most likely parse. For the recursive case, where the graph has cycles, then the cycles have to be broken in the optimal way. Note: “the graph” refers to a particular greedy transformation of the graph, specified below.

The recursive case is where things get tricky: one cycle is broken at each recursive step. I understand how one cycle is broken, but I’m not sure I understand how it recursively breaks more than one. I do know that the basic idea is to:

  1. Find a cycle, any cycle.
  2. Remove all nodes of the cycle and stick in a placeholder node that says “There is a cycle here; I’ll break it once all the other cycles are broken.”
  3. Recur: find a cycle, any cycle…
  4. If you can’t find a cycle, you hit the terminal case. Return the current graph (well, the greedy transformation thereof).
  5. Now it’s time to break that cycle. Remove the least likely edge.
  6. Remove the placeholder node from the graph and put back in all the nodes from the now-broken cycle. Return.

Actually, at a high level, I think I get it. It’s the low-level that is so confusing. Let me start presenting the code and see if that helps.

I’ll start with the graph data structure. This is a naive graph implementation that closely apes the pseudocode given in the book. There is a footnote that mentions that a O(n2) bound is possible, given clever graph data structures and algorithms. This implementation is not clever.

type Vertices = Set.Set String
type Edges = Set.Set (String,String)
type Weights = Map.Map (String,String) Double
data Graph = Graph { vertices :: Vertices
                   , edges :: Edges
                   , weights :: Weights } deriving (Eq)

This is not the typical adjacency list implementation of a graph; instead, the edges are stored as lists of pairs (fromEdge, toEdge). This makes it easier to get all edges entering a vertex as well as leaving it. But not faster, both are equally slow this way. weights stores the edges redundantly because of the way the graph is copied during the recursion; weights may contain weights for edges that are not in the current graph, but were used in an earlier version before some cycles had edges removed. (I think it’s possible to avoid this redundancy, but I haven’t convinced myself of it.)

Then there are some functions that operate on Graphs. Since I stuck everything in the same module, I’m pretty sure this is not a water-tight Graph interface, but it’s close.

weightTo g wj wi = (wi,wj) `Map.lookup` weights g |> fromMaybe (-1)
weightFrom g wi wj = (wi,wj) `Map.lookup` weights g |> fromMaybe (-1)
edgeTo g wj wi = (wi,wj) `Set.member` edges g
edgeFrom g wi wj = (wi,wj) `Set.member` edges g
verticesW = vertices & Set.delete "ROOT" -- just the word vertices
verticesL = vertices & Set.toList -- convert to list
kids g w = Set.filter (g `edgeFrom` w) (vertices g)
parents g w = Set.filter (g `edgeTo` w) (vertices g)

weightTo/From and edgeTo/From are intended to be used curried, usually like you see in kids:

Set.filter (g `edgeFrom` w) (vertices g)

This gives you all of g’s vertices that the vertex w points to. (fromMaybe (-1) and verticesW are needed because of irregularities resulting from the way I included the ROOT in the graph.)

After the data structure, we start with the actual parser. First is a convenience function cryptically called h in the pseudocode. I think “h” stands for “parse". Yeah. Maybe it’s a German abbreviation.

h :: [String] -> Weights -> Graph
h s lambdas = cle (Graph (Set.fromList s)
                         (Set.fromList $ liftM2 (,) s s)
                         lambdas)

h really doesn’t tell you much except (1) how to construct a Graph given a sentence and (2) cle is the actual parser.

cle :: Graph -> Graph
cle g = g { edges = greedyArc g } |> ChuLiuEdmonds.cycle |> decycle g
greedyArc :: Graph -> Edges
greedyArc g = Set.map (\ wj -> (argmax (g `weightTo` wj) (verticesL g),wj))
                      (verticesW g)

cle operates in three steps. First, it runs greedyArc, which throws out all but the most likely edge that enters each vertex, which is the definition of “most likely overall sentence", I guess because it most closely approximates a tree originating at ROOT, since ROOT has no edges entering it. cycle checks this altered graph for cycles. Then it hands the graph off to decycle, which checks for the base case—if no cycles, then the altered graph is a tree, and hence the answer. Otherwise decycle removes a cycle from the original and recurs on cle. In fact, let’s look at decycle right now. By the way, if you’re not used to reading my code, (|>) is flip $ and (&) is flip (.). They facilitate left-to-right reading.

decycle :: Graph -> Maybe Graph -> Graph
decycle g Nothing = g { edges = greedyArc g }
decycle g (Just cyc) = -- Note: should Set.minBy weight not edge's alphabet sort
  g' { edges = (Set.findMin $ Set.unions cycleBreaker) `Set.delete`
               Set.unions [edges g',edges cyc,oldOuts,oldIns] }
  where (gc,wc,ep) = contract g (Set.toList $ vertices cyc)
        g' = cle gc
        oldIns = g' `parents` wc |> Set.map (\ wi -> (wi,ep ! (wi,wc)))
        oldOuts = g' `kids` wc |> Set.map (\ wj -> (ep ! (wc,wj), wj))
        cycleBreaker = map (snd & \ wj -> Set.filter (snd & (==wj)) (edges cyc))
                           (Set.toList oldIns)

The first parameter to decycle is the graph and the second is the cycle, if any. Like I said, if there is no cycle, then the parser is done; the answer is the most likely tree I described in the previous paragraph. Otherwise, the answer is the graph obtained by a recursive call to cle, except for one edge: the least likely edge in the set cycleBreaker.

The tricky part is in the where-clause leading up to the definition of cycleBreaker. Three things happen here:

  1. We call contract to swap out the cycle for a substitute node. This new graph is called gc and the substitute node is called wc. ep holds all the original edges to and from all the nodes we just swapped out.
  2. We call cle recursively with gc to break the rest of the cycles. cle is magic*, so the g’ that it returns is perfect except that we have to break the current cycle and put all its nodes back in.
  3. To do that, we restore edges to g’; the nodes never went anywhere; they were just unreachable from the rest of the graph. We get oldIns and oldOuts by looking them up in ep. Then, cycleBreaker is, for each cycle node with an outside edge pointing to it, all the edges that enter that node. Only these cycle edges are candidates for removal. I don’t understand why, but there it is. Maybe because removing edges from a cycle-internal node would leave some node in the cycle unreachable?

Anyway, with all this backwork done, decycle returns the cleaned-up g’, plus restoring all the edges of the cycle (edges cyc) and all the edges entering (oldIns) and leaving (oldOuts) the cycle. Except for one of the edges in cycleBreaker. The edge is the minimum edge, which right now is the alphabetically minimum one. That’s wrong, it should be minBy the weight not the edge’s endpoint’s names. I haven’t had time to fix it yet.

OK, only one more core function to go: contract. All contract does is substitute a temporary node for a cycle and record all the information needed to put the cycle back in later. However, it’s surprisingly complicated, partly (mostly?) because the weights of the new edges have to be calculated.

contract :: Graph -> [String] -> (Graph,String,Map.Map (String,String) String)
contract g c = (gc,wc,ep)
  where gc = Graph (wc `Set.insert` outside)
                   (outsideEdges `Set.union` Map.keysSet ep)
                   lambdas
        wc = gensym c
        outside = vertices g `Set.difference` Set.fromList c
        outsideEdges = Set.filter (\ (x,y) -> not (x `elem` c || y `elem` c))
                                  (edges g)
        -- ep => endpoints. It's a mapping from arcs to
        -- the new virtual arc back to the node that it originally pointed at
        ep = Map.fromList (insEp ++ outsEp)
        lambdas = Map.unions [weights g, Map.fromList outsW, Map.fromList insW]
        outsEp = [((wc,wj),argmax (g `weightTo` wj) c)
                 | wj <- Set.toList outside, any (g `edgeTo` wj) c]
        insEp = [((wi,wc),argmax (maxCycle wi) c)
                | wi <- Set.toList outside, any (g `edgeFrom` wi) c]
        outsW = [((wc,wj), weights g ! (wi,wj)) | ((wc,wj),wi) <- outsEp]
        insW = [((wi,wc), maxCycle wi wj + cycleWeight) | ((wi,wc),wj) <- insEp]
          where cycleWeight = sum [weights g ! (ancestors ! w, w) | w <- c]
        maxCycle wi wj = weightFrom g wi wj - weightFrom g (ancestors ! wj) wj
        ancestors = Map.fromList $ zip (tail $ Prelude.cycle c) c

All right, this is kind of confusing, and I’m not sure I’ve organised it in the best way either. The main return value is gc, but it takes three important pieces of information: the vertices, the edges and the edge weights. The vertices, outside, are all the vertices outside the cycle. That’s not too complicated, just vertices g `Set.difference` Set.fromList c. Next, outsideEdges is all edges that don’t have an endpoint that is (`elem` c). (Plus the edges in ep, more on that in a second). Finally, lambdas is complicated because of the weights of the new edges.

So the most confusing part of the function is calculation of ep and lambdas. ep contains two type of edges: outsEp, the best edge from any node in the cycle to some outside edge, and insEp is the best edge to any node in the cycle. lambdas (outsW and insW) are the weights for these edges, which, even more confusingly, are not the same as the original weights!

Let’s step back a little. The purpose of contract is to substitute the cycle with a single node so that recurring calls to cle can eliminate the rest of the cycles. So all the edges that went in and out of the cycles’ nodes now have to go in and out of that new, single node. However, what happens if we have two edges (cycle-A, outside-C) and (cycle-B, outside-C)? We definitely want an edge (substitute-AB, outside-C), but which one should we use, cycle-A’s or cycle-B’s? And what should the weight be? Note: we can’t just keep both and call them BOTH (substitute-AB, outside-C) (substitute-AB, outside-C), at least not on a normal graph.

The way that contract solves this is to choose the edge that has the highest weight. So between (cycle-A, outside-C): 20 and (cycle-B, outside-C): 30, (cycle-B, outside-C) wins. outsEp records this as the pair ((substitute-AB,outside-C), cycle-B) so that decycle can later restore the edge (cycle-B, outside-C).

Now what about the weight? Well, for edges leaving the cycle, it’s pretty simple, it’s just the original weight—(cycle-B, outside-C)’s weight in our example. For edges entering the cycle, it’s more complicated. In fact, the weight is the weight of the original edge, plus the weight of reaching every other node in the cycle. That’s where cycleWeight comes in—for insW, the final weight is the weight of the edge in, plus the weight of going all around the cycle (cycleWeight), minus the last edge’s weight. The last edge’s weight gets left out because you just need to touch every node in the cycle, which means you stop before returning to the entry point.

One last thing: because you need to subtract that last edge’s weight in insW, you have to construct a list of the predecessor to each node in the cycle. That way, given a starting point in the cycle, you can remember what the other end is without having to walk all the way around the cycle each time. That’s what ancestors does: it walks around the cycle once, storing the endpoint for every starting point in the cycle.


I’ll finish with some utilities. I won’t describe them as closely, partly because their names should tell you almost everything about them.

-- specialised utilities --
gensym :: [String] -> String
gensym = ('$':) . concat -- HA HA
cycle :: Graph -> Maybe Graph
cycle g@(Graph vs _ _) = loop "ROOT" (Set.singleton "ROOT")
  where loop w total = case runState (dfs [] w) total of
          (Just cycle, _) -> Just cycle
          (Nothing, total) | vs == total -> Nothing
          (Nothing, total) -> loop (Set.findMin (Set.difference vs total)) total
        dfs seen w
          | w `elem` seen = return
              (Just $ g {vertices = Set.fromList $ w : takeWhile (/=w) seen})
          | otherwise = do
              put . (Set.union $ kids g w) =<< get
              return . msum =<< mapM (dfs $ w:seen) (Set.toList $ kids g w)
-- general utilities --
x |> f = f x -- flip ($) (Try F#, A Very Nice Functional Programming Language)
f & g = g . f -- flip (.)
maximumBy _ [] = error "empty list"
maximumBy f (x:xs) = loop x (f x) xs
  where loop bestVal _ [] = bestVal
        loop bestVal best (x:xs) | f x > best = loop x (f x) xs
                                 | otherwise  = loop bestVal best xs
argmax = maximumBy

Well, if you’re a Lisp hacker. gensym is a name mangler. I believe that it won’t cause name collisions in the presence of recursion, but I haven’t tested it. It might need a trailing symbol too.

cycle’s function is straightforward—detect and return cycles in a graph—but I wrote the algorithm myself for this funny graph representation, so it’s probably not optimal. It’s based on Cormen, Lieserson and Rivest’s depth-first-search. Also, there are some ridiculous shenanigans with the State monad to remember what nodes have been seen so far. Also also, I don’t use Sets as widely as you might expect because it turns out that I need the cycle to be ordered in order to construct the ancestors list in contract.

maximumBy is the Haskell-esque name for argmax, but I used argmax for this code because I kinda-mostly stuck with the original names for the code. (Also it’s shorter.) I don’t apologise for the F# functions at all.


Next time, a worked example! Then after that, some tests and a Python version of the code! Which may just be a code dump since I don’t want to write about it!

Please report errors and improvements in this version. This is such a long post that I’m sure I messed up something.

*Recursive.

Permalink 11:06:11 am, 67 words
Categories: Code

Learning F# mentally wise

This is just a stand-in post until I finish the Epic Code Explanation of a Chu-Liu-Edmonds parser implementation in Haskell.

Sometimes, questions on Stack Overflow are unintentionally funny. Such as: How would you approach mentally wise in learning F#.

The best answer? “Get baked out of your mind before starting. (Not sure that actually worked though.)”

Although, to be fair, that answer was for Lisp, not F#.

21/01/10

Permalink 09:11:46 am, 1197 words
Categories: Games

LostWinds : Winter of Melodias

As a programmer, I use the power of indirection all the time. While indirection grants you flexibility, it also makes things less efficient. That’s the story of LostWinds’ innovation: instead of a Jump button, you get a Swish button that makes the wind blow. You can Swish your player to jump, but you can also Swish enemies, items and scenery to do lots of other things. LostWinds replaces several special-purpose buttons with one general mechanism. That’s the elegance of indirection.

I wish it worked in LostWinds. It doesn’t, really. It seems like replacing one button with one gesture would get you the simplicity of Mario’s 1-button jump with the flexibility of Metal Gear’s 8-button CQC system. Well, you kind of get that…about 33% of the time. Another 1/3 of the time, the game misinterprets your gesture and the last 1/3, the Wii fails to register your gesture at all.

LostWinds is a platformer that is mostly about puzzles instead of action. It tries very hard to exude an indy feel: there’s a little direct story-telling but most of it is picked up along the way or, even better, exuded by sheer force of personality. There are numerous variations to the gameplay in a short amount of time. In this respect, it’s a good game. If you are interested in playing, though, you should start with the first of the two games. The second game assumes that you have already played the first.

This assumption is a problem because the learning curve on basic jumping is huge—3-4 hours for me—and since the designers assume that you have already learned how to platform in the first game, the first hour of the game plops a timer on you as you platform from safe point to safe point.

For the beginner, this is supremely frustrating in many highly innovative ways. The timer gives you about enough time to make two mistakes on the way to the next safe point. After the first mistake, you have to decide whether to keep trying. If you opt to go back to the last safe point and start over, you spend extra time. If you try a second time, the Wiimote starts buzzing madly as the timer continues to run down. But if you’re like me and only start up the Wii once a month, that means your Wiimote battery is now running down almost as fast as your timer. So you run up to the ledge, make the attempt, fail it, and run back to the safe point. Repeat. Repeat. Repeat. Add 5 seconds of dead time between each attempt to learn how to platform in this stupid game.

If you’re going to make players suffer like this, don’t put it at the beginning of the game! Do like Metroid’ 2 (and Mario World, and Mario Galaxy…) and at least put it partway in, after the new players have a chance to figure out the controls.

Even after I mastered the jumping, there were new and exciting ways to fail. There is a section when you have to lift 3 items to the top of the screen. These items have two salient points: (1) they are round and (2) they burn. So, unlike Our Hero, they roll around when flung upwards onto tiny platforms. Have you played the awesome game Population Tire? Imagine that 3 screens taller, made a tiny bit easier by having intermediate platforms. Except that, if you Swish too close to a torch, the tire burns up and you have to start over at the bottom. Actually, I encourage you to play Population Tire and then play the Wii version. Even though the Wii version is considerably slowed down, it’s still harder than using a mouse.

I finally learned (from watching a video walkthrough) that you can use the Guide technique to lift non-Hero items gently along a path, which minimises rolling at the end. That way you just have to worry about the fire.

A final technique that I never got good at was drawing circles. I don’t know if it’s a problem with how I hold the Wiimote, or how the software recognises the circles, or maybe just a mismatch with human physiology, but drawing circles is NOT easy. Drawing polygons is easy enough, but they’re never really round. More…squished. I have this problem in Okami too, where circles are just as important, although at least Okami pauses completely while you’re drawing instead of going into slow-motion.

So I didn’t like the gameplay. Unlike Banjo-Kazooie, which also innovated its way into an annoying game, I have a harder time disliking the game as a whole. It helps that outside gameplay, the presentation and story-telling are appealing and cute. I think the more important reason is that the innovation here isn’t inherently a bad idea (unlike “vehicle-based platformer"). It’s just crippled by a crappy input device and slightly dodgy recognition. This game would be so much better with a mouse, it’s not even funny. Seriously. I really hope it’s ported to PC at some point. If it were not a downloadable game, even 360/PS3 ports would be feasible because they could just pack in a $5 USB mouse.

One important point I take away from this is that I am just about done with the Wii. The platform, like most winning crappy-grey-box platforms, is short on inherent value compared to alternatives–good software is good in spite of the platform, not because of it. But there’s not much good software on the Wii, at least compared to 3 years into the PS2’s life. In other words, I really wish I had bought a Wii at the end of its lifetime like I did the PS2. That way I could buy only the really good games and not have to suffer through the gimmicky or unusable innovations. As I think back over every game that has used the Wiimote well so far, they would ALL (all!) be better with mouse control*. The things they do are cool, but the Wiimote is so finicky and inaccurate that in all cases, there is (1) a huge training time and (2) I am MORE likely to quit the game out of annoyance.

I’m not sure what my final opinion of the game is; I think I’d like it a lot better after a second playthrough. Maybe it’s just one of those games you have to play more than once. One thing I’m sure about: this is the last time I buy a game based solely on the Brainy Gamer’s recommendation. Our tastes are just too different. I think Michael is the typical arts professor: he values certain things highly, like emotion and innovation, and has a huge capacity to enjoy a game despite its problems. As for me, I don’t value innovation nearly as much, and I can’t ignore problems as well.

*In case you’re wondering, the list is: World of Goo, LostWinds, Metroid’ 3 and Okami.

But there’s a another list of games for which Wiimote control is peripheral. Mario Galaxy and Zelda: Twilight Princess wouldn’t lose much on a Gamecube controller except arm fatigue. And games like Fire Emblem and Smash Brothers encourage you to use a Gamecube controller.

1 2 3 4 5 6 7 8 9 10 11 ... 58 >>

Nathan Sanders : Journal

powered by b2evolution free blog software