| « Chu-Liu-Edmonds parser implementation : a worked example | Learning F# mentally wise » |
This presentation is taken from discussion from the parsing read group in the Indiana University computational linguistics group, and from the book my advisor Sandra Kuebler co-authored, Dependency Parsing. I haven’t read the rest of the book since it came out while I was in Seattle, but I can only assume that it’s a great place to learn about dependency parsing.
Once upon a time, I believed that if I could implement an algorithm, I would understand it. Grad school made me grow out of the assumption—I’ve implemented several algorithms that I didn’t understand. Still, implementing an algorithm helps me understand it better. Blogging about it improves my understanding even more. So over Christmas I implemented the Chu-Liu-Edmonds parser. It’s a parser co-developed by one of the other authors of Dependency Parsing (who, surprise surprise, now works at Google). The algorithm is very cool, and not terribly difficult to implement. However, I don’t completely understand it, and I’ve been sitting on this post for several weeks. So I think I’m just going to post what I have and if you are curious, you can read the original paper for a good explanation.
Note: if my layout and your browser conspire to make the viewable code width less than 80 characters, try viewing the complete file in a separate window: ChuLiuEdmonds.hs.
The Chu-Liu-Edmonds parser is a dependency parser that works in a very clever way. It frames the parsing problem as a graph problem: start with a fully connected graph and recursively remove the “right” edges until you are left with a tree. This approach allows you to take advantage of efficient graph algorithms already developed by computer scientists. No need to roll your own code. Actually, the same thing is happening in phonology as it switches from Optimality Theory (OT) to the more powerful Harmonic Grammar (HG). OT doesn’t map well to existing algorithms, so the custom ones linguists have made are not any faster than the HG ones, because HG is basically just a linear programming problem.
Anyway, the real question is what are the “right” edges to be removed. The starting edges are estimates of the likelihood of a dependency between two words. So the edge between the subject and the predicate should be much more likely than the edge between the subject and a preposition much later in the sentence. The best edges are the ones that contribute overall to the highest-likelihood tree.
So that gives us our terminal case: if the graph is acyclic, then it’s a tree, and if we have recursively removed only the least likely edges, then it’s the most likely parse. For the recursive case, where the graph has cycles, then the cycles have to be broken in the optimal way. Note: “the graph” refers to a particular greedy transformation of the graph, specified below.
The recursive case is where things get tricky: one cycle is broken at each recursive step. I understand how one cycle is broken, but I’m not sure I understand how it recursively breaks more than one. I do know that the basic idea is to:
Actually, at a high level, I think I get it. It’s the low-level that is so confusing. Let me start presenting the code and see if that helps.
I’ll start with the graph data structure. This is a naive graph implementation that closely apes the pseudocode given in the book. There is a footnote that mentions that a O(n2) bound is possible, given clever graph data structures and algorithms. This implementation is not clever.
type Vertices = Set.Set String
type Edges = Set.Set (String,String)
type Weights = Map.Map (String,String) Double
data Graph = Graph { vertices :: Vertices
, edges :: Edges
, weights :: Weights } deriving (Eq)
This is not the typical adjacency list implementation of a graph; instead, the edges are stored as lists of pairs (fromEdge, toEdge). This makes it easier to get all edges entering a vertex as well as leaving it. But not faster, both are equally slow this way. weights stores the edges redundantly because of the way the graph is copied during the recursion; weights may contain weights for edges that are not in the current graph, but were used in an earlier version before some cycles had edges removed. (I think it’s possible to avoid this redundancy, but I haven’t convinced myself of it.)
Then there are some functions that operate on Graphs. Since I stuck everything in the same module, I’m pretty sure this is not a water-tight Graph interface, but it’s close.
weightTo g wj wi = (wi,wj) `Map.lookup` weights g |> fromMaybe (-1) weightFrom g wi wj = (wi,wj) `Map.lookup` weights g |> fromMaybe (-1) edgeTo g wj wi = (wi,wj) `Set.member` edges g edgeFrom g wi wj = (wi,wj) `Set.member` edges g verticesW = vertices & Set.delete "ROOT" -- just the word vertices verticesL = vertices & Set.toList -- convert to list kids g w = Set.filter (g `edgeFrom` w) (vertices g) parents g w = Set.filter (g `edgeTo` w) (vertices g)
weightTo/From and edgeTo/From are intended to be used curried, usually like you see in kids:
Set.filter (g `edgeFrom` w) (vertices g)
This gives you all of g’s vertices that the vertex w points to. (fromMaybe (-1) and verticesW are needed because of irregularities resulting from the way I included the ROOT in the graph.)
After the data structure, we start with the actual parser. First is a convenience function cryptically called h in the pseudocode. I think “h” stands for “parse". Yeah. Maybe it’s a German abbreviation.
h :: [String] -> Weights -> Graph
h s lambdas = cle (Graph (Set.fromList s)
(Set.fromList $ liftM2 (,) s s)
lambdas)
h really doesn’t tell you much except (1) how to construct a Graph given a sentence and (2) cle is the actual parser.
cle :: Graph -> Graph
cle g = g { edges = greedyArc g } |> ChuLiuEdmonds.cycle |> decycle g
greedyArc :: Graph -> Edges
greedyArc g = Set.map (\ wj -> (argmax (g `weightTo` wj) (verticesL g),wj))
(verticesW g)
cle operates in three steps. First, it runs greedyArc, which throws out all but the most likely edge that enters each vertex, which is the definition of “most likely overall sentence", I guess because it most closely approximates a tree originating at ROOT, since ROOT has no edges entering it. cycle checks this altered graph for cycles. Then it hands the graph off to decycle, which checks for the base case—if no cycles, then the altered graph is a tree, and hence the answer. Otherwise decycle removes a cycle from the original and recurs on cle. In fact, let’s look at decycle right now. By the way, if you’re not used to reading my code, (|>) is flip $ and (&) is flip (.). They facilitate left-to-right reading.
decycle :: Graph -> Maybe Graph -> Graph
decycle g Nothing = g { edges = greedyArc g }
decycle g (Just cyc) = -- Note: should Set.minBy weight not edge's alphabet sort
g' { edges = (Set.findMin $ Set.unions cycleBreaker) `Set.delete`
Set.unions [edges g',edges cyc,oldOuts,oldIns] }
where (gc,wc,ep) = contract g (Set.toList $ vertices cyc)
g' = cle gc
oldIns = g' `parents` wc |> Set.map (\ wi -> (wi,ep ! (wi,wc)))
oldOuts = g' `kids` wc |> Set.map (\ wj -> (ep ! (wc,wj), wj))
cycleBreaker = map (snd & \ wj -> Set.filter (snd & (==wj)) (edges cyc))
(Set.toList oldIns)
The first parameter to decycle is the graph and the second is the cycle, if any. Like I said, if there is no cycle, then the parser is done; the answer is the most likely tree I described in the previous paragraph. Otherwise, the answer is the graph obtained by a recursive call to cle, except for one edge: the least likely edge in the set cycleBreaker.
The tricky part is in the where-clause leading up to the definition of cycleBreaker. Three things happen here:
Anyway, with all this backwork done, decycle returns the cleaned-up g’, plus restoring all the edges of the cycle (edges cyc) and all the edges entering (oldIns) and leaving (oldOuts) the cycle. Except for one of the edges in cycleBreaker. The edge is the minimum edge, which right now is the alphabetically minimum one. That’s wrong, it should be minBy the weight not the edge’s endpoint’s names. I haven’t had time to fix it yet.
OK, only one more core function to go: contract. All contract does is substitute a temporary node for a cycle and record all the information needed to put the cycle back in later. However, it’s surprisingly complicated, partly (mostly?) because the weights of the new edges have to be calculated.
contract :: Graph -> [String] -> (Graph,String,Map.Map (String,String) String)
contract g c = (gc,wc,ep)
where gc = Graph (wc `Set.insert` outside)
(outsideEdges `Set.union` Map.keysSet ep)
lambdas
wc = gensym c
outside = vertices g `Set.difference` Set.fromList c
outsideEdges = Set.filter (\ (x,y) -> not (x `elem` c || y `elem` c))
(edges g)
-- ep => endpoints. It's a mapping from arcs to
-- the new virtual arc back to the node that it originally pointed at
ep = Map.fromList (insEp ++ outsEp)
lambdas = Map.unions [weights g, Map.fromList outsW, Map.fromList insW]
outsEp = [((wc,wj),argmax (g `weightTo` wj) c)
| wj <- Set.toList outside, any (g `edgeTo` wj) c]
insEp = [((wi,wc),argmax (maxCycle wi) c)
| wi <- Set.toList outside, any (g `edgeFrom` wi) c]
outsW = [((wc,wj), weights g ! (wi,wj)) | ((wc,wj),wi) <- outsEp]
insW = [((wi,wc), maxCycle wi wj + cycleWeight) | ((wi,wc),wj) <- insEp]
where cycleWeight = sum [weights g ! (ancestors ! w, w) | w <- c]
maxCycle wi wj = weightFrom g wi wj - weightFrom g (ancestors ! wj) wj
ancestors = Map.fromList $ zip (tail $ Prelude.cycle c) c
All right, this is kind of confusing, and I’m not sure I’ve organised it in the best way either. The main return value is gc, but it takes three important pieces of information: the vertices, the edges and the edge weights. The vertices, outside, are all the vertices outside the cycle. That’s not too complicated, just vertices g `Set.difference` Set.fromList c. Next, outsideEdges is all edges that don’t have an endpoint that is (`elem` c). (Plus the edges in ep, more on that in a second). Finally, lambdas is complicated because of the weights of the new edges.
So the most confusing part of the function is calculation of ep and lambdas. ep contains two type of edges: outsEp, the best edge from any node in the cycle to some outside edge, and insEp is the best edge to any node in the cycle. lambdas (outsW and insW) are the weights for these edges, which, even more confusingly, are not the same as the original weights!
Let’s step back a little. The purpose of contract is to substitute the cycle with a single node so that recurring calls to cle can eliminate the rest of the cycles. So all the edges that went in and out of the cycles’ nodes now have to go in and out of that new, single node. However, what happens if we have two edges (cycle-A, outside-C) and (cycle-B, outside-C)? We definitely want an edge (substitute-AB, outside-C), but which one should we use, cycle-A’s or cycle-B’s? And what should the weight be? Note: we can’t just keep both and call them BOTH (substitute-AB, outside-C) (substitute-AB, outside-C), at least not on a normal graph.
The way that contract solves this is to choose the edge that has the highest weight. So between (cycle-A, outside-C): 20 and (cycle-B, outside-C): 30, (cycle-B, outside-C) wins. outsEp records this as the pair ((substitute-AB,outside-C), cycle-B) so that decycle can later restore the edge (cycle-B, outside-C).
Now what about the weight? Well, for edges leaving the cycle, it’s pretty simple, it’s just the original weight—(cycle-B, outside-C)’s weight in our example. For edges entering the cycle, it’s more complicated. In fact, the weight is the weight of the original edge, plus the weight of reaching every other node in the cycle. That’s where cycleWeight comes in—for insW, the final weight is the weight of the edge in, plus the weight of going all around the cycle (cycleWeight), minus the last edge’s weight. The last edge’s weight gets left out because you just need to touch every node in the cycle, which means you stop before returning to the entry point.
One last thing: because you need to subtract that last edge’s weight in insW, you have to construct a list of the predecessor to each node in the cycle. That way, given a starting point in the cycle, you can remember what the other end is without having to walk all the way around the cycle each time. That’s what ancestors does: it walks around the cycle once, storing the endpoint for every starting point in the cycle.
I’ll finish with some utilities. I won’t describe them as closely, partly because their names should tell you almost everything about them.
-- specialised utilities --
gensym :: [String] -> String
gensym = ('$':) . concat -- HA HA
cycle :: Graph -> Maybe Graph
cycle g@(Graph vs _ _) = loop "ROOT" (Set.singleton "ROOT")
where loop w total = case runState (dfs [] w) total of
(Just cycle, _) -> Just cycle
(Nothing, total) | vs == total -> Nothing
(Nothing, total) -> loop (Set.findMin (Set.difference vs total)) total
dfs seen w
| w `elem` seen = return
(Just $ g {vertices = Set.fromList $ w : takeWhile (/=w) seen})
| otherwise = do
put . (Set.union $ kids g w) =<< get
return . msum =<< mapM (dfs $ w:seen) (Set.toList $ kids g w)
-- general utilities --
x |> f = f x -- flip ($) (Try F#, A Very Nice Functional Programming Language)
f & g = g . f -- flip (.)
maximumBy _ [] = error "empty list"
maximumBy f (x:xs) = loop x (f x) xs
where loop bestVal _ [] = bestVal
loop bestVal best (x:xs) | f x > best = loop x (f x) xs
| otherwise = loop bestVal best xs
argmax = maximumBy
Well, if you’re a Lisp hacker. gensym is a name mangler. I believe that it won’t cause name collisions in the presence of recursion, but I haven’t tested it. It might need a trailing symbol too.
cycle’s function is straightforward—detect and return cycles in a graph—but I wrote the algorithm myself for this funny graph representation, so it’s probably not optimal. It’s based on Cormen, Lieserson and Rivest’s depth-first-search. Also, there are some ridiculous shenanigans with the State monad to remember what nodes have been seen so far. Also also, I don’t use Sets as widely as you might expect because it turns out that I need the cycle to be ordered in order to construct the ancestors list in contract.
maximumBy is the Haskell-esque name for argmax, but I used argmax for this code because I kinda-mostly stuck with the original names for the code. (Also it’s shorter.) I don’t apologise for the F# functions at all.
Next time, a worked example! Then after that, some tests and a Python version of the code! Which may just be a code dump since I don’t want to write about it!
Please report errors and improvements in this version. This is such a long post that I’m sure I messed up something.
*Recursive.