| « Python's iterators are a bad implementation of laziness : with concrete example | Wednesday is Backwards Day » |
Well, after the depressing episode with iterators yesterday, it feels kind of weird to show the other end of s-expressions today: parsing. I normally would not write this in Python, but I need to parse s-expressions into trees so I can use an existing Python library to process it. Since I was sort of on vacation for the last couple of weeks, I spent way more time than necessary building monadic parser combinators to do the parsing.
As always, I feel obligated to link to a more serious implementations of monadic parser combinators in Python in case you really want to use them yourself.
This is a gratuitous explanation of monadic parsing in Python. I’m on vacation, so I decided to write my own s-expression parser instead of finding some code or using Python’s equivalent of yacc. Here’s the s-expression grammar I intended to use:
However, I didn’t come up with a good way to implement +, or to integrate regexes into my system. So here’s the grammar I actually used:
This should work fine for the sentences I need to parse, for example this one:
test2 ="" " (S (VVPS POPPHH) (VNDD HVPS) (VVIV VVSN) (NP (CONJP (NN PR) (IK++ PN) (NN PR) (IK++ NNDDSS) (NN PORP) (IK++ POOP) (NN VVPT) (RO PR)) (NN__SS PODP)) (NP (PN____GG NNDD) (NN IP))) """
The output of the parser will be a tuple tree, which in Haskell would have the following type:
data TupleTree a = TupleTree (a, [TupleTree a])
Since Python doesn’t support pattern matching, only having one constructor is a nice feature. Leaves are distinguished by having no children.
So anyway. I’m going to implement monadic parser combinators in Python; if you don’t know what those are, stick around—they are a lot simpler than they sound. The word ‘monadic’ means that the code uses a specific interface to do its work, one that is (1) purely functional (no side effects) and (2) easy to wrap up in a nice interface. The word ‘parser’ is obvious (I hope), and the word ‘combinator’ just means that simple parsers can be combined with operators like OR to produce complex parsers.
The simplest parser is skipspace. It just skips spaces without returning them.
def skipspace(s, i):
while i < len (s) and s[i].isspace():
i += 1
return (s, i), ()
As you can see, I have defined a parser as a function that takes a string and an index and returns the string, a new index, and a value. Since skipspaces doesn’t produce a value, it returns the empty tuple for its value. Of *course*, since this is Python, you could choose to model all this stuff as classes. I will eventually, but it adds a lot of boilerplate to the implementation. Unfortunately, this means that you’ll have to understand nested functions in the meantime.
def char(c):
def parse(s, i):
if i >= len(s):
return (s,i), Exception("Ran off end of string")
if s[i]==c:
return (s, i+1), c
else:
return (s,i), Exception("'%s' expected at %d, but '%s' was found instead. [%s]" %
(c,i,s[i], s[max(0, i-5):min(len(s), i+5)]))
return parse
def atom(s, i):
if i >= len(s): return (s,i), Exception("Ran off end of string")
start = i
while i < len (s) and not s[i].isspace() and not s[i]==')' and not s[i]=='(':
i += 1
if i==start:
return (s,i), Exception("Atom not found at %d; '%s' found instead." % (i, s[i]))
else:
return (s, i), s[start:i]
char and atom are a bit more complicated than skipspace because they return a value and also give debugging info. This fills in one more piece of the interface that I skipped earlier; if the value is an Exception, then it indicates that a parse error has occurred. Notice that the exception is returned, not raised. The driver code I show later will raise the exception.
So far the code I have shown is very low-level and not much like the grammar I want to copy. Let’s fix that with sexp and sexplist. They closely reflect the grammar specification:
def sexp(s, i):
(s,i), c = char('(')(s, i)
if c=='(':
(s,i), sexps = sexplist(s, i)
(s,i), _ = skipspace(s, i)
(s,i), _ = char(')')(s, i)
return (s,i), (sexps[0][0], sexps[1:])
else: # c isinstance Exception
(s,i), a = atom(s, i)
if isinstance(a, Exception):
return (s,i), a
else:
return (s,i), (a, [])
def sexplist(s, i):
(s,i), _ = skipspace(s, i)
(s,i), sexp1 = sexp(s, i)
if isinstance(sexp1, Exception):
return (s,i), []
else:
(s,i), rest = sexplist(s, i)
if isinstance(rest, Exception):
return (s,i), rest
else:
return (s,i), [sexp1]+rest
So let’s see…sexp checks for an open parenthesis and if it finds one, looks for a sexplist and then a close parenthesis. Otherwise, it returns an atom. Either way, the value returned is a pair of atom and its children–but a leaf has no children. sexplist is similar, except that it looks for an initial sexp and returns [] if it can’t find it. (Note: I wrote sexplist recursively, but a while-loop version with a list accumulator would probably be cleaner in Python.)
def run(s):
(s,i), _ = skipspace(s, 0)
(s,i), answer = sexp(s, i)
(s,i), _ = skipspace(s, i)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
There are a few things to notice about this style. First, modulo the exception-checking, the code adheres pretty closely to the grammar specification. There’s a lot of boilerplate and calls to skipspace, but the basic structure is similar. That’s the advantage of this style. Second, there’s a lot of boilerplate on each line as the string and index are updated. Even when you don’t care about the value returned by a parser, you have to call it something (I use _). Also, there’s a lot of checking for Exception to see if the parse has failed. In fact, it should really happen after every parser call, I was just too lazy to put all that checking in.
Fortunately, both the assignment and error-checking boilerplate is extremely regular and easily factored out, leaving the grammar-like advantages intact. This is where the monad functions come into play. I’ll define and use those next time.
For now, here are some examples of the parsers:
>>> run('foo')
('foo', [])
>>> ('atom', [('foo', []), ('bar', [])])==run('(atom foo bar)')
True
>>> run(test2)
('S', [('VVPS', [('POPPHH', [])]), ('VNDD', [('HVPS', [])]), ('VVIV', [('VVSN', [])]), ('NP', [('CONJP', [('NN', [('PR', [])]),
('IK++', [('PN', [])]), ('NN', [('PR', [])]), ('IK++', [('NNDDSS', [])]), ('NN', [('PORP', [])]), ('IK++', [('POOP', [])]),
('NN', [('VVPT', [])]), ('RO', [('PR', [])])]), ('NN__SS', [('PODP', [])])]), ('NP', [('PN____GG', [('NNDD', [])]),
('NN', [('IP', [])])])])