| « Monadic parsing in Python, part 3 | Python's iterators are a bad implementation of laziness : with concrete example » |
In part 1, I showed a fairly annoying but nicely regular method of parsing. Today I will show how a lot of the annoyingness can be hidden with a few driver functions. These functions will allow us to succinctly combine the simple parsers like char, skipspace and atom into complex parsers.
The functions are monadic, which means that they follow a well-known design pattern from functional programming. It’s not really important here because (1) I don’t generalise beyond parsing and (2) Python’s syntax makes it hard to make them convenient to use. But that’s why I chose some odd names like ‘ret’, short for return. Here’s the code:
def chain(parser1, parser2):
def chainedparse(s, i):
(s,i), a = parser1(s, i)
if isinstance(a, Exception): return (s,i), a
return parser2(s, i)
return chainedparse
def bind(parser, f):
def parse(s,i):
(s,i), a = parser(s,i)
if isinstance(a, Exception): return (s,i), a
return f(a)(s,i)
return parse
def ret(x):
def parse(s,i):
return (s,i), x
return parse
def fail(msg):
def parse(s,i):
return (s,i), Exception(msg)
return parse
The easiest to explain is chain, which runs two parsers but only returns the result of the second parser. This is useful for parsers like skipspace, where you can write (s,i), a = chain(skipspace, atom)(s,i) and avoid creating a bunch of variables named _. You can see that chain implements a single step of the boilerplate as well as the error checking that I was too lazy to put in by hand before. Before, code like sexp was a long chain of
: (s,i), sexps = sexplist(s, i) (s,i), _ = skipspace(s, i) :
Bind is similar to chain, but the result of the first parser is passed to the second parser so it can do something useful with it. Actually, you have to wrap the second parser in a single-argument function. The parser interface doesn’t have any additional space to receive the input. But bind takes care of basically the same boilerplate as chain does. Note that if the first parser returns an exception, the second parser is never run, so an Exception short-circuits evaluation like you’d expect.
Ret takes care of the case when you want to return a value from a parser without having to add in the (s,i) boilerplate. It’s particularly useful in conjuction with bind: you can say something like bind(atom, lambda a: ret((a, []))), which gives you a parser that parses an atom and wraps it in a tuple.
Fail does basically the same thing as ret, except that it returns an exception with a message. You’ll also notice that all of these functions immediately return a nested function. According to my current model, the inner function is the parser. For example, in chain, the inner function is a parser that chains together the two parsers passed to the outer function. The chained parser can then be used just like any other function that implements the parser interface. This is the combinator part of the “monadic parser combinators". Besides the monadic combinators, there are other useful ones, like or (called alt below):
def alt(*alts):
def parse(s,i):
for alt in alts:
(snoo, inew), a = alt(s,i)
if not isinstance(a, Exception):
return (snoo,inew), a
return (s,i), a # maybe return (snoo,inew), a ?
return parse
Alt tries each parser in turn and returns the first one that succeeds. If none succeed, it returns the exception from the last alternative.
Now let’s look at how the code for complex parsers changes with these combinators, sexp and sexplist:
def sexp(s, i):
(s,i), c = char('(')(s, i)
if c=='(':
return bind(sexplist, lambda sexps:
chain(skipspace,
chain(char(')'),
ret((sexps[0][0], sexps[1:])))))(s, i)
else: # c isinstance Exception
return bind(atom, lambda a: ret((a, [])))(s, i)
def sexplist(s, i):
(s,i), l = chain(skipspace, sexp)(s, i)
if isinstance(l, Exception):
return ret([])(s,i)
else:
return bind(sexplist, lambda rest:ret([l]+rest))(s, i)
(run and the simple parsers change very little, except to use fail, so I’ll not repeat them here.)
Actually, I didn’t use alt there. If Python supported recursive definitions or lazy definitions, alt would allow us to write point-free combinators:
sexp = alt(bind(chain(char('('), sexplist), lambda sexps:
chain(skipspace, chain(char(')'), ret((sexps[0][0], sexps[1:]))))),
bind(atom, lambda a: ret((a, []))),
fail("Couldn't find closing parenthesis"))
sexplist = alt(bind(chain(skipspace, sexp), lambda sexp1:
bind(sexplist, lambda rest:
ret([sexp1]+rest))),
ret([]))
A point-free combinator is one with no arguments. In other words, since alt creates a function that conforms to the parser interface, you don’t need to explicitly create your own parser function. However, Python doesn’t support recursive definitions because of its strict top-to-bottom evaluation. So we have to wrap the
alt-parser versions in a function definition in order to manually delay evaluation:
def sexp(s,i):
return alt(bind(chain(char('('), sexplist), lambda sexps:
chain(skipspace, chain(char(')'), ret((sexps[0][0], sexps[1:]))))),
bind(atom, lambda a: ret((a, []))),
fail("Couldn't find closing parenthesis"))(s,i)
def sexplist(s,i):
return alt(bind(chain(skipspace, sexp), lambda sexp1:
bind(sexplist, lambda rest:
ret([sexp1]+rest))),
ret([]))(s,i)
This is about the end of the road for semantic elimination of boilerplate. From here on out, it’s all syntactic tricks to make the highly nested style above more palatable; I’m no longer naive enough to think that most people will want the 4-line point-free version if one of the lines ends with 5 (five!) closing parentheses. That’s what I’ll discuss next time.