| « Monadic parsing in Python, part 4 | Monadic parsing in Python, part 2 » |
OK, are you ready to get Properly Object-Oriented? The new version of the monadic parser combinators from parts 1 and 2 of this series will have inheritance, polymorphism and information hiding like nobody’s business. Let’s start off with the Abstract Base Class*, Parser.
*Abstract Base Class not actually used. This post uses Python 2.5. This post offers no guarantees of suitability or completeness; all holes in post left as reader exercises.
class Parser():
def parse(self, s, i):
raise NotImplementedError()
def run(self, s):
(s,i), answer = self.parse(s, 0)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
def runsexp(s):
return Bind(Chain(SkipSpace(), Sexp()),
lambda answer: Chain(SkipSpace(), Return(answer))).run(s)
A parser is now a class with a parse method that takes a string and an index and returns a tuple of (string,index) along with a value. This regularises the usage and construction of parsers a bit, as you can see in runsexp as a bit of a preview. run and runsexp are now properly factored into two pieces; run will run any parser, while runsexp runs our particular S-expression grammar on a parser.
Here are the simple parser combinators. They’re mostly the same, but are now OO. They inherit from Parser and implement the polymorphic parse method. And uh, they don’t really hide information. But you will notice that the inner functions named parse from last time have now become Parser.parse.
class Atom(Parser):
def parse(self, s, i):
if i >= len(s): return Fail("Ran off end of string").parse(s,i)
start = i
while i < len (s) and not s[i].isspace() and not s[i]==')' and not s[i]=='(':
i += 1
if i==start:
return Fail("Atom not found at %d; '%s' found instead." % (i, s[i])).parse(s,i)
else:
return (s,i), s[start:i]
class SkipSpace(Parser):
def parse(self, s, i):
while i < len(s) and s[i].isspace():
i += 1
return (s,i), ()
class Char(Parser):
def __init__(self, c):
self.c = c
def parse(self, s, i):
if i >= len(s): return Fail("Ran off end of string").parse(s, i)
if s[i]==self.c:
return (s,i+1), self.c
else:
return Fail("'%s' expected at %d, but '%s' found instead. [%s]" %
(self.c,i,s[i],s[max(0,i-5):min(len(s), i+5)])).parse(s, i)
Here are the monadic parser combinators:
class Chain(Parser):
def __init__(self, parser1, parser2):
self.parser1 = parser1
self.parser2 = parser2
def parse(self, s, i):
(s,i), a = self.parser1.parse(s, i)
if isinstance(a, Exception): return (s,i), a
else: return self.parser2.parse(s, i)
class Bind(Parser):
def __init__(self, parser1, f):
self.parser1 = parser1
self.f = f
def parse(self, s, i):
(s,i), a = self.parser1.parse(s, i)
if isinstance(a, Exception): return (s,i), a
else: return self.f(a).parse(s, i)
class Return(Parser):
def __init__(self, x):
self.x = x
def parse(self, s, i):
return (s,i), self.x
class Fail(Parser):
def __init__(self, msg):
self.msg = msg
def parse(self, s, i):
return (s,i), Exception(self.msg)
class Or(Parser):
def __init__(self, *alts):
self.alts = alts
def parse(self, s, i):
for alt in self.alts:
(snoo,inew), a = alt.parse(s, i)
if not isinstance(a, Exception):
return (snoo,inew), a
return (snoo,inew), a # (s,i), a
This is all very nice, but I would like to point out that it’s much more verbose than the functional style. Unless you are an OO zealot, the only two reasons to make this change would be to (1) help fellow programmers who can’t read functional style yet or (2) take advantage of operator overloading, which is only available to classes in Python.
In fact, changing the parser superclass to the following will let us use nicer syntax. We just need to add a few magic methods to the base class:
class Parser():
def parse(self, s, i):
raise NotImplementedError()
def run(self, s):
(s,i), answer = self.parse(s, 0)
if i==len(s) and not isinstance(answer, Exception):
return answer
else:
raise answer
def __rshift__(parser1, parser2):
return Chain(parser1, parser2)
def __getitem__(parser, f):
return Bind(parser, f)
def __or__(parser1, parser2):
return Or(parser1, parser2)
This change allows us to write Sexp and SexpList like this:
class Sexp(Parser):
def parse(self, s, i):
return ((Char('(') >> SexpList())[lambda sexps:
SkipSpace() >> Char(')') >> Return((sexps[0][0], sexps[1:]))]
| Atom()[lambda a: Return((a, []))]
| Fail("Couldn't find closing parenthesis")).parse(s, i)
class SexpList(Parser):
def parse(self, s, i):
return ((SkipSpace() >> Sexp())[lambda sexp1:
SexpList()[lambda rest: Return([sexp1]+rest)]]
| Return([])).parse(s, i)
I’m not sure how much I like the bind syntax—I just chose something that would let me leave off parentheses around lambda—but the other two operators are clear winners, I think. Writing SexpList like this also shows that the recursive definition is now simpler than the manual implementation with a
while-loop.
For the final entry in this series, I’ll look at the performance of the three variants of this style. I may also look at the performance of a hand-rolled state machine, a recursive descent parser or a yacc-generated parser. But I really doubt it.