| « Nested code preference | An untyped Python module to match the typed module » |
I hate regular expressions. I know, I know. It’s CRAZY for me to say that. I work on computational phonology, and phonology is basically a regular language. I’ve taken classes in which finite state machines, the theory behind regular expressions, figured prominently, and we had to implement our own regex engine. I am training a probabilistic finite state machine, a hidden Markov model, as we speak. Err, well, actually it’s buggy right now, and I’m not speaking, so I actually I will resume debugging it when I finish writing this post. But you get the idea.
The problem that is unique to me is that actually I can’t use normal regular expressions for phonology because they are deterministic. They aren’t meant to capture variation or uncertainty. But most people don’t have that problem. And I don’t have that problem for my everyday tasks of parsing corpora and config files. Regular expressions still have problems in their intended niche. Problem Number One, and one that I don’t intend to fix, is that the syntax is incredibly terse. Probably half of Perl’s reputation for line noise comes from the fact that Perl mongers think that every line needs a regular expression on it. You can work around Problem One by using the verbose/ignore-white-space option available in most regex engines.
Problem Number Two might be more important than Problem One, I’m not sure. It’s the complete inability to form abstractions. Without the ability to form abstractions, the ignore-white-space option is the equivalent to formatting your assembly code nicely with indentation and spaces to show structure. You still have to read one million lines of step-by-step instructions to the underlying engine. Regular expression syntax just does not provide any way to split regexes into chunks and re-use them.
Aha! But there is a way around this. In most languages, regular expressions are quoted in strings. And strings have plenty of ways to nest. String interpolation is the most obvious to me. Here’s an example. I’m parsing a config file in which hex digits figure prominently (I’ll blog later on the actual project if it turns out to be interesting). Here are the abstractions I create:
hexdigit = '[0-9A-Fa-f]'
hexbyte = '%s{2}|%s{4}' % (hexdigit, hexdigit)
hexaddress = '%s+h?' % hexdigit
hexdigits = dict(digit=hexdigit, byte=hexbyte, address=hexaddress)
I changed the name of the variables in the dictionary because the name space in regular-expression-land is a lot smaller and I wanted to type less. If I kept using hexdigits, I’d probably want to move it to its own module with its own namespace. Now, here’s the grammar*:
regexen = map(lambda (r,f):(r % hexdigits,f),
[(r"(%(byte)s)=(.*)$", Tbl._table),
(r"\$(%(byte)s)=(.*)$", Tbl._link),
(r"\((%(address)s)\)(.*)$", Tbl._bookmark),
(r"\[(%(address)s)-(%(address)s)\](.*)$", Tbl._dumpmark),
(r"{(%(address)s)\+(%(address)s)-(.*)}(.*)$", Tbl._insertmark),
(r"\*(%(byte)s)$", Tbl._endl),
(r"/(%(byte)s)$", Tbl._ends),
(r"!(%(byte)s)$", Tbl._dakuten),
(r"@(%(byte)s)$", Tbl._handakuten),
(r">$", Tbl._dakutenafter),
(r"#$", Tbl._clean),])
This is Python, so the actual syntax for string interpolation is pretty ugly: %(byte)s. I heard there was a new module shipped at some point with improved Ruby-like ${byte} syntax. Or you could just use Ruby.
But you get the idea. The regex syntax is still a dense thicket of spindly saplings, all different, but it’s a lot simpler with my personal hex digit classes inserted. Compare this: !(%(byte)s)$ with this: !([0-9A-Fa-f]{2}|[0-9A-Fa-f]{4})$. That’s the power of abstraction.
*I’ll let you imagine the interpreter for the grammar. Wait, never mind. Here it is:
def read(self, filename):
for line in open(filename):
for regex,f in regexen:
m = re.match(regex, chomp(chomp(line, '\n'), '\r'))
if m:
f(self, m)
break
**Note:grep may have a hex-digit class built-in, but I can’t tell from the documentation. To quote the documentation:
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
Self-explanatory, eh? I actually have only a guess as to the contents of [:cntrl:], [:graph:], [:print:] and [:xdigit:]. A little explanation would have been appreciated. In any case, Python’s implementation doesn’t support these classes so it’s a moot point.