« Nested code preferenceAn untyped Python module to match the typed module »

Make regular expressions suck less

04/08/08

Permalink 08:56:49 am, 708 words
Categories: Code, Linguistics

Make regular expressions suck less

I hate regular expressions. I know, I know. It’s CRAZY for me to say that. I work on computational phonology, and phonology is basically a regular language. I’ve taken classes in which finite state machines, the theory behind regular expressions, figured prominently, and we had to implement our own regex engine. I am training a probabilistic finite state machine, a hidden Markov model, as we speak. Err, well, actually it’s buggy right now, and I’m not speaking, so I actually I will resume debugging it when I finish writing this post. But you get the idea.

The problem that is unique to me is that actually I can’t use normal regular expressions for phonology because they are deterministic. They aren’t meant to capture variation or uncertainty. But most people don’t have that problem. And I don’t have that problem for my everyday tasks of parsing corpora and config files. Regular expressions still have problems in their intended niche. Problem Number One, and one that I don’t intend to fix, is that the syntax is incredibly terse. Probably half of Perl’s reputation for line noise comes from the fact that Perl mongers think that every line needs a regular expression on it. You can work around Problem One by using the verbose/ignore-white-space option available in most regex engines.

Problem Number Two might be more important than Problem One, I’m not sure. It’s the complete inability to form abstractions. Without the ability to form abstractions, the ignore-white-space option is the equivalent to formatting your assembly code nicely with indentation and spaces to show structure. You still have to read one million lines of step-by-step instructions to the underlying engine. Regular expression syntax just does not provide any way to split regexes into chunks and re-use them.

Aha! But there is a way around this. In most languages, regular expressions are quoted in strings. And strings have plenty of ways to nest. String interpolation is the most obvious to me. Here’s an example. I’m parsing a config file in which hex digits figure prominently (I’ll blog later on the actual project if it turns out to be interesting). Here are the abstractions I create:

hexdigit = '[0-9A-Fa-f]'
hexbyte = '%s{2}|%s{4}' % (hexdigit, hexdigit)
hexaddress = '%s+h?' % hexdigit
hexdigits = dict(digit=hexdigit, byte=hexbyte, address=hexaddress)

I changed the name of the variables in the dictionary because the name space in regular-expression-land is a lot smaller and I wanted to type less. If I kept using hexdigits, I’d probably want to move it to its own module with its own namespace. Now, here’s the grammar*:

regexen = map(lambda (r,f):(r % hexdigits,f),
    [(r"(%(byte)s)=(.*)$", Tbl._table),
     (r"\$(%(byte)s)=(.*)$", Tbl._link),
     (r"\((%(address)s)\)(.*)$", Tbl._bookmark),
     (r"\[(%(address)s)-(%(address)s)\](.*)$", Tbl._dumpmark),
     (r"{(%(address)s)\+(%(address)s)-(.*)}(.*)$", Tbl._insertmark),
     (r"\*(%(byte)s)$", Tbl._endl),
     (r"/(%(byte)s)$", Tbl._ends),
     (r"!(%(byte)s)$", Tbl._dakuten),
     (r"@(%(byte)s)$", Tbl._handakuten),
     (r">$", Tbl._dakutenafter),
     (r"#$", Tbl._clean),])

This is Python, so the actual syntax for string interpolation is pretty ugly: %(byte)s. I heard there was a new module shipped at some point with improved Ruby-like ${byte} syntax. Or you could just use Ruby.

But you get the idea. The regex syntax is still a dense thicket of spindly saplings, all different, but it’s a lot simpler with my personal hex digit classes inserted. Compare this: !(%(byte)s)$ with this: !([0-9A-Fa-f]{2}|[0-9A-Fa-f]{4})$. That’s the power of abstraction.


*I’ll let you imagine the interpreter for the grammar. Wait, never mind. Here it is:

    def read(self, filename):
        for line in open(filename):
            for regex,f in regexen:
                m = re.match(regex, chomp(chomp(line, '\n'), '\r'))
                if m:
                    f(self, m)
                    break

**Note:grep may have a hex-digit class built-in, but I can’t tell from the documentation. To quote the documentation:

Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].

Self-explanatory, eh? I actually have only a guess as to the contents of [:cntrl:], [:graph:], [:print:] and [:xdigit:]. A little explanation would have been appreciated. In any case, Python’s implementation doesn’t support these classes so it’s a moot point.

2 comments

Comment from: Thomack [Visitor] Email
A couple things, one I like your concept of using string mods to help provide reuse to an otherwise terse regex grammar. However for a short regex it just over complicates it. Unless of course I find myself retyping things over and over in which case I can copy/paste.

I didn't understand a few of those paragraphs in there, but I do agree regular expressions could be improved. However, I have no ideas for change so I just stick with what is out there.

My knowledge of regex's in grep is limited. I have taken up using them occasionally to avoid string together piped greps on the command line. I have found that grep regex differs from perl and python regexs in various ways and for various reasons. I also do not understand those char classes, but then I am not a robot. Clearly you are not quite a robot either.
05/08/08 @ 15:31
Comment from: sandersn [Member]
Yes, for one-offs then this is overkill. For me, the threshold for abstraction was when I noticed repetition--you're going to have to understand [0-9A-Fa-f]+h? anyway, but when you have to do it multiple times, that's when factoring it out becomes useful. It also helps that I hate regexen. The less of it I have to see the better.

Also, I later found the Python class that allows $variable syntax for string interpolation, string.Template. It makes the regexen even easier to read.
05/08/08 @ 16:07

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by b2evolution free blog software