Here’s a quote from Real World Haskell, chapter 16:
In many popular languages, people tend to put regular expressions to work for “casual” parsing. They’re notoriously tricky for this purpose: hard to write, difficult to debug, nearly incomprehensible after a few months of neglect, and they provide no error messages on failure.
If we can write compact Parsec parsers, we’ll gain in readability, expressiveness, and error reporting. Our parsers won’t be as short as regular expressions, but they’ll be close enough to negate much of the temptation of regexps.
I couldn’t agree more. Regular expressions have two problems. The first is obvious: they look like line noise. Of course, you can argue that a compact representation is a boon if you are careful with your formatting. And you can use tricks like those I talked about in “Make Regular Expressions Suck Less” to make things more readable.
The more subtle problem is that regular expressions are exactly what they say: regular. If your parsing problem isn’t regular, you’ll spend a lot of time writing an expression with leaks that can never be fully plugged. And it’s hard to tell when a problem is regular or not, so you should probably default to something that has more power for anything more than truly simple cases. My favourite cases are CSV files and e-mail addresses. The full CSV specification is not regular; while e-mail addresses are regular, they require a monstrous regex to parse. Both should be specified with a more powerful parser.
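To make the CSV case concrete, here is a minimal Parsec sketch of a CSV parser. It is only an illustration under stated assumptions: the names (csvFile, record, field, parseCSV) are my own, it assumes every record ends with a line break, and it handles just the parts that naive splitting usually gets wrong, namely quoted fields containing commas, doubled quotes, and embedded line breaks.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A file is zero or more records, each terminated by a line ending.
-- (Assumption for this sketch: the last record also ends with one.)
csvFile :: Parser [[String]]
csvFile = record `endBy` eol <* eof

-- A record is one or more comma-separated fields.
record :: Parser [String]
record = field `sepBy1` char ','

-- A field is either quoted or a run of ordinary characters (possibly empty).
field :: Parser String
field = quotedField <|> many (noneOf ",\"\r\n")

-- Inside quotes, commas and line breaks are plain data.
quotedField :: Parser String
quotedField =
  between (char '"') (char '"' <?> "closing quote") (many quotedChar)

-- A doubled quote inside a quoted field stands for one literal quote.
quotedChar :: Parser Char
quotedChar = noneOf "\"" <|> try ('"' <$ string "\"\"")

eol :: Parser String
eol = try (string "\r\n") <|> string "\n" <|> string "\r" <?> "end of line"

-- Hypothetical entry point: Left carries a located error message.
parseCSV :: String -> Either ParseError [[String]]
parseCSV = parse csvFile "(input)"
```

With these definitions, parseCSV "a,\"b,c\"\n" yields Right [["a","b,c"]], while an unterminated quote such as parseCSV "a,\"b\n" yields a Left whose message points at the position where the closing quote was expected. That error report is exactly what a failed regex match does not give you.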
Regexes are fine for use with grep and editor search/replace, or for one-off scripts whose output will be inspected by a human. But you should be uneasy if you start using them in the middle of permanent software*. Their regularity means they are not powerful enough for most parsing, and the line-noise property makes them unsuitable for the remaining regular cases that run too long.
To conclude, I wanted to put a rip-roaring recommendation of Parsec or at least some parsing package in here (every language has one). Unfortunately, I’m just learning Parsec myself, and having a bit of trouble. Parsing is just hard. No two ways around it. But the safest way is to do something well or not at all. I usually take the second route but I need to know how to take the first when it’s really necessary**.
Note: Real World Haskell uses the phrase “in many popular languages” all over the book when they want to refer to some language with a bad cultural habit without naming names. The language they’re talking about here is Perl. Perl 6 (may it soon be released) notably fixes the regex problem by introducing grammars consisting of rules.
*You can make an exception for heuristics, which you know to be lossy, although in this case you still might want a context-free parser working for you.
**Rolling your own parser from scratch, while fun, does not count.
Note: Real World Haskell uses the phrase “in many popular languages” all over the book when they want to refer to some language with a bad cultural habit without naming names. The language they’re talking about here is Perl.
I disagree; I now work in a Python shop and I've encountered quite a bit of Python code that sends a regexp to do a parser's job.
Also, these non-parsers are typically trying to handle HTML, XML, or some other SGML derivative. Even in Perl, open-source parsers for these languages have been available since way back.
while e-mail addresses are regular, they require a monstrous regex to parse
Just a pedantic note: email addresses are not regular, because they can contain arbitrary levels of nested comments. The Mail::RFC822::Address module addresses this by stripping out the comments before matching with its regular expression.
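To illustrate why nesting is the sticking point, here is a small Parsec sketch that strips arbitrarily nested parenthesised comments. It is an assumption-laden illustration, not how Mail::RFC822::Address works: the names (comment, stripComments) are hypothetical, and it ignores quoted strings and the rest of the address grammar. The recursion in comment is the part a true regular expression cannot express.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A comment is "(" ... ")" whose body may itself contain comments.
comment :: Parser ()
comment = between (char '(') (char ')') (skipMany piece)

-- One piece of a comment body: a nested comment, a backslash-escaped
-- character, or a run of ordinary characters.
piece :: Parser ()
piece = comment
    <|> (() <$ (char '\\' *> anyChar))
    <|> skipMany1 (noneOf "()\\")

-- Text outside comments is kept; each comment contributes nothing.
chunk :: Parser String
chunk = ("" <$ comment) <|> many1 (noneOf "()")

-- Hypothetical helper: return the input with all comments removed,
-- or fail on unbalanced parentheses.
stripComments :: String -> Either ParseError String
stripComments = parse (concat <$> many chunk <* eof) "(input)"
```

For example, stripComments "john(a (nested) note)@example.com" gives Right "john@example.com", and an unbalanced input such as "john(oops@example.com" gives a Left with the position of the problem.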