Avoid Casual Parsing

15/01/09

Permalink 11:07:39 am, 487 words
Categories: Code

Here’s a quote from Real World Haskell, chapter 16:

In many popular languages, people tend to put regular expressions to work for “casual” parsing. They’re notoriously tricky for this purpose: hard to write, difficult to debug, nearly incomprehensible after a few months of neglect, and they provide no error messages on failure.

If we can write compact Parsec parsers, we’ll gain in readability, expressiveness, and error reporting. Our parsers won’t be as short as regular expressions, but they’ll be close enough to negate much of the temptation of regexps.

I couldn’t agree more. Regular expressions have two problems. The first is obvious: they look like line noise. Of course, you can argue that a compact representation is a boon if you are careful with your formatting. And you can use tricks like those I talked about in Make Regular Expressions Suck Less to make things more readable.

The more subtle problem is that regular expressions are exactly what they say: regular. If your parsing problem isn’t regular, you’ll spend a lot of time writing an expression with leaks that can never be fully plugged. And it’s hard to tell whether a problem is regular, so you should probably default to something more powerful for anything beyond truly simple cases. My favourite cases are CSV files and e-mail addresses. The full CSV specification is not regular; while e-mail addresses are regular, they require a monstrous regex to parse. Both should be specified with a more powerful parser.
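
To make the CSV case concrete, here is a minimal sketch in Python (my choice for illustration; the sample line and the names casual_split and parse_line are made up for this example). It only shows that a casual comma split mangles a quoted field that a real CSV parser handles correctly. It makes no claim about which formats are regular.

```python
import csv
import io
import re

# Hypothetical sample line: the second field contains a comma and the
# third contains doubled quotes, both legal inside quoted CSV fields.
LINE = 'Sanders,"Parsing, casually","He said ""no"""'


def casual_split(line):
    """Casual approach: split on every comma, ignoring quoting."""
    return re.split(r",", line)


def parse_line(line):
    """Parser approach: let the standard csv module handle quoting."""
    return next(csv.reader(io.StringIO(line)))


casual = casual_split(LINE)
parsed = parse_line(LINE)

print(casual)  # four pieces; the quoted field is cut in half
print(parsed)  # ['Sanders', 'Parsing, casually', 'He said "no"']
```

You can keep patching the regex to cope with quotes, then doubled quotes, then embedded newlines, but each patch is another leak plugged by hand. The parser already knows the rules.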

Regexes are fine for use with grep and editor search/replace, or for one-off scripts whose output will be inspected by a human. But you should be uneasy if you start using them in the middle of permanent software*. Their regularity means they are not powerful enough for most parsing, and the line-noise property makes them unsuitable for the remaining regular cases that run too long.

To conclude, I wanted to put a rip-roaring recommendation of Parsec or at least some parsing package in here (every language has one). Unfortunately, I’m just learning Parsec myself, and having a bit of trouble. Parsing is just hard. No two ways around it. But the safest way is to do something well or not at all. I usually take the second route but I need to know how to take the first when it’s really necessary**.

Note: Real World Haskell uses the phrase “in many popular languages” all over the book when they want to refer to some language with a bad cultural habit without naming names. The language they’re talking about here is Perl. Perl 6 (may it soon be released) notably fixes the regex problem by introducing grammars consisting of rules.

*You can make an exception for heuristics, which you know to be lossy, although in this case you still might want a context-free parser working for you.
**Rolling your own parser from scratch, while fun, does not count.

7 comments

Comment from: Seth Gordon [Visitor] · http://ropine.com/yesh/
"Note: Real World Haskell uses the phrase “in many popular languages” all over the book when they want to refer to some language with a bad cultural habit without naming names. The language they’re talking about here is Perl."
I disagree; I now work in a Python shop and I've encountered quite a bit of Python code that sends a regexp to do a parser's job.

Also, these non-parsers are typically trying to handle HTML, XML, or some other SGML derivative. Even in Perl, open-source parsers for these languages have been available since way back.

15/01/09 @ 14:36
Comment from: sandersn [Member]
@Seth
Good point about SGML derivatives. I forget that in the Real World people switched over to XML faster than they have in academia. I guess it's even more annoying to have to read the millionth half-baked tag parser instead of a half-baked parser for every half-baked format.
15/01/09 @ 15:24
Comment from: Kaleb [Visitor] · http://kalebcaptain.com
I did refer to this post on my blog today - hehe!
15/01/09 @ 17:11
Comment from: Mike K [Visitor]
You seem to be saying that regexes are evil and have no place in regular code - I couldn't disagree more. Regexes are a huge improvement over what they replace, which is programmatically parsing with C functions like "substr", "strchr"... One regular expression can take the place of a paragraph of C code. If I'm maintaining an application I know which I'd rather deal with.

Re your other points:
  • The "line noise" argument is silly: Every programming language I know of uses non-alphanumeric characters for the start and end of a function, indexes, parameters, etc. If those aren't a problem, why the fear of the same characters in a regular expression?

  • Monstrous regexes: Regexes do some things very well but they have limits and everyone knows that. Most of the time you can get the job done with a series of regexes joined up with computational logic - there's no need ever to write a colossal regex

  • There will always be people who write big bad regexes just to show off, but you get bad behaviour patterns in every language: Some Perl developers think an app isn't an app until it pokes around with the symbol table, some Java developers won't write anything that doesn't go through 20 layers of abstraction - you get that everywhere

  • RFC822, CSV, other common problems: Any sane developer will check to see if someone else has solved these problems before writing bad regexes to solve them

  • YACC: I'm sure it's more powerful, but look at the manual size (gnu bison vs perl regex doc) - look at the number of pages you'll have to read before you can get the job done - regexes will win for most of the smaller tasks

  • Perl6 grammar: Looks lovely, but perl6 isn't out yet, so irrelevant for anyone who's got to deliver software this year



Regular expressions may get a bad reputation because of regex abuse, but show me a language feature that hasn't been abused by someone, somewhere.
15/01/09 @ 17:57
Comment from: Rose [Visitor] · http://viewthesource.org
I'm going to side with Sanders on this one. The "line noise" point is valid because the non-alphanumeric characters for the start and end of a function and such are the foundations of that language, while regular expressions look nothing like the rest of that language. They're often like a language of their own.
The trick is, unless the programmer uses them in a single language enough, he'll forget aspects of them, though he knows the rest of the language well. Yes, I know it's possible to learn regexes, but as I said, they're their own language. If you intend to write code that someone else will understand later, simpler is better.
The irony here is that my argument against regexes is similar in principle to my argument against Sanders's favorite style of programming - making tons and tons of utility functions on top of each other, which makes code less readable as well. He's basically creating his own version of regex.
Heh, now that I've disagreed with both a commenter and the author, who else can I tease? Eh, why not Kaleb! Hey Kaleb, I understood his post and you didn't. haha j/k
15/01/09 @ 18:56
Comment from: Phil! Gold [Visitor] · http://aperiodic.net/phil/
"while e-mail addresses are regular, they require a monstrous regex to parse"

Just a pedantic note: email addresses are not regular, because they can contain arbitrary levels of nested comments. The Mail::RFC822::Address module addresses this by stripping out the comments before matching with its regular expression.

16/01/09 @ 08:10
Comment from: Brennen [Visitor] · http://p1k3.com/
"Many languages" here could also reasonably include JavaScript, where people do remarkable amounts of ill-advised input validation with poorly understood regexen, and PHP, where people do remarkable amounts of ill-advised everything.
16/01/09 @ 11:49


Nathan Sanders : Journal
