« Real World ParsecAvoid Casual Parsing »

A Parser For You

19/01/09

Permalink 01:00:32 pm, 518 words
Categories: Code

A Parser For You

So cool a parser, I thought you might enjoy it!

I mentioned Yacc last time, but it’s only one of the parsers that people use today. (Today, in fact, you’re more likely to use Bison, the GNU equivalent.) Yacc lets you write a grammar that looks like this:

%{
    int yylex(void);
    void yyerror(char *);
%}


%token INTEGER

%%

program:
        program expr '\n'         { printf("%d\n", $2); }
        | 
        ;

expr:
        INTEGER                   { $$ = $1; }
        | expr '+' expr           { $$ = $1 + $3; }
        | expr '-' expr           { $$ = $1 - $3; }
        ;

%%

void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}

int main(void) {
    yyparse();
    return 0;
}

It works a bit like ASP/JSP/etc: the body of the code is a series of rules with associated C blocks that have special variables bound like $$ and $1. Yacc translates the rules into C code and you can compile the resulting program. This example is taken from A Compact Guide to Lex & Yacc. You’ll also need to use Lex (or Flex) to write a lexer that recognises INTEGERs, which is given in the Compact Guide.

In the grammar above, lex looks for INTEGER tokens using a simple expression: [0-9]+. That’s all lex is, really, a series of regular expressions that describe what the individual tokens of your language look like. Lest you chortle at the use of regexes, I would like to point out that lex encourages you to break them up and reuse abstractions, so it works around the line-noise problem at least.

If you’re not writing a lot of C these days (and who is?*), then Wikipedia has a list of Yacc clones for other languages. But you might want something a bit more modern. For Java, ANTLR is the most commonly used parser. I’ve never used Antlr, so I had a preconceived idea that grammars were written in XML. Not so:

grammar Expr;

@header {
import java.util.HashMap;
}

@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}

prog:   stat+ ;
                
stat:   expr NEWLINE {System.out.println($expr.value);}
    |   ID '=' expr NEWLINE
        {memory.put($ID.text, new Integer($expr.value));}
    |   NEWLINE
    ;
/** Skip down to the lexing bit ... (Lexing is not separate as with Yacc) */
ID  :   ('a'..'z'|'A'..'Z')+ ;
INT :   '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;

(Full grammar here) Pretty similar to Yacc+Lex actually. I was pleasantly surprised. Anltr can generate programs in C, C++, Java, Python and C#, but is most closely associated with the Java world.

Aside from the Big Two languages, most languages have a yacc-like library. I’ll leave it you to find one for your language if you want it.

I was going to write about parsing DSLs in the trendier/smarter languages, but I can’t find anything in Ruby, Python or Javascript that has taken over like Parsec has for Haskell. There are a dozen or so libraries that look good but none that have the unanimous support of the community, and most are just modernised equivalents of yacc.

Oh, well. If you know something I don’t, say so in the comments. Next time I’ll present a real problem that I solved using Parsec. It’s a regular problem, but I really do hate regexes enough to learn a parsing library in order to avoid them forever.

2 comments

Comment from: Allen Short [Visitor] Email · http://washort.twistedmatrix.com/
Cheap plug: Check out PyMeta, an OMeta implementation in Python. :)

http://launchpad.net/pymeta
19/01/09 @ 14:02
Comment from: madMilo [Visitor] Email
Check Treetop for a nice ruby parser generator based on PEG.

http://treetop.rubyforge.org/index.html
19/01/09 @ 16:44

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)

Nathan Sanders : Journal

powered by b2evolution