
Tips for using the Berkeley parser

11/02/10

09:42:40 pm
Categories: Code, Linguistics

I am using the Berkeley parser for my dissertation because it appears to be the best PCFG parser for non-English languages, at least for the amount of work I am willing to put into training it.

But it has its quirks. Fortunately, it also has its source code, available on Google Code. So I was able to get past several problems that would otherwise have been mystifying. I’m going to write them down so I can remember them, and maybe if someday you, too, use the Berkeley parser, you will need to know these things.

Use pre-specified parts of speech

The parser does its own POS-tagging, but if you have your own tagger (I’m using TnT), you can probably get better results by feeding its tags to the parser. The flag for this is not documented in the README, only in the command line help: -useGoldPOS. Well, the documentation there is pretty sparse too: “Read data in CoNLL format, including gold part of speech tags.”
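For reference, here is a minimal Python sketch of how the invocation might be assembled. The jar name, grammar file name and input path are placeholders, not names from the parser's distribution, and the sketch assumes the parser takes its grammar via -gr and reads sentences from standard input; check the command line help of your version before relying on it.

```python
import subprocess


def build_parser_command(jar="berkeleyParser.jar", grammar="my_grammar.gr"):
    """Build the argument list for a parser run that uses pre-assigned tags.

    Both default file names are placeholders for this sketch.
    """
    return ["java", "-jar", jar, "-gr", grammar, "-useGoldPOS"]


def parse_tagged_file(tagged_path, jar="berkeleyParser.jar", grammar="my_grammar.gr"):
    """Pipe a word/POS file into the parser and return whatever it prints.

    Assumes the parser reads its input from stdin and writes trees to stdout.
    """
    cmd = build_parser_command(jar, grammar)
    with open(tagged_path, "rb") as tagged:
        result = subprocess.run(cmd, stdin=tagged, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8")
```

Calling parse_tagged_file("tagged.txt") would then run the parser on a hypothetical file of pre-tagged sentences.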

Transform your data to “Berkeley CoNLL format”

The CoNLL format is reasonably well specified, but it’s a good thing I read the input code for the Berkeley parser, because what the parser expects is not CoNLL. Its format is like CoNLL in that it’s tab-separated, has each word on its own line, and separates sentences with a blank line, but there are two important differences.

  1. Only two columns: word and POS. There are no ID or LEMMA columns (or CPOSTAG, or FEATS …).
  2. Every sentence, including the last one, must be followed by a blank line. In other words, the file must END with a blank line, not just have blank lines between sentences.

These two differences are totally undocumented as far as I know. But the code is pretty straightforward.
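The conversion can be sketched in a few lines of Python. This is an illustration, not code from the parser: the function name is made up, and the default column positions assume standard ten-column CoNLL-X input, where FORM is the second column and POSTAG the fifth. Adjust them if your files differ.

```python
def conll_to_berkeley(conll_text, form_col=1, pos_col=4):
    """Reduce CoNLL rows to word<TAB>POS, with a blank line after every sentence.

    form_col and pos_col are 0-based column indexes (assumed CoNLL-X layout).
    """
    out = []
    in_sentence = False
    for line in conll_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence (ignore repeated blanks).
            if in_sentence:
                out.append("")
                in_sentence = False
            continue
        cols = line.split("\t")
        out.append(cols[form_col] + "\t" + cols[pos_col])
        in_sentence = True
    if in_sentence:
        # The input did not end with a blank line, so add the one the parser needs.
        out.append("")
    return "\n".join(out) + "\n" if out else ""
```

Whether or not the input already ends with a blank line, the output always does.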

Include grammatical function

If your CoNLL “fine” tag set (as opposed to the coarse tag set) separates part of speech from grammatical function with a dash, like so: “NN-SUBJ”, you’ll have to change it, because the input reader takes only the string up to the first dash. So you’ll want something like “NN_SUBJ” instead.
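A small Python sketch of that rewrite, assuming the input is already in the two-column word/POS format. The function name is hypothetical, and the underscore is just one possible replacement character; only the tag column is touched, so hyphenated words stay as they are.

```python
def protect_function_tags(berkeley_text, replacement="_"):
    """Replace dashes in the POS column so the tag is not cut at the first dash."""
    out = []
    for line in berkeley_text.split("\n"):
        if "\t" in line:
            word, tag = line.split("\t", 1)
            line = word + "\t" + tag.replace("-", replacement)
        # Blank lines (sentence boundaries) pass through unchanged.
        out.append(line)
    return "\n".join(out)
```

If you need the original tags in the parser's output, the same replacement has to be undone afterwards, which is only safe if the replacement character does not otherwise occur in your tag set.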

1 comment

Comment from: mRbAdAw [Visitor]
I think this will be most useful for someone using the Berkeley parser for the first time.
11/04/10 @ 03:15
