I am using the Berkeley parser for my dissertation because it appears to be the best PCFG parser for non-English languages, at least for the amount of work I am willing to put into training it.
But it has its quirks. Fortunately, it also has its source code, available on Google Code. So I was able to get past several problems that would otherwise have been mystifying. I’m going to write them down so I can remember them, and maybe if someday you, too, use the Berkeley parser, you will need to know these things.
The parser does its own POS-tagging, but if you have your own tagger (I’m using T’n’T), you can probably get better results from it. The flag for this is not documented in the README, only in the command line help: -useGoldPOS. Well, the documentation there is pretty sparse too: “Read data in CoNLL format, including gold part of speech tags.”
The CoNLL format is reasonably well specified, but it’s a good thing I read the input code for the Berkeley parser, because this is not it. The format is like CoNLL, in that it’s tab-separated, has each word on its own line, and separates sentences with a blank line, but there are two important differences.
These two differences are totally undocumented as far as I know. But the code is pretty straightforward.
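For reference, here is a minimal Python sketch of reading the parts of the format that do match CoNLL: tab-separated fields, one word per line, and a blank line between sentences. The function name `read_sentences` is my own invention, and the sketch deliberately assumes nothing about which columns the Berkeley parser actually expects.

```python
def read_sentences(lines):
    """Group tab-separated token lines into sentences.

    Each sentence is a list of tokens, and each token is the list of
    its tab-separated fields. A blank line ends the current sentence.
    """
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the sentence, if one is open.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    # The last sentence may not be followed by a blank line.
    if current:
        sentences.append(current)
    return sentences
```

It takes any iterable of lines, so an open file handle works as well as a list of strings.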
If your CoNLL “fine” tag set (as opposed to the coarse tag set) separates part of speech from grammatical function with a dash, like so: “NN-SUBJ”, you’ll have to change it, because the input reader takes only the string up to the first dash. So you’ll want something like “NN_SUBJ” instead.
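The dash-to-underscore rewrite is easy to script. This is a rough Python sketch, not anything from the parser itself: `underscore_tags` is a made-up name, and `tag_column` is an assumption you have to supply, since which column holds the fine tag depends on your own files.

```python
def underscore_tags(lines, tag_column):
    """Yield lines with dashes in one tab-separated column replaced by underscores.

    tag_column is the zero-based index of the fine tag column; this is
    an assumption about your data, not something the parser dictates.
    Blank lines (sentence boundaries) are passed through as empty strings.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip():
            yield ""
            continue
        fields = stripped.split("\t")
        if tag_column < len(fields):
            # "NN-SUBJ" becomes "NN_SUBJ"; other columns are left untouched.
            fields[tag_column] = fields[tag_column].replace("-", "_")
        yield "\t".join(fields)
```

With the tag in the second column, `list(underscore_tags(["Hund\tNN-SUBJ\n"], 1))` gives `["Hund\tNN_SUBJ"]`. Only the chosen column is rewritten, so a hyphenated word form in another column keeps its dash.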