« Conveniently pretty-printing in HaskellA Parser For You »

Real World Parsec

20/01/09

Permalink 01:10:30 pm, 927 words
Categories: Code, Linguistics

Real World Parsec

Before starting my dissertation research in earnest, I need to do some analysis of my previous experiment. Specifically, I need to identify the most important features that my classifier used to tell London and Scotland apart.

My initial guess, that double modals (eg “might could") played a role, went down in flames as I didn’t find any double modals in the corpus. So I reversed my approach late last week and decided to have the main program spit out debugging information. For each of its trials (100), it prints out the 5 most important features, where “most important” is defined as having the largest absolute value.

And that was about all the C++ I could take. 20 lines suffices for about a month, personally. So I saved the output and decided to process it in another language.

This was the perfect opportunity to learn Parsec. Well, it *might* have been the perfect opportunity to just chuck it and use regular expressions in Python*, but I’d rather learn Parsec.

Each line of the input file has 5 feature/value pairs. The feature and value are separated by a space and the pairs are separated by tabs. My script needs to read in all pairs from the file, group the values for each feature together, and average them. Pretty simple.

Here’s the parsing part, featuring Parsec (imports not shown because they are Boring):

parseLines = parse line "parse error"
line = pair `endBy` (char '\t')
pair = do code <- many1 (noneOf " \t")
          char ' '
          value <- number
          return (code, value)
number = do s <- getInput
            case readSigned readFloat s of
              [(n, s')] -> setInput s' >> return n
              _ -> mzero

parseLines is just a wrapper for the parsing function, which takes a parser and an error message to return. line is the top-level parser that runs on each line: a line is a series of pairs, each ended by a tab. Notice that Parsec doesn’t allow me to use bare chars or strings. I have to say char ‘\t’.

pair is written in a more imperative style. I don’t use do much with the IO monad, but Parsec introduces its own monad which overloads the syntax inside do. Here, var <- parser save a result produced by parser as var, while a bare parser parses something but doesn’t save the result. For example, I save the result of many1 (noneOf ” \t") but not char ‘ ‘ because I don’t care about the space. Incidentally, many1 (noneOf ” \t") is equivalent to the regular expression [^ \t]+. (I never claimed a real parser was shorter than regexen.)

Of course there are ways to make the structure of the parser shorter (though I don’t know of a way to shorten the low-level lexical specification; this may be why most parsing packages use regexes for lexical recognition). This is Haskell, a big language, so there is more than one way to do it. Specifically, do is just syntactic sugar for a confusing array of functions. For example, I could merge the skipped space into the next line with >>: value <- char ‘ ‘ >> number. >> is sort of the functional equivalent of C’s semi-colon: do this first thing, discard the result, and then return this second thing.

In fact I could collapse the whole thing into one line: liftM2 (,) (many1 (noneOf ” \t")) (char ‘ ‘ >> number). But this requires you to recognise both liftM2 and the prefix syntax for creating pairs (,). For now I prefer the less compact do notation. That will probably change as I get better at writing grammars.

Finally, number is just something I stole from Real World Haskell. It shows how to hook directly into the parser by calling getInput and passing the resulting (lazy) string to your hand-rolled parsing functions. Or, in this case, the built-in Haskell parsing functions readSigned and readFloat. This is not something I could have written on my own; I would have parsed the numbers as strings and then mapped readSigned readFloat over the output of the parser instead.

The final result of parseLines is [(String, Double)]; though it actually returns Either ParseError [(String,Double)], I strip out all the errors immediately, confidently assuming there will be none. I then process the list of pairs in The Usual Way:

main = print . map avg . cluster . map parseLines . lines =<< readFile "file.txt"
    where cluster = histogram (==) fst . concat . rights
          avg = liftPair (decode . fst . head) (average . map snd)
average l = sum l / fromIntegral (length l)
histogram p key = groupBy (p `on` key) . sortBy (comparing key)
decode = foldr (\ c total -> 223 * total + (ord c - 32)) 0
liftPair f g a = (f a, g a)
average l = sum l / fromIntegral (length l)

I give the rest of my code not to explain it but to let the Haskell community come through and suggest improvements. They’ve been quite amazing so far. By the way, I know my average function is terribly inefficient. I need to start writing a utility library for Haskell so I can hide away efficient but ugly implementations like that.

Incidentally, I couldn’t find a way to pretty print in Haskell. Am I totally missing something?

*Though the core of this experiment is written in C++, the surrounding program is Python, so it would have entailed a bit less juggling than a Haskell script. I still feed some of Haskell’s output into Python. At least the list/string/number syntax is the same. This suggests another way to have approached this parsing problem: have C++ emit the data in Haskell syntax and just use read =<< readFile. That’s what I always did with Scheme.

4 comments

Comment from: Crutcher [Visitor] Email
http://haskell.org/ghc/docs/latest/html/libraries/pretty/Text-PrettyPrint-HughesPJ.html

This is the semi-standard pretty print library. Its like parsec, but backwards.
21/01/09 @ 01:32
Comment from: ADEpt [Visitor] Email
There is also Text.Printf
21/01/09 @ 09:22
Comment from: sandersn [Member]
Thanks, Crutcher. I saw it in the library documentation, but the interface was not at all what I was expecting. I was hoping for something like

prettyPrint :: Pretty a => a -> String

with defined instances of Pretty for all the common types. (or with return type IO () maybe)

I guess I just need to write some wrapper code to do that.
21/01/09 @ 14:43
Text.Printf will do the trick!
12/09/09 @ 08:26

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
blog tool