| « Conveniently pretty-printing in Haskell | A Parser For You » |
Before starting my dissertation research in earnest, I need to do some analysis of my previous experiment. Specifically, I need to identify the most important features that my classifier used to tell London and Scotland apart.
My initial guess, that double modals (eg “might could") played a role, went down in flames as I didn’t find any double modals in the corpus. So I reversed my approach late last week and decided to have the main program spit out debugging information. For each of its trials (100), it prints out the 5 most important features, where “most important” is defined as having the largest absolute value.
And that was about all the C++ I could take. 20 lines suffices for about a month, personally. So I saved the output and decided to process it in another language.
This was the perfect opportunity to learn Parsec. Well, it *might* have been the perfect opportunity to just chuck it and use regular expressions in Python*, but I’d rather learn Parsec.
Each line of the input file has 5 feature/value pairs. The feature and value are separated by a space and the pairs are separated by tabs. My script needs to read in all pairs from the file, group the values for each feature together, and average them. Pretty simple.
Here’s the parsing part, featuring Parsec (imports not shown because they are Boring):
parseLines = parse line "parse error"
line = pair `endBy` (char '\t')
pair = do code <- many1 (noneOf " \t")
char ' '
value <- number
return (code, value)
number = do s <- getInput
case readSigned readFloat s of
[(n, s')] -> setInput s' >> return n
_ -> mzero
parseLines is just a wrapper for the parsing function, which takes a parser and an error message to return. line is the top-level parser that runs on each line: a line is a series of pairs, each ended by a tab. Notice that Parsec doesn’t allow me to use bare chars or strings. I have to say char ‘\t’.
pair is written in a more imperative style. I don’t use do much with the IO monad, but Parsec introduces its own monad which overloads the syntax inside do. Here, var <- parser save a result produced by parser as var, while a bare parser parses something but doesn’t save the result. For example, I save the result of many1 (noneOf ” \t") but not char ‘ ‘ because I don’t care about the space. Incidentally, many1 (noneOf ” \t") is equivalent to the regular expression [^ \t]+. (I never claimed a real parser was shorter than regexen.)
Of course there are ways to make the structure of the parser shorter (though I don’t know of a way to shorten the low-level lexical specification; this may be why most parsing packages use regexes for lexical recognition). This is Haskell, a big language, so there is more than one way to do it. Specifically, do is just syntactic sugar for a confusing array of functions. For example, I could merge the skipped space into the next line with >>: value <- char ‘ ‘ >> number. >> is sort of the functional equivalent of C’s semi-colon: do this first thing, discard the result, and then return this second thing.
In fact I could collapse the whole thing into one line: liftM2 (,) (many1 (noneOf ” \t")) (char ‘ ‘ >> number). But this requires you to recognise both liftM2 and the prefix syntax for creating pairs (,). For now I prefer the less compact do notation. That will probably change as I get better at writing grammars.
Finally, number is just something I stole from Real World Haskell. It shows how to hook directly into the parser by calling getInput and passing the resulting (lazy) string to your hand-rolled parsing functions. Or, in this case, the built-in Haskell parsing functions readSigned and readFloat. This is not something I could have written on my own; I would have parsed the numbers as strings and then mapped readSigned readFloat over the output of the parser instead.
The final result of parseLines is [(String, Double)]; though it actually returns Either ParseError [(String,Double)], I strip out all the errors immediately, confidently assuming there will be none. I then process the list of pairs in The Usual Way:
main = print . map avg . cluster . map parseLines . lines =<< readFile "file.txt"
where cluster = histogram (==) fst . concat . rights
avg = liftPair (decode . fst . head) (average . map snd)
average l = sum l / fromIntegral (length l)
histogram p key = groupBy (p `on` key) . sortBy (comparing key)
decode = foldr (\ c total -> 223 * total + (ord c - 32)) 0
liftPair f g a = (f a, g a)
average l = sum l / fromIntegral (length l)
I give the rest of my code not to explain it but to let the Haskell community come through and suggest improvements. They’ve been quite amazing so far. By the way, I know my average function is terribly inefficient. I need to start writing a utility library for Haskell so I can hide away efficient but ugly implementations like that.
Incidentally, I couldn’t find a way to pretty print in Haskell. Am I totally missing something?
*Though the core of this experiment is written in C++, the surrounding program is Python, so it would have entailed a bit less juggling than a Haskell script. I still feed some of Haskell’s output into Python. At least the list/string/number syntax is the same. This suggests another way to have approached this parsing problem: have C++ emit the data in Haskell syntax and just use read =<< readFile. That’s what I always did with Scheme.
prettyPrint :: Pretty a => a -> String