« Faking type classes in C#Developing Fing, part 1: Parsec tokeniser »

Developing Fing, part 2: Parsec parser 2

11/04/10

Permalink 08:35:21 am, 885 words
Categories: Code

Developing Fing, part 2: Parsec parser 2

Last time I explained that I got stuck writing an F# type parser for Fing in FParsec. So I switched to Parsec. This time, I'll show the entire Parsec parser instead of just the tokeniser. Here's the tokeniser again (along with the imports I need for the rest of the code).

import Text.ParserCombinators.Parsec
import Control.Monad (sequence, msum, liftM, liftM2, when)
import Data.Either (rights)
import Data.List (intercalate)
import Data.Maybe (isJust)
tokeniser = choice [try (many1 (digit <|> letter <|> oneOf ".`_"))
                   , string "->"
                   , try (string ":>")
                   , (return . list =<< oneOf "*<>()[],'^#:?")
                   ]
tok t = try $ do
  spaces
  x <- tokeniser
  if x==t then return x else fail "Token not recognised"
toks = mapM tok
notTok ts = try $ do
  spaces
  x <- tokeniser
  if x `elem` ts then fail "Incorrect token" else return x

Read my previous post if you want an explanation of the tokeniser. Before I start, let me warn you that I'm going to explain the parser twice. This post covers the parser's input—the strings it accepts. The next post covers the parser's output—the structures it generates. So ignore anything that says 'return' on it (or liftM). I'll explain it later.

The reason I'm breaking it up like this is that I'm sure my parser works, but I'm not sure the type representation I'm building up is correct; I really need to try to fit some real .NET types into it to find out. And I can't do that until the FParsec version is done. So...take anything with 'return' on it with a grain of salt, even if you can't ignore it.

identP = notTok ["lazy", "with", "when", "if", "else", "do", "new", "get", "set"
                , "<", ">", "[", ",", "]", "(", ")", "#", ":", "?"
                , "->", "*", "'", "_", "^", ":>"]

First up is the basic identifier. I don't have all the F# keywords here yet, and I think I'm also missing some punctuation, but you get the idea. An identifier is a token that is not one of these things.

Here's an example of some things that can be parsed so far:

  • int
  • list
  • Microsoft.FSharp.Core.list`1
arrowP = liftM (toplevel Arrow) $ sepBy1 tupleP (tok "->")
tupleP = liftM (toplevel Tuple) $ sepBy1 typeP (tok "*")
typeP = do
  t <- termP
  option t (return . nestTypes t . reverse =<< many typeSuffixP)
termP = typevarP <|> between (tok "(") (tok ")") arrowP <|> longidentP

Identifiers are the heart of the parser, but the "main loop" is actually these four parsers: arrows, tuples, types and terms. For arrowP and tupleP, ignore what they return for now and focus on the part after the dollar sign. It's pretty simple: an arrow is a bunch of tuples separated by arrows and a tuple is a bunch of types separated by stars.

A type is a term followed by an optional type suffix. That's the Caml-style int list, plus some other things. There can be many of these, for example: int list list option ref.

A term is a type variable, a parenthesised arrow or a "long" identifier. That means an identifier, optionally followed by a .NET-style generic type. In other words, instead of int list (Caml-style), you write list<int> (.NET-style). I personally prefer the .NET style because I think it reads more naturally and explicitly, especially when you try to mentally parse int list list option ref versus ref<option<list<list<int>>>> when read aloud. (The first one requires you to build the mental structure bottom-up instead of top-down.)

longidentP = do
  id <- identP
  option (Id id) (postfixNameP (Id id))
postfixNameP id =
  liftM (Generic id) $ between (tok "<") (tok ">") (sepBy arrowP (tok ","))
typevarP = choice [anon, normal, structural]
  where anon = tok "_" >> return (Var Anonymous)
        normal = tok "'" >> liftM (Var . Normal) identP
        structural = tok "^" >> liftM (Var . Structural) identP

Type variables are pretty simple: either the anonymous _, the normal 'a or the structural ^a. I'm not up on the semantics of structural variables, so even though they're the most mysterious of the three, I can't explain them right now. Sorry.

So, given those seven parsers, here are some things that they can parse:

  • int->int
  • int->int->int
  • int*int
  • int->int*int->int
  • int*int->int*int
  • (((int)))
  • list
  • list<>
  • list<int>
  • list<int->int>
  • list<int*int>
  • list<int*(int->int*int)>
  • list<list<int>>
  • int list list
  • 'a->'a
  • 'a*'a
  • _*^a
  • list<int,'a,_>

That's enough to parse most types, so if you stop reading here, you won't miss much. Mainly what's left is more .NET compatibility (arrays) and advanced F# features (constraints). Constraints are complicated to parse, so they take about 3 times more code than the rest of the parser put together.

In fact, because I'm short on time, I'm going to break the post in two so that at least I'll have something to post today. Here's the starting point for next time: the type suffix parser which I left undefined above.

typeSuffixP = choice [arrayP <?> "an array type"
                      , tok "lazy" >> return (Generic (Id "lazy") . list)
                      , tok "when" >> constraintP
                      , (do id <- identP; return (Generic (Id id) . list))
                                  <?> "a stupid suffix type"]

No feedback yet

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
open source blog tool