« Metric time of day display and conversionNovember LAN 2009 »

Data.List.Split in Haskell

08/12/09

Permalink 03:30:11 pm, 963 words
Categories: Code

Data.List.Split in Haskell

Data.List.Split is a nice example of how to engineer a library that’s easy to use but still hides a lot of complexity that’s there if you need it. Data.List.Split “contains a wide range of strategies for splitting lists with respect to some sort of delimiter, mostly implemented through a unified combinator interface". That sounds like something you’ll have to spend two hours reading up on before you can even understand how to compile the thing. Not so: unlike a lot of Haskell libraries, the theory doesn’t get in the way of the API, or the documentation for it.


Let me take a diversion for a minute to talk about good APIs. A good library API should have 3 layers.

Layer 1 is the simple 80% solution: 2 or 3 pre-made functions that do what you usually want with the library. If it’s a splitter, there should be one or two split functions. XML parser: a function that parses XML from a string and another that parses it from a file.

Layer 2 provides an element of customisation: from the pieces in layer 2 you can build a custom equivalent of the ready-made functions in layer 1, just by snapping together the provided pieces or maybe passing different parameters to a constructor.

Only in layer 3 do you have to understand the implementation of the library. That way you can fully replace or override library components with code should you need to sink your hooks deep into the library.

The first place I ran into APIs like this was Python; the Java APIs I’d used at the time didn’t have layer 1 and the Javadoc usually assumed you would always be mucking around in layer 3, so it started off with inheritance. Java was at the time so in love with deep OO hierarchies that library tutorials inevitably started by explaining which classes to override. It was a whole lot of backwork before you could make the XML parser you just installed actually parse.

Of course, this kind of API is common in Haskell, too, but instead of being stultifying, bad APIs in Haskell are simply inscrutable. Typically they refer repeatedly to the newish, esoteric style that the library is written in rather than trying to actually teach you how to use the library. Knowing that arrows make an XML parser easy to write and reuse doesn’t tell me how to make the library give me a function that’s of type String -> XML. And since I don’t know arrows, it means that I have to do a bunch of reading before I can use layer 2 OR 3 to build the layer 1 that should have been there in the first place.


But enough about bad APIs. Data.List.Split is a good API. Obviously, if your library is named Split, the main function should be named split. But immediately you have a problem: does split take one element to split on? A substring? Multiple single elements? In other words, what does this mean?

split " \t\n" "foo bar \t\nbaz\n"
["foo", "bar", "baz"]
-- OR
["foo bar", "baz\n"]

You want the first, all the way up until you want the second. In languages where split is built-in, the designer usually just chooses the most common one. Python, for example, gives you the second.*

Data.List.Split instead calls the first option splitOn and the second splitOneOf. It gives you both. It also has endBy and endByOneOf for things like line-ending semi-colons, plus a few higher-order versions. This is Haskell, after all. You gotta have the higher-order functions.

Of course this is semantics, so I skipped about a million possibilities. What happens if you have multiple delimiters in a row? Delimiters at the beginning or end? What if you want to keep the delimiters?

split " \t\n" "foo bar \t\nbaz\n"
["foo", "bar", "baz"]
-- OR
["foo bar", "baz\n"]
-- OR
["foo", "bar", "", "", "baz", ""]
-- OR
["foo", "bar", "", "", "baz"]
-- OR
["foo", " ", "bar", " ", "\t", "\n", "baz", "\n"]
-- OR
["foo", " ", "bar", " \t\n", "baz", "\n"]
-- OR
["foo", " ", "bar", " ", "\t", "\n", "baz"]
-- OR
["foo", " ", "bar", " \t\n", "baz"]
-- etc etc

So even if the library (or language) author has provided you with two or three built-ins, often you’ll run into a case where you’ll have to do some pre- or post-processing to get the right answer. Or, in the case of Data.List.Split, you could fall back to layer 2. For example, today I needed function a lot like splitOneOf, except that it didn’t produce empty strings for multiple consecutive delimiters, and it also trimmed off the delimiters from the front and back of the list. Here’s where the API documentation really shines. When I look up splitOneOf, it says that it’s equivalent to split . dropDelims . oneOf. So I did this (pardon the dumb name):

splitTrim = split $ dropInitBlank $ dropFinalBlank $ condense $ 
            dropDelims $ oneOf " \t\n"

condense stops split from producing empty strings between consecutive delimiters. Then dropInitBlank and dropFinalBlank make the trimming happen.

Then I read a little further in the documentation and saw that I could combine the first three into dropBlanks:

splitTrim = split $ dropBlanks $ dropDelims $ oneOf " \t\n"

So I got exactly the function I wanted just by sticking together existing pieces, ones that I didn’t have to write.

At this point, I haven’t needed to do anything with layer 3 of the Data.List.Split API. It looks like you can write your own splitting strategies and strategy transformers, and that a splitter object tracks policy for a number of properties. Honestly, I agree with the author who says, “If you find yourself wanting something more complicated or optimized, it probably means you should use a real parsing or regular expression library.” But, hey! It’s there if you want it!

*It gives you the first, for whitespace only, if you pass no arguments to split.

No feedback yet

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by b2evolution free blog software