« Final Fantasy 12 ReviewI finally understand Haskell's List monad »

Real World Haskell

08/09/08

Permalink 10:51:34 am, 1648 words
Categories: Code

Real World Haskell

O’Reilly is about to publish a Haskell book called Real World Haskell. Now that Haskell has an O’Reilly book, it’s finally a possible 3rd-tier language. Fortunately, the text is online and it looks as if it will remain there, so you can read it there if you’re cheap. In any case the book hasn’t been released yet so right now it’s the only way. I’ve read the first half—it’s good, although the explanations of esoteric things are confusing. That’s OK, since it addresses a hole in the Haskell literature, and you can read about monads or arrows elsewhere. It is eminently useful for learning to write simple real programs.

This is fortuitous because I recently decided to try Haskell once again for real development. I wrote my cute little tree kernel in Haskell and the type errors weren’t too painful. And I have been looking for a replacement for Python 3.0, which is a revision of the language that will put a stop to a few of the tricks I use to do functional programming there.

The fact is that functional programming is not welcome in Python because Guido believes it is unreadable, or perhaps for some other reason; all I am sure of is that he wants to restrict its use. I believe that he has been experimenting throughout the 2.x series to find the good parts of functional programming—the Pythonic parts. 3.0 attempts to sweep out the confusing and easily abused parts.

For example, it’s a relatively minor loss, but tuple unpacking (aka list destructuring aka pattern matching) is no longer available on argument lists. You are restricted to assignment (x,y = (1,2)) and iteration (for x,y in zip(xs,ys): …). But tuple unpacking is useful for simple data structures. It’s especially useful for prototyping because you aren’t required to create a lot of class machinery to get most of the benefits of a struct. Example:

class BasicallyAStruct():
  def __init__(self, field1, field2, field3):
    this.field1 = field1
    this.field2 = field2
    this.field3 = field3
x = BasicallyAStruct(1,2,3)
print (lambda bas: bas.field1)(x)
# vs
x = (1,2,3)
print (lambda (field1,field2,field3): field1)(x)

More sweepingly, list comprehensions have proved themselves and been expanded to include generator comprehensions. Iterators will be everywhere in 3.0. Unfortunately, these two changes restrict functional programming to specific contexts; you must advertise that you are doing functional programming by sticking your code inside a comprehension. It’s almost the same attitude that Haskell takes toward I/O. And since iterators are use-once, even assignment isn’t safe for an iterator: you can’t assign an iterator to a variable and use it twice, because it will be used up the second time. So iterators are only *supposed* to appear in syntactic iteration contexts—comprehensions and for loops. No passing them around because you can lose track of whether they have been used up yet.

This is fine; Python aims to be a simple language with only one obvious way to do things, and as the 2.x series marches on, this has become more and more of a pretence. 3.0 tidies the language in accordance with its philosophy. But it’s not for me—I want a functional language, and Python’s remaining attraction is that it’s good with text. So I am on the move, or will be in a couple of years when 3.0 is the principal option available for Python programmers. And by ‘on the move’, I mean that I need a different language for prototyping. That means a functional language that’s decent at text manipulation. Since I have been writing types on a lot of my formerly-prototype code anyway, I thought I would give Haskell another go to see if I could stomach Debugging By Type Error instead of Debugging By Exception.

So I started by translating my phonology distance code to Haskell. It was already functional code, with a single read of a CSV at the beginning, so I figured it would be a good place to start. I also learned how to use Haskell’s package manager to download a CSV library. It was easy, although I had to install it first—it didn’t come with my distro.

The result was fairly good—90 lines compared to 110, and a lot of the utilities I defined, like takeWhile and curry, are built in. I’m still getting used to debugging before running the program, but I think I will eventually get into it.

So I decided to see if Haskell could compete with Python on its home ground of text processing. As an example, I used this problem: I wanted to find out which imports I relied on most in Python so that I would know which modules I would have to replace first. Pretty simple stuff, probably 3 lines in Perl. Here is the Haskell code:

import qualified Data.Map as Map
import Data.List (isPrefixOf, sortBy)
import Data.Ord (comparing)
import Text.Regex
--- utils ---
(&) = flip (.)
count xs = Map.fromListWith (+) (zip xs (repeat 1))
--- code ---
main = readFile "python_source.txt" >>= lines & readFiles >>= combine & output
readFiles = mapM (\ f -> readFile f >>= lines & parse & return)
parse = count . concatMap extract
extract line
  | "import" `isPrefixOf` line = [drop (length "import ") line]
  | otherwise =
      format (matchRegex (mkRegex "^from ([a-zA-Z]+) import (.+)$") line)
format (Just [mod,imports]) =
  [mod++"."++name | name <- splitRegex (mkRegex ",") imports]
format _ = []
combine = Map.unionsWith (+) & Map.toList & sortBy (comparing snd)
output = mapM_ (\ (name,n) -> putStrLn (name++":\t"++show n))

In case you’re not familiar with Haskell, here are some notes. I defined the operator & to mimic piping in the shell (| is already taken by the language). Compose (.) gets hard to read when you compose more than three things because you have to read the code backward. count is the classic count-list-items-into-a-map code. I do this by pairing every item in the list with the initial value 1 and then telling the Map constructor to combine duplicates with (+).

main is just a list of actions to take in order, feeding the output of one into the next. The only wrinkle is that because of Haskell’s separation of I/O, you have to use >>=, not &, to pipe functions that read from disk. In fact, that’s a dramatic oversimplification—I bet combining I/O types with the rest of the system is one of the biggest obstacles to getting people running with Haskell. Anyway, main just reads the file “python_source.txt", splits it into its constituent lines, reads each file in that list, then combines the resulting maps and then outputs them (somewhat) nicely. The rest of the code is not too surprising, I hope, except for the use of mapM instead of map. That’s needed because readFiles calls readFile and Haskell requires all I/O to be marked with a wrapper type. mapM is a helper that manages wrapping for map (sort of). Like I said, not at all obvious.

The thing that bothers me the most about Real World programming in Haskell right now is the reliance on tree maps and linked lists as the basic data structures. The set of efficient operations is smaller and as a result you have to be more clever more often when writing code. For example, count above is O (n log n), while the Python equivalent is wordier but only O(n) because hash maps are the basis for dictionaries. In any case, I am pretty sure that random-access strings, arrays and hash maps are available, but they are not the first choice of implementation.

For comparison, here is the Python code:

from util.lst import mapn, fst, snd
from util import dct
import re
### code ###
def readFiles(lines):
    return [parse(open(fname.strip())) for fname in lines]
def parse(lines):
    return dct.count(mapn(extract, lines))
def extract(line):
    if line.startswith('import'):
        return [line[7:].strip()]
    else:
        return format(re.match("^from ([a-zA-Z]+) import (.+)$", line))
def format(match):
    if match:
        return ["%s.%s" % (match.group(1), name)
                for name in match.group(2).strip().split(',')]
    else:
        return []
def combine(ds):
    return dct.map(sum, dct.zip(default=0, *ds))
def output(d):
    return '\n'.join('%s:\t%s' % pair for pair in sorted(d.items(), key=snd))
print output(combine(readFiles(open('python_source.txt'))))

This is arguably easier to read for the average programmer, at least one who has gone to read the documentation on my util.dct module, but the important part is that for prototyping, I don’t care. I’m the only one using my code. Sure, I’m a “different person” a year later, but that doesn’t mean my knowledge of Haskell melts away. If I have to change something in a year, I really want the ability to say :browse Sed to have the language tell me the types. I should know—this is what I missed about Python a couple of months ago.

In case you are interested, here are the results of the script for modules I used 10 or more times. I have no idea how much code this represents; it’s all python files inside ~/Documents and ~/src.

  1. operator: 38
  2. sys: 32
  3. util.dct: 32
  4. os: 31
  5. unittest: 28
  6. util.text: 24
  7. htmlforms: 22
  8. compat: 21
  9. itertools: 21
  10. util.lst: 20
  11. re: 19
  12. util.data_struct: 18
  13. config: 16
  14. util.web: 16
  15. cgi: 15
  16. util.fnc: 15
  17. unifeat: 15
  18. common: 14
  19. string: 12
  20. utils: 12
  21. random: 10

Notice that the number one import was operator, which allows you to pass operators as higher-order functions. Built-in to Haskell. The first module with code that I wrote is util.dct, not too surprising considering the prevalence of dictionaries in Python and their lack of a functional interface. A large body of this code is a 4-year-old web app, so that explains the prevalence of things like htmlforms and util.web. Also I use re a lot for not liking it very much.

And now for something completely related: trimming a string in Haskell and XSLT, both purely functional languages. I prefer the verbose, slow Haskell version, myself.

13 comments

Comment from: Rodrigo Queiro [Visitor] Email
Why not use a framework that has been designed from the ground up for string manipulation and string manipulation alone?

tr '\n' '\0' < python_source .txt | xargs -0 cat | egrep "^(from [a-zA-Z]+ )?import " | sed "/^from \([a-zA-Z]*\) import \([^,]*\),\?\(.*\)/{s//\1.\2\n\1:\3/;:l;P;s/.*\n//;tx;:x;s/^\([^:]*\):\([^,]\+\),\?/\1.\2\n\1:/;tl;d;};s/^import //" | sort | uniq -c | sort -g
08/09/08 @ 14:24
Comment from: sandersn [Member]
Why not? Because most of my work doesn't involve string manipulation. So I don't need a concise string DSL like bash+sed, just something that's reasonably convenient.

This simple string processing task is actually even more complicated than my normal needs--most of the time I just have to read in a file line-by-line and then process the strings a little so they can be treated like symbols. That's why I figured it would be a good comparison between Haskell and Python.
08/09/08 @ 17:45
Comment from: Name [Visitor] Email
I've never been fond of tuple unpacking in parameter lists, so perhaps I'm biased, but the "named tuple" type that has been adopted into the stdlib in 2.6 and 3.0 seems an excellent replacement. Your code would be written as:

MyStruct = namedtuple ('MyStruct', 'field1 field2 field3')

x = MyStruct(1,2,3)
print (lambda bas: bas.field1)(x)


----

Iterables are not one-use; that's a specific type of iterable, the generator. Generators are designed to cleanly wrap operations that are 1) non-repeatable or 2) infinite. Functional programming is easily performed on any iterable, even generators, and the "itertools" module is growing more functions in 3.0 to help.

----

I do not understand your statement that "Unfortunately, these two changes restrict functional programming to specific contexts; you must advertise that you are doing functional programming by sticking your code inside a comprehension". There are no functional features being removed -- the closest thing is moving "reduce" into a module.
08/09/08 @ 18:46
Comment from: unm [Visitor] Email
what was the performance difference?
08/09/08 @ 20:05
Comment from: javier lasheras [Visitor] Email
Your code is interesting. I tried do with python ast trees:

lasi@truenadell:/tmp$ cat visitor.py
from compiler import walk, parseFile

class ImportVisitor:
  def visitImport(self, t):
    for n in t.names:
      print n[0]
  def visitFrom(self, t):
    print t.modname

if __name__ == '__main__':
  import sys
  for i in sys.argv[1:]:
    walk(parseFile(i), ImportVisitor())
lasi@truenadell:/tmp$ python visitor.py visitor.py
compiler
sys
lasi@truenadell:/tmp$

I like functional programming too, but in this case OOP isn't bad.
09/09/08 @ 08:05
Comment from: sandersn [Member]
@unm: interpreted Haskell was noticeably slower. I'm pretty sure it was the tree maps doing it, so I doubt compiling would speed things up much.

@name:
(1)
Thanks, I'll take a look at it. I'm not sure it fits all of my uses because accessing nested structures still loses:

lambda (_,(x,l)): [x + y for y in l]
vs
lambda o: [o.sub.x + y for y in o.sub.l]

OK, I guess it's not that bad.

(2)
>>> d = dict.fromkeys(range(4))
>>> import itertools
>>> def repetitious(it):
... for y in [1,2,3]:
... print zip(itertools.repeat(y), iter(it))
...
>>> x = d.iterkeys()
>>> repetitious(x)
[(1, 0), (1, 1), (1, 2), (1, 3)]
[]
[]
>>> repetitious(x)
[]
[]
[]
>>> list(x)
[]

So even if repetitious worked (probably by copying its iterator with list?), you still can't use it without knowing that it will use up your iterator.

(3)
To get around this, you have to write your own generator that will produce a new iterator each time either explicitly wrap your code in list (saying: "Here is functional code") or you can pass a new generator based on an original generator:

repetitious(k for k in d.iterkeys())
09/09/08 @ 08:27
Comment from: sandersn [Member]
@javier:
Very nice. Where did you learn how to use compiler.walk? The standard documentation is um, non-existent. I saw the package when they added it but never used it for that reason.
09/09/08 @ 08:37
Comment from: javier lasheras [Visitor] Email
In compiler.visitor package with pydoc you'll get a tiny help.

With google you can see examples like:
http://www.biais.org/blog/index.php/2007/01/10/9-visit-python-abstract-syntax-tree

;-)
09/09/08 @ 09:34
Comment from: Brandon Bickford [Visitor] Email
You should get a significant boost by compiling with ghc instead of running your program in the interpreter.
09/09/08 @ 10:42
Comment from: Don Stewart [Visitor] Email · http://www.cse.unsw.edu.au/~dons
I'd agree strongly with Brandom, ghci is on average 30x slower than ghc -O2. That's why we have a native code, optimising compiler for Haskell: to produce fast code.

Like this, http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=ghc&lang2=python
11/09/08 @ 23:37
Comment from: sandersn [Member]
@Brandon, dons

I tried
ghc -o pythonimports -O2 --make PythonImports.hs

It is much faster but still a little slower than Python. Here's the output from time:

python:
real 0m0.306s
user 0m0.281s
sys 0m0.025s

ghc:
real 0m0.758s
user 0m0.712s
sys 0m0.037s

runghc:
real 0m2.046s
user 0m1.694s
sys 0m0.921s

Anyway this is not a big deal for me because I don't really care about speed until I have to run a full experiment and I can optimise at that point. (Or rewrite in C, though I hope that happens less often for Haskell)

Aside: dons, if you are still reading, I was curious about your reddit comment about (&). Is there some reason to prefer (.) ? My reason for (&) is that I don't like reading long strings of composed functions--they read from right-to-left.
12/09/08 @ 07:56
Comment from: Tim MacEachern [Visitor] Email
I'm writing not about the content, which is great, but about the presentation. When I try to view this page in Firefox, for some reason I don't see all the Haskell code. In particular, the "& output" in the definition of main is not reproduced. Check the difference between the Firefox 3 display and the view source. I think this is just a line truncation, but for a Haskell newbie like me, it's a puzzle how output gets called.

In IE, the code is chopped up even more, but here the effect is more obvious so you know what's going on.

Anyway, thanks for the Haskell writing. It really is a language you have to absorb bit by bit.
29/01/09 @ 09:19
Comment from: sandersn [Member]
@Tim
Thanks for telling me. I will check it with Firefox. I think this theme is fixed width, unfortunately. I'll see what I can do to fix this, maybe making my pre tags scroll if they exceed a certain width.
29/01/09 @ 10:07

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
blog software