| « Final Fantasy 12 Review | I finally understand Haskell's List monad » |
O’Reilly is about to publish a Haskell book called Real World Haskell. Now that Haskell has an O’Reilly book, it’s finally a possible 3rd-tier language. Fortunately, the text is online and it looks as if it will remain there, so you can read it there if you’re cheap. In any case the book hasn’t been released yet so right now it’s the only way. I’ve read the first half—it’s good, although the explanations of esoteric things are confusing. That’s OK, since it addresses a hole in the Haskell literature, and you can read about monads or arrows elsewhere. It is eminently useful for learning to write simple real programs.
This is fortuitous because I recently decided to try Haskell once again for real development. I wrote my cute little tree kernel in Haskell and the type errors weren’t too painful. And I have been looking for a replacement for Python 3.0, which is a revision of the language that will put a stop to a few of the tricks I use to do functional programming there.
The fact is that functional programming is not welcome in Python because Guido believes it is unreadable, or perhaps for some other reason; all I am sure of is that he wants to restrict its use. I believe that he has been experimenting throughout the 2.x series to find the good parts of functional programming—the Pythonic parts. 3.0 attempts to sweep out the confusing and easily abused parts.
For example, it’s a relatively minor loss, but tuple unpacking (aka list destructuring aka pattern matching) is no longer available on argument lists. You are restricted to assignment (x,y = (1,2)) and iteration (for x,y in zip(xs,ys): …). But tuple unpacking is useful for simple data structures. It’s especially useful for prototyping because you aren’t required to create a lot of class machinery to get most of the benefits of a struct. Example:
class BasicallyAStruct():
def __init__(self, field1, field2, field3):
this.field1 = field1
this.field2 = field2
this.field3 = field3
x = BasicallyAStruct(1,2,3)
print (lambda bas: bas.field1)(x)
# vs
x = (1,2,3)
print (lambda (field1,field2,field3): field1)(x)
More sweepingly, list comprehensions have proved themselves and been expanded to include generator comprehensions. Iterators will be everywhere in 3.0. Unfortunately, these two changes restrict functional programming to specific contexts; you must advertise that you are doing functional programming by sticking your code inside a comprehension. It’s almost the same attitude that Haskell takes toward I/O. And since iterators are use-once, even assignment isn’t safe for an iterator: you can’t assign an iterator to a variable and use it twice, because it will be used up the second time. So iterators are only *supposed* to appear in syntactic iteration contexts—comprehensions and for loops. No passing them around because you can lose track of whether they have been used up yet.
This is fine; Python aims to be a simple language with only one obvious way to do things, and as the 2.x series marches on, this has become more and more of a pretence. 3.0 tidies the language in accordance with its philosophy. But it’s not for me—I want a functional language, and Python’s remaining attraction is that it’s good with text. So I am on the move, or will be in a couple of years when 3.0 is the principal option available for Python programmers. And by ‘on the move’, I mean that I need a different language for prototyping. That means a functional language that’s decent at text manipulation. Since I have been writing types on a lot of my formerly-prototype code anyway, I thought I would give Haskell another go to see if I could stomach Debugging By Type Error instead of Debugging By Exception.
So I started by translating my phonology distance code to Haskell. It was already functional code, with a single read of a CSV at the beginning, so I figured it would be a good place to start. I also learned how to use Haskell’s package manager to download a CSV library. It was easy, although I had to install it first—it didn’t come with my distro.
The result was fairly good—90 lines compared to 110, and a lot of the utilities I defined, like takeWhile and curry, are built in. I’m still getting used to debugging before running the program, but I think I will eventually get into it.
So I decided to see if Haskell could compete with Python on its home ground of text processing. As an example, I used this problem: I wanted to find out which imports I relied on most in Python so that I would know which modules I would have to replace first. Pretty simple stuff, probably 3 lines in Perl. Here is the Haskell code:
import qualified Data.Map as Map
import Data.List (isPrefixOf, sortBy)
import Data.Ord (comparing)
import Text.Regex
--- utils ---
(&) = flip (.)
count xs = Map.fromListWith (+) (zip xs (repeat 1))
--- code ---
main = readFile "python_source.txt" >>= lines & readFiles >>= combine & output
readFiles = mapM (\ f -> readFile f >>= lines & parse & return)
parse = count . concatMap extract
extract line
| "import" `isPrefixOf` line = [drop (length "import ") line]
| otherwise =
format (matchRegex (mkRegex "^from ([a-zA-Z]+) import (.+)$") line)
format (Just [mod,imports]) =
[mod++"."++name | name <- splitRegex (mkRegex ",") imports]
format _ = []
combine = Map.unionsWith (+) & Map.toList & sortBy (comparing snd)
output = mapM_ (\ (name,n) -> putStrLn (name++":\t"++show n))
In case you’re not familiar with Haskell, here are some notes. I defined the operator & to mimic piping in the shell (| is already taken by the language). Compose (.) gets hard to read when you compose more than three things because you have to read the code backward. count is the classic count-list-items-into-a-map code. I do this by pairing every item in the list with the initial value 1 and then telling the Map constructor to combine duplicates with (+).
main is just a list of actions to take in order, feeding the output of one into the next. The only wrinkle is that because of Haskell’s separation of I/O, you have to use >>=, not &, to pipe functions that read from disk. In fact, that’s a dramatic oversimplification—I bet combining I/O types with the rest of the system is one of the biggest obstacles to getting people running with Haskell. Anyway, main just reads the file “python_source.txt", splits it into its constituent lines, reads each file in that list, then combines the resulting maps and then outputs them (somewhat) nicely. The rest of the code is not too surprising, I hope, except for the use of mapM instead of map. That’s needed because readFiles calls readFile and Haskell requires all I/O to be marked with a wrapper type. mapM is a helper that manages wrapping for map (sort of). Like I said, not at all obvious.
The thing that bothers me the most about Real World programming in Haskell right now is the reliance on tree maps and linked lists as the basic data structures. The set of efficient operations is smaller and as a result you have to be more clever more often when writing code. For example, count above is O (n log n), while the Python equivalent is wordier but only O(n) because hash maps are the basis for dictionaries. In any case, I am pretty sure that random-access strings, arrays and hash maps are available, but they are not the first choice of implementation.
For comparison, here is the Python code:
from util.lst import mapn, fst, snd
from util import dct
import re
### code ###
def readFiles(lines):
return [parse(open(fname.strip())) for fname in lines]
def parse(lines):
return dct.count(mapn(extract, lines))
def extract(line):
if line.startswith('import'):
return [line[7:].strip()]
else:
return format(re.match("^from ([a-zA-Z]+) import (.+)$", line))
def format(match):
if match:
return ["%s.%s" % (match.group(1), name)
for name in match.group(2).strip().split(',')]
else:
return []
def combine(ds):
return dct.map(sum, dct.zip(default=0, *ds))
def output(d):
return '\n'.join('%s:\t%s' % pair for pair in sorted(d.items(), key=snd))
print output(combine(readFiles(open('python_source.txt'))))
This is arguably easier to read for the average programmer, at least one who has gone to read the documentation on my util.dct module, but the important part is that for prototyping, I don’t care. I’m the only one using my code. Sure, I’m a “different person” a year later, but that doesn’t mean my knowledge of Haskell melts away. If I have to change something in a year, I really want the ability to say :browse Sed to have the language tell me the types. I should know—this is what I missed about Python a couple of months ago.
In case you are interested, here are the results of the script for modules I used 10 or more times. I have no idea how much code this represents; it’s all python files inside ~/Documents and ~/src.
Notice that the number one import was operator, which allows you to pass operators as higher-order functions. Built-in to Haskell. The first module with code that I wrote is util.dct, not too surprising considering the prevalence of dictionaries in Python and their lack of a functional interface. A large body of this code is a 4-year-old web app, so that explains the prevalence of things like htmlforms and util.web. Also I use re a lot for not liking it very much.
And now for something completely related: trimming a string in Haskell and XSLT, both purely functional languages. I prefer the verbose, slow Haskell version, myself.
tr '\n' '\0' < python_source .txt | xargs -0 cat | egrep "^(from [a-zA-Z]+ )?import " | sed "/^from \([a-zA-Z]*\) import \([^,]*\),\?\(.*\)/{s//\1.\2\n\1:\3/;:l;P;s/.*\n//;tx;:x;s/^\([^:]*\):\([^,]\+\),\?/\1.\2\n\1:/;tl;d;};s/^import //" | sort | uniq -c | sort -g python_source> MyStruct = namedtuple ('MyStruct', 'field1 field2 field3')
x = MyStruct(1,2,3)
print (lambda bas: bas.field1)(x)
lasi@truenadell:/tmp$ cat visitor.py
from compiler import walk, parseFile
class ImportVisitor:
def visitImport(self, t):
for n in t.names:
print n[0]
def visitFrom(self, t):
print t.modname
if __name__ == '__main__':
import sys
for i in sys.argv[1:]:
walk(parseFile(i), ImportVisitor())
lasi@truenadell:/tmp$ python visitor.py visitor.py
compiler
sys
lasi@truenadell:/tmp$