
Programming language makeup of my dialectology dissertation

01/04/10

Permalink 10:37:11 am, 743 words
Categories: Code, Linguistics


This week, Josh was surprised to learn that my entire dissertation isn’t in Haskell. Actually, the core is written in C++, and there are a couple of reasons for that. First, I wrote most of the code before I started using Haskell. Second, the original version of the program had to prioritise memory efficiency over speed because of the experiment I was running at the time. This meant that the Python version was useless, and the Caml version was unbearably slow and often failed on our 2 GB corpus server. The garbage collection overhead, either in space (when tuned for maximum speed) or in time (when tuned for minimum space), was greater than the cost of the stupid inefficiencies in my novice C++ code.

Even now, a naive Haskell version is about twice as slow as the (still fairly naive) C++ version, and it’s not worth the effort to switch. Still, only the core is written in C++. The wrapper that runs the whole thing is Python, the myriad data transformation programs are Haskell, and the statistics checking at the end is R. There’s also a lot of Java that I didn’t write, in MaltParser and the Berkeley parser. Lots.

Here are the numbers:

Python

     255 build.py
     152 consts.py
      51 norte.py
      44 test.py
     502 total

C++

      38 icedist.cpp
      21 icefeat.cpp
      18 icesig.cpp
     258 icecore.h
     125 iceextra.h
     460 total

Haskell

      10 CalculateGeoDistance.hs
       8 CombineFeatures.hs
     113 Consensus.hs
     174 Consts.hs
      65 ConvertBerkeleyToFeature.hs
       6 ConvertDistToL04.hs
      20 ConvertMaltToFeature.hs
      10 ConvertPTBToTags.hs
      14 ConvertTagsToConll.hs
      10 ConvertTagsToFeature.hs
      40 ConvertTalbankenToPTB.hs
      12 ConvertTalbankenToTags.hs
       6 ConvertTntToTxt.hs
      55 Distance.hs
      31 FormatDistance.hs
      12 RankFeatures.hs
       7 Sexp.hs
      47 Swedia.hs
      11 Talbanken.hs
      65 TestConvertTags.hs
     174 TestConvertTalbanken.hs
      49 TestDistance.hs
      55 TestPath.hs
     170 TestSwedia.hs
      16 TriangleInequality.hs
      55 Util.hs
    1235 total

R*

      63 montecarlo Mantel example.R
      57 genAnalysis.R
     120 total

So, yes, the majority of the code for my dissertation is actually Haskell. A lot of that is in an area where Haskell is not thought of as strong: text-file format munging. That’s what all those Convert files are. But really the main difficulty is learning to keep the monadic and pure variants of functions straight. The usage of map vs mapM is not immediately obvious, and the type errors you get are confusing. The good news is that once learned, the patterns are pretty obvious. But that’s the story of Haskell for everything.
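To make the map vs mapM distinction concrete, here is a minimal sketch. The function names (shout, readInt, parseCounts) are made up for illustration and are not taken from the dissertation code.

```haskell
import Data.Char (toUpper)

-- Pure: every element is transformed and nothing can go wrong,
-- so plain map is the right tool.
shout :: [String] -> [String]
shout = map (map toUpper)

-- A step that can fail: parse a whole string as an Int.
readInt :: String -> Maybe Int
readInt s = case reads s of
  [(n, "")] -> Just n
  _         -> Nothing

-- Monadic (here in Maybe): mapM threads the possible failure through
-- the whole list and gives Nothing if any single element fails.
parseCounts :: [String] -> Maybe [Int]
parseCounts = mapM readInt

-- Using map here still type-checks, but it gives a list of Maybes
-- rather than a Maybe list, and the error only shows up downstream:
--   map readInt :: [String] -> [Maybe Int]
```

The confusing type errors usually come from that last case: the mistake is in the choice of map, but the compiler reports it wherever the result is finally used.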

(Also you need a good way to format strings. I’ve got by with unwords and intercalate so far.)
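For what it's worth, getting by with unwords and intercalate looks roughly like this. It is only a sketch: padLeft, countLine and tsvRow are hypothetical helpers, and the field width of 8 is an assumption chosen to match the listings above.

```haskell
import Data.List (intercalate)

-- Right-align a string in a field of the given width.
padLeft :: Int -> String -> String
padLeft width s = replicate (width - length s) ' ' ++ s

-- One line of a wc-style listing: a right-aligned count, then the name.
countLine :: Int -> String -> String
countLine n name = unwords [padLeft 8 (show n), name]

-- A tab-separated record built from already-formatted fields.
tsvRow :: [String] -> String
tsvRow = intercalate "\t"
```

With these definitions, countLine 255 "build.py" gives the same line as the first entry in the Python listing.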

Why not Python instead of Haskell?

So why not Python for text-file munging? Three reasons. The weakest is that functional code in Python is ugly. The second is that functional code is either inefficient (Python 2.x) or unreliable (Python 3.x) compared to Haskell. Haskell is lazy by default and has functional data structures. That means that functional code behaves the same way as imperative code: input is processed line by line and output is generated incrementally, which makes debugging a lot easier. This is not true in Python 2.x, where map and filter build whole lists, and it requires manual analysis of laziness in Python 3.x (since iterators must be explicitly cached to lists wherever they are consumed more than once).
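Here is a small sketch of what that incremental behaviour looks like in Haskell. The "word, tab, tag" input format and the names convertLine and convert are assumptions for the example, not the actual formats or functions from the Convert programs.

```haskell
-- Rewrite one "word<TAB>tag" line as "word/tag".
-- Lines without a tab are passed through unchanged.
convertLine :: String -> String
convertLine l = case break (== '\t') l of
  (w, '\t' : t) -> w ++ "/" ++ t
  _             -> l

-- The whole-file transformation, written as a plain pipeline.
-- Because lines, map and unlines are all lazy, each output line is
-- produced as soon as its input line has been read.
convert :: String -> String
convert = unlines . map convertLine . lines
```

Wiring this up as main = interact convert gives a filter that streams stdin to stdout. The laziness is easy to check in GHCi: take 2 (lines (convert (cycle "a\tb\n"))) returns two converted lines even though the input is infinite.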

The third reason is that functional code is unmaintainable by default in Python. The text-munging I do is not your run-of-the-mill Perl sweet-spot of per-line regex processing; much of it involves building a tree representation and then manipulating it (there is a lot of parsing in my dissertation after all). So, intermingled with the text-munging is some fairly complex code. Python will punish you for writing complex functional code without documentation. In contrast, Haskell either generates the documentation for you (eg :browse Module) or requires declarations up front (eg data Tree a = Leaf | Node a [Tree a]).
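As an illustration of how much the up-front declaration buys you, here is a minimal sketch built on the Tree type mentioned above. The functions labels and depth and the Functor instance are hypothetical examples, not code from the dissertation.

```haskell
-- The declaration from the text, with derived instances for convenience.
data Tree a = Leaf | Node a [Tree a]
  deriving (Show, Eq)

-- Collect every node label in pre-order.
labels :: Tree a -> [a]
labels Leaf          = []
labels (Node x kids) = x : concatMap labels kids

-- Number of Nodes on the longest path down from the root.
depth :: Tree a -> Int
depth Leaf          = 0
depth (Node _ kids) = 1 + maximum (0 : map depth kids)

-- Relabel every node while keeping the shape of the tree.
instance Functor Tree where
  fmap _ Leaf          = Leaf
  fmap f (Node x kids) = Node (f x) (map (fmap f) kids)
```

The type signatures are the documentation here: anyone reading labels :: Tree a -> [a] knows what goes in and what comes out before looking at the body, and :browse will list them all.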

I think what surprised Josh is that I talk a lot more about Haskell than about Python. But I use the right tool for the job**, and Python’s place is making imperative code simple, so that’s where I use it. (I wrote an awesome parallel execution function that’s compatible with Python 2.6 and 3.0, for example.) And I use C++ when I need maximum memory efficiency and speed in the same program.

*I didn’t write the first R file. Stephanie Dickinson of the IU Statistics Consulting Center did. If you have even a single statistics question, this is an invaluable place. Ask them what statistics method you need, and they will not only tell you, but even write code for you to run it.

**Of course, my understanding of these tools might be flawed. In particular, my view of Python’s applicability may be artificially restricted. And my view of C++ may be inflated…
