Category: Linguistics

Pages: 1 2 3 4 5 6 7 8 9 10 11 ... 20 >>

05/06/10

Permalink 09:16:53 pm, 899 words
Categories: Code, Linguistics

Stack Overflow, The Death of Words and Type Classes

C S Lewis has a perceptive essay called “The Death of Words". You can find the title screen here; you can poke around and probably convince Google to let you read most of the essay. Anyway, the problem that he’s complaining about is that the meaning of words tend to change over time. A lot of words that meant something just mean “good” now, partly because we liked the original thing. This is well-known in historical linguistics, so maybe Lewis picked it up from hanging around Tolkien. Lewis captures this in the quote:

“And as long as most people are more anxious to express their likes and dislikes, this must remain a universal truth about language.”

He starts giving examples and a couple of paragraphs hits upon the word which, in the 21st century, is almost beyond dead. People have almost stopped using it, even to mean “good", except in computer science. Lewis says, “Modern … has ceased to be a chronological term … and often means little more than ‘efficient’".

A recent question on Stack Overflow about dynamic scope has an answer by Eli Bendersky. The answer is great, but it’s also a good example of how dead ‘modern’ is. Here’s the end of his post (emphasis mine).

That said, lexical binding is IMHO much better for 99% of the cases. Note that modern Lisps are not dynamic-binding-only like Emacs lisp.

  • Common Lisp supports both forms of binding, though the lexical one is used much more
  • The Scheme specification doesn’t even specify dynamic binding (only lexical one), though many implementations support both.

In addition, modern languages like Python and Ruby that were somewhat inspired by Lisp usually support lexical-binding in a straightforward way, with dynamic binding also available but less straightforward.

Now, Bendersky frequently gives insightful answers about all kinds of Lisps and the P-languages descended from Lisp. I am sure he knows that Scheme (1975) is 10 years older than Emacs Lisp (1985) and Common Lisp is a year older (1984). I am sure he also knows that Python (1991) is only 6 years younger and Ruby (1995) 10 years. The reason E-Lisp didn’t have lexical scope is not because it’s less modern than the other languages. It’s because it was a mediocre scripting language from its inception. The fact that it hasn’t added lexical scope in the intervening 25 years means that it’s now a bad scripting language. Look: Python didn’t have any nested scope until 2.0. But at least Guido added it. No such luck with E-Lisp.

The word “modern” is dead. It just means “good” here, specifically “has good features". But this meaning doesn’t survive from previous usage; it’s picked up from context. And it really annoys me. It would be a lot better just to say “Note that good Lisps…” and “In addition, “good languages like Python and Ruby…". Even better, be substantive in your dislike: “In addition, “Note that full-featured Lisps…” and “In addition, well-designed languages like Python and Ruby…”

By the way, those substantive reasons mean that Emacs would be better off with any of the languages in the preceding paragraph besides E-Lisp, by the way. They’re all better languages, even Common Lisp, which hasn’t changed since 1994. If somebody could port all the millions and millions lines of E-Lisp, or least provide backward compatibility, it would probably extend Emacs’ popularity ten-fold. (I was going to say “life span", but I’m convinced that’s converging on co-equal with that of the human race at this point.)

In conclusion, Lispers, don’t mince words: E-Lisp is a bad Lisp, not because it’s not modern, but because it’s badly designed and feature-free.


The second question I noticed today is related to my recent post trying to emulate some uses of typeclasses with “standard” (ie Java) OO. Here, the question is “how similar are Go’s interfaces and Haskell’s type classes?”

Answer: just as similar as standard OO’s interfaces, because Go’s interfaces are no more powerful than Java’s. They’re just type-checked structurally. That’s cool, because implementation is implicit and can be done by anybody, not just the original author. But it’s nowhere near as powerful as typeclasses.

Of course systems with delayed typing (like Ruby’s mixins and C++’s templates) can be as powerful as typeclasses; by the time the type is checked, the type variables have been fully instantiated so the checker sees nothing but concrete types. But…it also means your error messages also contain nothing but concrete types; that’s the trade-off.

I guess the only reason for this question is that Go only has two tutorial documents, and the concept of structural typing is so unfamiliar that people stop to re-examine their assumptions about what you can and can’t do. I did, but after re-reading the Go tutorial I’m convinced that you don’t get any extra power above Java.

Update: Norman Ramsey chimed in a few minutes ago to point out that Haskell’s type system is equivalent in power to cut-free Prolog, ie “pure” Prolog, ie mini-Kanren. So if I ever want to subject myself to that pain again, I know I can do it without leaving Haskell.


Edit:

Lispers have been abusing ‘modern’ for a long time. While packing for my upcoming move, I ran across Eugene Charniak’s AI book from 1985. The Lisp tutorial chapter refers to Scheme as a “modern” Lisp and Franz Lisp as a “less modern” Lisp. Yup, still backwards: Scheme: 1975; Franz Lisp: 1978. (Although it is based on MacLisp, which is an older design.)

01/04/10

Permalink 10:37:11 am, 743 words
Categories: Code, Linguistics

Programming language makeup of my dialectology dissertation

This week, Josh was surprised to learn that my entire dissertation isn’t in Haskell. Actually, the core is written in C++, but there are a couple of reasons for that. First, I wrote most of the code before I started using Haskell. Second, the original version of the program had to prioritise memory efficiency over speed because of the experiment I was running at the time. This meant that the Python version was useless, and the Caml version was unbearably slow and often failed on our 2 GB corpus server. The garbage collection overhead, either in space (for maximum efficiency) or in time (for minimum space) was greater than the stupid inefficiencies in my novice C++ code.

Even now, a naive Haskell version is about twice as slow as the (still fairly naive) C++ version, and it’s not worth the effort to switch. Still, only the core is written in C++. The wrapper that runs the whole thing is Python, the myriad of data transformation programs are Haskell, and the statistics checking at the end is R. Also, there’s a lot of Java which I didn’t write in MaltParser and the Berkeley parser. Lots.

Here are the numbers:

Python

     255 build.py
     152 consts.py
      51 norte.py
      44 test.py
     502 total

C++

      38 icedist.cpp
      21 icefeat.cpp
      18 icesig.cpp
     258 icecore.h
     125 iceextra.h
     460 total

Haskell

     10 CalculateGeoDistance.hs
       8 CombineFeatures.hs
     113 Consensus.hs
     174 Consts.hs
      65 ConvertBerkeleyToFeature.hs
       6 ConvertDistToL04.hs
      20 ConvertMaltToFeature.hs
      10 ConvertPTBToTags.hs
      14 ConvertTagsToConll.hs
      10 ConvertTagsToFeature.hs
      40 ConvertTalbankenToPTB.hs
      12 ConvertTalbankenToTags.hs
       6 ConvertTntToTxt.hs
      55 Distance.hs
      31 FormatDistance.hs
      12 RankFeatures.hs
       7 Sexp.hs
      47 Swedia.hs
      11 Talbanken.hs
      65 TestConvertTags.hs
     174 TestConvertTalbanken.hs
      49 TestDistance.hs
      55 TestPath.hs
     170 TestSwedia.hs
      16 TriangleInequality.hs
      55 Util.hs
    1235 total

R*

      63 montecarlo Mantel example.R
      57 genAnalysis.R
     120 total

So, yes, the majority of the code for my dissertation is actually Haskell. A lot of that is in an area where Haskell is not thought of as strong: text-file format munging. That’s what all those Convert files are. But really the main difficulty is learning to keep the monadic and pure variants of functions straight. The usage of map vs mapM is not immediately obvious, and the type errors you get are confusing. The good news is that once learned, the patterns are pretty obvious. But that’s the story of Haskell for everything.

(Also you need a good way to format strings. I’ve got by with unwords and intercalate so far.)

Why not Python instead of Haskell?

So why not Python for text-file munging? Three reasons: the weakest is that functional code in Python is ugly. The second is that functional code is either inefficient (Python 2.x) or unreliable (Python 3.x) compared to Haskell. Haskell is fully lazy and has functional data structures. That means that functional code behaves the same way as imperative code—processing is line-by-line and generated incrementally, which makes debugging a lot easier. This is not true in Python 2.x and requires manual analysis of laziness in Python 3.x (since iterators must be explicitly cached to lists where necessary).

The third reason is that functional code is unmaintainable by default in Python. The text-munging I do is not your run-of-the-mill Perl sweet-spot of per-line regex processing; much of it involves building a tree representation and then manipulating it (there is a lot of parsing in my dissertation after all). So, intermingled with the text-munging is some fairly complex code. Python will punish you for writing complex functional code without documentation. In contrast, Haskell either generates the documentation for you (eg :browse Module) or requires declarations up front (eg data Tree a = Leaf | Node a [Tree a]).

I think what surprised Josh is that I talk a lot more about Haskell than about Python. But I use the right tool for the job**, and Python’s place is making imperative code simple, so that’s where I use it. (I wrote an awesome parallel execution function that’s compatible with Python 2.6 and 3.0, for example). And I use C++ when I need maximum memory efficiency and speed in the same program.

*I didn’t write the first R file. Stephanie Dickinson of the IU Statistics Consulting Center department did. If you have a single statistics question, this is an invaluable place. Ask them what statistics method you need, and they will not only tell you, but even write code for you to run it.

**Of course, my understanding of these tools might be flawed. In particular, my view of Python’s applicability may be artificially restricted. And my view of C++ may be inflated…

23/03/10

Permalink 02:31:14 pm, 1183 words
Categories: Code, Linguistics

Testing random number generation

Here’s how to test a pseudo-random number generator to make sure that it’s giving you a uniform distribution. Simple, but something that I just now had time to do.

For my research, pseudo-random numbers are perfect; if you use the same seed at the beginning, you get the same numbers back every time. I don’t even care if the stream is predictable by an outside observer; I just need a series of numbers that are uniformly distributed over a range. This is true for many scientific applications.

That means my “randomness” test is pretty simple at the heart: I just have to make sure the distribution is uniform.

Here’s the random number generator I’m using. It’s from The C++ Programming Language by Bjarne Stroustrup.

class Randint {
  unsigned long randx;
public:
  Randint(long s = 0) { randx=s; }
  void seed(long s) { randx=s; }
  //magic numbers chosen to use 31 bits of a 32-bit long
  static long abs(long x) {return x & 0x7fffffff; }
  static double max() { return 2147483648.0; } // note: a double
  long draw() { return randx = randx * 1103515245 + 12345; }
  double fdraw() { return abs(draw()) / max(); } // in the interval [0,1]
  long operator()() { return abs(draw()); } // in the interval [0,2**31]
};
class Urand : public Randint { // uniform distro in the interval [0:n[
public:
  Urand(long s = 0) : Randint(s) { }
  long next(long n) { long r = long(n*fdraw()); return (r==n) ? n-1: r; }
};

So you can see that we have a nice concise class here, then some needless inheritance. Yup, must be a Stroustrup design. (Well, in his defence, I think he was using random number generation to illustrate principles of inheritance.)

OK, let’s jump forward a couple of years to yesterday. I’ve read about 3/4 megabyte of other people’s C++ in the interim and discovered that my advisor was right all along: Just Pretend It’s C.

#define N 5
#define TRIALS 300000
void random_test() {
  Urand uni((unsigned)time(0));
  int is[N];
  for(int i = 0; i < N; i++) {
    is[i] = 0;
  }
  for(int _ = 0; _ < N * TRIALS; _++) {
    is[uni.next(N)]++;
  }
  for(int i = 0; i < N; i++) {
    cout << is[i] << ' ';
  }
  cout << endl;
}

Well, you can see right away a C-like use of the preprocessor to define constants. I remember Stroustrup says not to do that, but I can’t remember what the replacement is. So I Just Pretend It’s C.

Next, I declare a random number generator and an array to count how many times each number in the range N shows up.

Now there are three loops. Loop 1 initialises the array to 0. Loop 2 generates N*TRIALS numbers in the range N. Every time a number shows up, it increments the appropriate elements in the counting array. I used ++ because the languages I normally use don’t support it and it was exciting.

Finally, I left-shift cout by a bunch of numbers. Wait, I mean “print out the results". Because there are N elements in the array and N*TRIALS numbers generated, each element should be almost equal to TRIALS. In other words, the output should be something like this:

300902 299324 299833 300159 299782

In fact, because this is C, you should actually check that the sum of those numbers is N*TRIALS to make sure you weren’t incrementing like, is[N] or is[-1] by mistake.

OK, so that’s the simple case. In addition, I want to verify that the specific ways I use the random number generator are working. So I need to test two things: (1) does my sampling code (choice) work? (2) does my shuffling code (shuffle) work?

template <class T> T choice(const vector<T>& l) {
  static Urand uni((unsigned)time(0));
  return l[uni.next(l.size())];
}

template <class T> void shuffle(vector<T>& l) {
  static Urand uni((unsigned)time(0));

  for(int i = l.size(); i > 1; i--) {
    int j = uni.next(i);
    T tmp = l[j];
    l[j] = l[i-1];
    l[i-1] = tmp;
  }
}

Both functions cache their own static instance of the random number generator. I did it this way to avoid globals. I don’t know if this idiomatic C, because I don’t remember ever seeing static used in real code. Besides that, I don’t think there’s anything remarkable about this code. It’s the shuffle algorithm from Knuth, the efficient one that I can’t use in Haskell without dropping into the ST or IO monad. So that’s cool. I like it.

To test choice, we want to see whether each element in a list is chosen about the same number of times. Since Urand::next works, it’s pretty easy to suspect that choice works, but let’s write the test anyway.

void choice_test() {
  vector<int> l(N);
  int is[N];
  for(int i = 0; i < N; i++) {
    is[i] = 0;
    l[i] = i;
  }
  for(int _ = 0; _ < N * TRIALS; _++) {
    is[choice(l)]++;
  }
  for(int i = 0; i < N; i++) {
    cout << is[i] << ' ';
  }
  cout << endl;
}

Yeah, pretty much the same as random_test except there’s a vector l that holds the numbers 0-(N-1). Testing shuffle, on the other hand, is more interesting.

void shuffle_test() {
  vector<int> l(N);
  int rs[N][N];
  for(int i = 0; i < N; i++) {
    l[i] = i;
    for(int j = 0; j < N; j++) {
      rs[i][j] = 0;
    }
  }
  for(int _ = 0; _ < N * TRIALS; _++) {
    shuffle(l);
    for(int i = 0; i < N; i++) {
      rs[i][l[i]]++;
    }
  }
  for(int i = 0; i < N; i++) {
    for(int j = 0; j < N; j++) {
      cout << rs[i][j] << ' ';
    }
    cout << endl;
  }
}

Again there are three loops, but now they’re nested and the test array is 2D. That’s because the way to test shuffle is to make an vector l of length N, shuffle it N*TRIALS times, and make sure that each element of l appears in each position approximately TRIALS times. So if N=3 and TRIALS=1000, then we expect 0 to appear in position 0 1000 times, position 1 1000 times and position 2 1000 times. Likewise, we expect 1 to appear in position 0 1000 times, position 1 1000 times and position 2 1000 times. Finally, 2 should appear in position 0 1000, 1 1000 and 2 1000. So you can see that the test matrix has to be NxN.

012
0100010001000
1100010001000
2100010001000

Here is some example output for the actual definitions of N and TRIALS from above (5 and 30000). Again, summing each row to make sure that they’re all equal to N*TRIALS is a good idea.

300943 299623 299526 300206 299702
299383 299473 300661 299518 300965
300022 299529 300791 300685 298973
299796 300818 299084 299717 300585
299856 300557 299938 299874 299775

Postscript

I have to say that I didn’t think the generator I implemented really worked. I figured either a throwaway example in The C++ Programming Language would be too simple, or I would have made a mistake implementing it somewhere. Neither turned out to be true. Well, I did have a bug in shuffle at first. Turns that you can’t swap things using references:

    T& tmp = l[j];
    l[j] = l[i-1];
    l[i-1] = tmp;

You have to make a real temporary copy of the thing.

    T tmp = l[j];
    l[j] = l[i-1];
    l[i-1] = tmp;

At least if your input list is of type vector<T>; I suppose you could do it with a vector<T*> in which case you have an additional indirection all the time. So never mind.

07/03/10

Permalink 01:40:33 pm, 1452 words
Categories: Code, Linguistics

Bing Maps API

Recently my advisor suggested that I compare my results to travel distance between towns instead of straight-line geographical distance. That seems like a good idea—if there’s a huge lake and a mountain in between two towns, the languages are likely to be very different. The problem is taking the time to gather all the distances; it’s not a big deal to identify 30 or so sites on Google Maps, but when you need to find the 435 distances between all 30 sites, you probably want a script to do it.

The obvious choice for this is an API to the mapping service you were going to use in the first place. For me, that led to a couple of choices:

Google Maps vs Bing Maps

Google Maps requires me to set up a web page (publicly accessible) and make a visible embedded map, then manipulate it with Javascript. This is really intended for mashups I think. I was disappointed because I prefer to let my knowledge of HTML and Javascript rot and moulder. (I found some discussion that indicates there is no SOAP API, although it’s still possible that I just missed it.)

In contrast, the 3rd Google result for “google maps soap api” is for Bing Maps, admittedly under the previous branding “Virtual Earth". The SOAP API that allows you to pretend (given a Proper Library) that you are calling a function that magically gives distances between two points. The Proper Library manages all the XML conversion/parsing and sending over the Internet. So Bing Maps won this one, because I just want to import a library and call a function from a script instead of embedding the whole thing in a web page. This led to a second choice:

Programming Language

I momentarily considered Haskell, because it does have a lot of libraries listed in Hackage. But it’s pretty weak on the XML front and its lone SOAP library is sort of just a text wrapper around XML that you have to build yourself. No XML conversion or parsing, just the sending-over-Internet part.

From there I went to Python. I quickly found that Dive Into Python has a chapter on SOAP, but it’s 6 or 7 years out of date, and SOAPpy, the library it uses, to has been replaced by ZSI, “Zolera Soap Infrastructure". I distrust these companies that start with Z (*cof* Zope *cof* Zend) and the source was package in a .egg (or maybe a .muffin? no it was definitely .egg) which I don’t even know how to install and it was all just Fear-Uncertainty-Doubt.

The real reason that I wasn’t enthusiastic about installing a Python library is that (1) I probably will never use SOAP again and (2) in the Bing Maps documentation, it mentions that Visual Studio has SOAP libraries installed with it by default. And all the examples are in C#, so it becomes a matter of cut-and-paste. And Visual Studio has a wizard for generating the proxy objects you need in a static language. (Imagine that, Visual Studio having a wizard!) Well, the wizard was the last straw—I mean who can resist all that code the wizards generate for you? Plus it was getting late and I just wanted to get something working. I even resisted the temptation to write the code in F#, because F# is extremely short on wizards right now.

(The code would have been a lot better in F#, though. It has tuples and pattern matching and it’s hard to write a program these days that doesn’t use one or both of those features. You’ll see when I show you the code.)

So let’s see. Here’s the gist of my workflow. I went to Bing Maps and identified all 30 of the interview sites. Then I dumped to some format called KML. Now, Bing Maps doesn’t make it obvious how to give somebody a link to these sites, but the option to dump to KML is front and centre. I suspect that it started life as a desktop product. Anyway, it’s annoying for directions but it’s perfect for my purposes. (Hint: click the icon of a letter in the bottom-left, even if you just want a link to the map.)

I did grep coordinates swedia.kml >pbcopy and pasted that in to an emacs buffer. From there I recorded some macros to produce a Python literal of a list of pairs. With the list loaded in the REPL, I added on the location names to each location, something like this:

[(site,lat,long) for (site,(lat,long)) in zip(consts.swediaSites, coords)]

At this point I had a list in Python whose elements looked like this:

("Ankarsrum", 16.33, 57.70)

Then I pasted the list back into the emacs buffer and added some noise around each line to make it valid C#:

sites.Add(new Loc("Ankarsrum", 16.33, 57.70));

I should have just written a string format comprehension in Python but I was really tired by this point.

This, of course assumes a struct named Loc and a List<Loc> called sites. But that’s easy enough to do in C#. Pardon my outmoded accent—the machines I coded for at Bing had not yet updated to .NET 3.5 so I didn’t really learn a lot of 3.5 conveniences fluently.

    struct Loc
    {
        private string _site;
        private double _lat;
        private double _long_;
        public String Site { get { return _site; } }
        public double Lat { get { return _lat; } }
        public double Long { get { return _long_; } }
        public Loc(String site, double lat, double long_)
        {
            _site = site;
            _lat = lat;
            _long_ = long_;
        }
    }
    class Program
    {
        private static List<Loc> sites = new List<Loc>(30);
        private static void init()
        {
            sites.Add(new Loc("Ankarsrum", 16.331280825605035, 57.700431990646344));
            // ... 29 more of these ...
        }

OK, so that’s the boring setup. Then I needed a pairwise iteration over a list. This is a lot easier with linked lists and type inference—managing indices is fiddly and the type annotations get unwieldy so I said Forget It and wrote some C-like code instead. I didn’t want to fiddle with it to get it right. (But it is pretty ugly).

        private static IEnumerable<int[]> pairwise(int n)
        {
            for (int i = 0; i < n; i++)
            {
                for (int j = i + 1; j < n; j++)
                {
                    yield return new int[] { i, j };
                }
            }
        }

Finally, the code in Main is not all that long as badly written C# goes:

        static void Main(string[] args)
        {
            init();
            var req = new BingMaps.RouteRequest();
            req.Credentials = new BingMaps.Credentials();
            req.Credentials.ApplicationId = 
              "Very-Long-String-You-Get-By-Registering-With-Bing";
            BingMaps.Waypoint[] route = new BingMaps.Waypoint[2];
            req.Waypoints = route;
            var client = new BingMaps.RouteServiceClient(
              "BasicHttpBinding_IRouteService");

            foreach (int[] indices in pairwise(30))
            {
                var from = sites[indices[0]];
                var to = sites[indices[1]];
                route[0] = new BingMaps.Waypoint()
                { 
                    Description = from.Site,
                    Location = new BingMaps.Location() { 
                      Latitude = from.Long, Longitude = from.Lat 
                    }
                };
                route[1] = new BingMaps.Waypoint()
                {
                    Description = to.Site,
                    Location = new BingMaps.Location() { 
                      Latitude = to.Long, Longitude = to.Lat 
                    }
                };
                var d = client.CalculateRoute(req).Result.Summary.Distance;
                Console.WriteLine(string.Format("(\"{0}\", \"{1}\"): {2},", 
                  from.Site, to.Site, d));
            }
        }

I added all the type inference it could handle and shortened some of the ridiculous example names. The API is still pretty verbose but I think that’s because it’s SOAP. Let’s see, you have to

  1. Create a request.
  2. Create new credentials and give it your personal key.
  3. Create a new route and add the route to the request.
  4. Create a new Route Service client. (There are several other services, geocoding for example.)
  5. Inside the loop, set each point in the route.
  6. Also inside the loop, call client.CalculateRoute, passing in the request. This is the actual SOAP call that sends XML across the wire and back.
  7. Print out the distance of the route summary.

If you specify two points that do not have a connecting route, then you get a hilarious null pointer exception somewhere along the chain CalculateRoute().Result.Summary.Distance. It seems to me that this should be an exception, but no, just null.

Really, though, the only reason that I got null pointer exceptions is that KML stores its points in Longitude, Latitude order, while the rest of the world uses Latitude, Longitude order. So instead of two Swedish villages, I requested the distance between two points in the Arab Sea. The alert reader will notice the switcheroo I have to pull to make this work. I should have changed the Loc constructor but I didn’t think of that until just now.

The code takes about 100 seconds to get all 435 distances. It’s faster than I expected, so I guess most of the lag in the web interface is the browser drawing the route.

In conclusion, this probably won’t be useful to anybody else, but at least maybe you’ll think it’s cool to see how you can easily get driving distances between all unique pairs of sites.

03/03/10

Permalink 02:43:15 pm, 1007 words
Categories: Code, Linguistics

Reading and writing consensus trees in Haskell

I use the consensus tree code I wrote about the last two times in my dissertation to find agreement between clusters of Swedish dialect interviews. The input is cluster trees from R (the open source S) and the output is formatted for use by the qtree Latex package. I also need to filter out cluster trees that contain non-significant distances—all the input distances should be statistically significant.

main = do
  sigs <- withFileLines findSigs "sig-10-1000-interview.csv"
  interactFiles (withFileLines (makeConsensusTree sigs & list)) qtree

makeConsensusTree uses the list of significant files and qtree converts a Tree to a String, suitable for output to stdout. But first, findSigs:

findSigs =
  tail
  & map (replace ',' ' ' & words & tail
         & zip ["path", "trigram", "dep", "psg", "grand", "unigram", "all"]
         & filter (snd & (=="0"))
         & map fst)
  & zip ["r", "r_sq", "kl", "js", "cos"]
  & concatMap (uncurry (zip . repeat))
  & Set.fromList

(Reminder: (&) == flip (.)) findSigs is a hoky CSV parser. It drops the column heads (tail), then chops each line into fields (map (replace ',' ' ') & words & tail), assuming that they don't contain spaces so that it can use words to do the chopping.

Then it zips the hard-coded column heads back on, and throws away fields that aren't (=="0")—that is, feature sets that have more than 0 non-significant distances.

For each line, the significant feature sets are zipped with their with distance measure, and a hilarious and incomprehensible concatMap call turns them into a list of pairs like [("path","r"), ("trigram", "r"), ("path", "js") ...].

I know this is statically typed, but I don't think anybody would hesitate to call this ad-hoc mess a script. It was written in about 5 minutes for a specific file and will break as soon as that file changes.

Moving on.

makeConsensusTree sigs =
  filter (isPrefixOf "Cluster: ")
  & map rReader
  & filter (fst & (`Set.member` sigs))
  & map (snd & buildRTree)
  & consensus

makeConsensusTree is the top-level function for reading in the different dendrograms and writing out a consensus tree. Its first task is to read Yet Another Tree Format, this time R's binary dendrogram format. Each tree is on a line that starts with "Cluster: " (hence, filter (isPrefixOf "Cluster: "). Then the measure and feature set follows, and finally a list of numbers. If you divide those numbers in half (which rReader does), you get two columns.

rReader line = 
  ((measure,features),
   splitAt (floor (fromIntegral (length ns) / 2)) ns |> uncurry zip)
  where (_:measure:features:ss) = words line
        ns = map read ss

For example, if you have the line

Cluster: r all -2 1 -4 2 -4 -9 -8 3

The numbers are interpreted as the following two columns:

1.-2-3
2.1-9
3.-4-8
4.23

buildRTree further interprets these numbers as a binary tree: negative numbers are leaf nodes, (negative) indices into the interview sites or whatever the things are that you're clustering. Positive numbers are internal nodes, indices that refer to a previously created node. The index it refers to is the step at which the node was created. So, in the above example, internal node 1 has two leaf children: values 2 and 3. Internal node 2 has two children. One is node 1 and the other is a leaf (with value 9). The last node, 4, is the root. For the tiny example above, the Lisp representation of the tree is:

'(((2 3)
   9)
  (4 8))
label (Leaf a) = a
label (Node a _) = a
buildRTree tuples = last nodes
  where nodes = map buildNode tuples
        buildNode (i,j) = Node ('$':label inode++label jnode) [inode,jnode]
          where (inode,jnode) = (buildChild i, buildChild j)
        buildChild i | i < 0 = Leaf $ swediaSites !! (-i-1)
                     | otherwise = nodes !! (i-1)

The way that buildRTree implements this is kind of clever; it has to refer to previous parts of the node list as the list is being built. In Haskell, laziness is the most natural way to do this. So when buildChild implements the negative/positive → leaf/non-terminal decision, it refers to nodes for the non-terminal case. But nodes is just map buildNode and buildNode first calls buildChild and oh-no-circular-definition!

But it turns out all right, because, assuming that R has properly dumped the node indices, buildChild's i will always be less than the length of nodes at the time that it's required.

(Of course, you should always avoid (!!) and this particular use is egregiously O(n2) blah blah, but remember once again that this is a script and that I happen to know that I'll never run this on trees with more than 100 terminals. And that I'm willing to wait for up to 5 minutes.)

qtree (Leaf a) = Set.findMax a
qtree (Node a []) = Set.findMax a
qtree (Node a kids) = "[. {" ++ left ++ "} " ++ right ++ " ]"
  where (leaves,nodes) = partition leaf kids
        left = intercalate "\\\\" (map qtree leaves)
        right = unwords (map qtree nodes)

Finally, qtree converts a tree into Yet Another Tree Format, this time the qtree Latex package. I can then paste the string into my dissertation. It's pretty straightforward except for the abuse of Set.findMax—it's not finding anything since there's only one item left in the set anyway. I just don't have a way to destructure sets.

Well, now you've seen how messy script code can be in Haskell. I think it's a pretty good argument that static typing doesn't buy you much up front—I can be as messy and ugly as I want in either Python or Haskell. All you can hope for is that the type system gets out of your way while you're doing it. The win comes in six months when I decide I need to make the code more robust or (more likely) add a tiny feature. Haskell is more discoverable out of the box—no type documentation needed: the compiler figures it out for you—and more reassuring to make changes—less need for tests: many breaking changes will cause the compiler to complain.

In case you notice a tacit unwillingness to write documentation and tests for my script code, you got me. Yup. I wrote tests for some of it, but after the code was done. And documentation? I'm writing a dissertation! It's pages and pages of high-level documentation! (Just don't expect any commentary on actual code.)

1 2 3 4 5 6 7 8 9 10 11 ... 20 >>

blog soft