« Consensus trees implemented in HaskellClassifying Rock Band, Metroid Prime 3 and Portal in 2D space »

Syntactic distance in C++

24/02/10

Permalink 11:34:43 am, 3484 words
Categories: Code, Linguistics

Syntactic distance in C++

I believe in the Golden Rule, which is (for those not current on the first gospel) "do for others what you'd like them to do for you". And I like to read code. But it's not really fair to ask to see other people's code without showing mine too. So here's my dissertation code. Well, the part that's unique to my dissertation at least. And minus all the Haskell code that does all the pre-, post- and interstitial processing, and the Python glue that runs it all.

Some caveats: this is fairly simple code, but it's surrounded by a lot of infrastructure. It has to work fast and elegance is not the first priority. Oh, did I mention that it's in C++? So elegance CAN'T BE the first priority. I'm not a native C++er so the code reads a lot like the Caml original. I kind of wish I'd stuck with Caml, but at the time I was trying to reduce memory use at all costs.

Note: The source files don't mention it, but the code is available under the BSD licence.

The last caveat is that if you want to know the purpose of this code, you really need to read one of my papers or John Nerbonne and Wybo Wiersma's papers. I will try to explain along the way, but no guarantees about the explanation's quality. I apologise ahead of time.


OK, so the core code is in "icecore.h". For each of the three programs, I have three small cpp files that #include "icecore.h" and each do something slightly different. It's not the greatest setup—I suppose I should have the implementations separate from the .h in icecore.cpp—but it's simple and works well with my Python wrapper.

#include <string>
#include <list>
#include <ext/hash_map>
#include <iostream>
#include <fstream>
#include <cmath>
#include "params.h"
#define SIGNIFICANCE ITERATIONS / 20

The only interesting part of this code is "params.h". My Python script writes a new version of params.h for each variant of measure/features/iterations/etc. ITERATIONS is one of those parameters, so then SIGNIFICANCE is ITERATIONS / 20. It bothers me for constants to randomly pop up in my code, un-namespaced or listed in an import statement, but this seems to be a natural part of life in C/C++. Experts can correct me if I'm wrong—maybe C++'s practise is to stick constants in their own namespace.

namespace __gnu_cxx { // culled from the internet.
  template<> // just wraps hash<char *>, which is entirely logical
  struct hash<std::string> { // and should have already been done
    hash<char *> h;
    size_t operator()(const std::string& s) const {
      return h(s.c_str());
    }
  };
}

Actually, one of those includes hides another tricky thing: including a hash map instead of a tree map. Yeah, I know you were thinking "C++, that's an imperative language. The STL will provide a hash map implementation!". To you I say HA. HA HA. Of course not! The STL provides a TREE map, which has log(n) insertion and lookup and is appropriate for use in functional code. That makes it pretty out of place in C++.

So now, yeah, I can tell you are thinking "GCC, it provides a hash map implementation. I bet it conveniently interoperates with all the STL types." To you I now shake my head with tears. Tears of failed compilations. No, it turns out that you have to manually specialise the hash to work with strings. (you know, those things that you ALWAYS use as hash keys) Strings don't work out of the box. Fortunately there are a lot of people using C++, so it's fairly simple to copy/paste from other blogs.

using namespace std;
using namespace __gnu_cxx;
/* Ara */
double sum(const vector<double>& a) {
  double total = 0.0;
  for(int i=0; i < a.size(); i++)
    total += a[i];
  return total;
}
/* Hash */
hash_map<string,double, hash<string> > count (const vector<string>& a) {
  hash_map<string,double,hash<string> > h;
  for(int i=0; i < a.size(); i++) {
    h[a[i]]++;
  }
  return h;
}

Now for a couple of utilities. sum is straightforward. count is what I now call histogram: count how many times each string happens in a list. It's the core of the program really: at its heart, the statistical classifier basically counts how many times the string X happens in dialect 1 and compares how many times X happens in dialect 2. Everything else is just extra to make sure the counts are actually meaningful.

Those "Ara" and "Hash" comments are remnants from the much larger utility libraries I wrote in Caml. All that stuff went away as I simplified for my limited knowledge of C++ (also known as: "I hate #include").

/* random classes from Stroustrup's 3rd edition */
class Randint {
  unsigned long randx;
public:
  Randint(long s = 0) { randx=s; }
  void seed(long s) { randx=s; }
  //magic numbers chosen to use 31 bits of a 32-bit long
  static long abs(long x) {return x & 0x7fffffff; }
  static double max() { return 2147483648.0; } // note: a double
  long draw() { return randx = randx * 1103515245 + 12345; }
  double fdraw() { return abs(draw()) / max(); } // in the interval [0,1]
  long operator()() { return abs(draw()); } // in the interval [0,2**31]
};
class Urand : public Randint { // uniform distro in the interval [0:n[
public:
  Urand(long s = 0) : Randint(s) { }
  long next(long n) { long r = long(n*fdraw()); return (r==n) ? n-1: r; }
};
template <class a> a choice(const vector<a>& l) {
  static Urand uni((unsigned)time(0));
  return l[uni.next(l.size())];
}

So yeah, it turns out that C++ also doesn't have a reliable way to generate pseudo-random integers in a certain range. Fortunately, Bjarne Stroustrup provides a couple of classes to do exactly that in The C++ Programming Language. The code looks a little too simple to be real, but, from what I've read, this does appear to implement one of the basic pseudo-random number generators.

typedef vector<string> strings; // I can't get these two to use const string
typedef vector<vector<string> > dialect;
typedef hash_map<string, double, hash<string> > entry;
typedef hash_map<const string, pair<double,double>, hash<string> > sample; // but this is fine. ??
typedef hash_map<const string, vector<pair<double,double> >, hash<string> >
  samples;
typedef vector<pair<double,double> > totals;

template <class T, class U> T fst(pair<T,U> p) { return p.first; }
template <class T, class U> U snd(pair<T,U> p) { return p.second; }
template <class T> vector<T> concat(const vector<vector<T> >& as) {
  vector<T> acc;
  for(int i=0; i < as.size(); i++) {
    acc.insert(acc.end(), as[i].begin(), as[i].end());
  }
  return acc;
}

Also I wrote fst, snd and concat. Told you this was a Caml port.

// = concat (maptimes (fun () -> choice dialect) 1000)
template <class T> vector<T> permutation(const vector<vector<T> >& dialect) {
  vector<T> acc;
  for(int i = 0; i < SAMPLES; i++) {
    vector<T> tmp = choice(dialect);
    acc.insert(acc.end(), tmp.begin(), tmp.end());
  }
  return acc;
}

Speaking of Caml, permutation's leading comment was the Caml code. Ahh, those were the days. Anyway, permutation is a bad name: it's actually a 1000-sentence sample from a corpus. A corpus is a vector of sentences, which are themselves vectors of strings, hence the type: vector<vector<T> >. I don't know why I didn't just use the typedef names. Wait, that's because I wanted it to be generic, which the typedefs are not. I don't know why I wanted that. No sense in trying to reuse code around here—it's a lost cause.

Anyway, choice(dialect) chooses a single sentence randomly. The words of the sentence are then inserted into the accumulator. (You can see why I didn't call it sample: I had already used it as a typedef: zip_ref below returns a sample.)

sample zip_ref (entry& h, entry& h2) {
  // I can't get these to be const entry&; apparently there is some
  // trouble with mixing iterator types in the h2[i->first] and h[i->first]
   sample h3;
   for(entry::const_iterator i = h.begin(); i!=h.end(); i++) {
     h3[i->first] = make_pair(i->second, h2[i->first]);
   }
   for(entry::const_iterator i = h2.begin(); i!=h2.end(); i++) {
     entry::const_iterator loc = h.find(i->first);
     if(loc==h.end()) {
       h3[i->first] = make_pair(h[i->first], i->second);
     }
   }
   return h3;
}

I don't understand C++'s type system. I can't pass const entry& h, but I can get a const_iterator from it.

Basically zip_ref combines two hash maps created by count (remember, that's a map from string to int), returning a map from string to pair of int,int (which is a sample). The first loop fills in all the values that are present in at least h. The second loop fills in all the values that are present only in h2. It checks to make sure that it only fills in values that don't exist in h. That's what comparing h.find(i->first)==h.end() does. But this is not necessary because C++'s hash_map::operator[] uses the int default constructor if the value is not in h. In fact, I rely on that inside the if, so all the if does is save some redundant overwrites.

Hmf. This probably isn't worth it, because I have to copy the whole hash anyway when I return from zip_ref. I should really get rid of the whole check inside the second loop. As I try to remember why I did it this way, I'm pretty sure I can blame this on the Caml original too. C++ has default constructors, which Caml doesn't.

sample normalise(const strings& a, const strings& b,
                 int iterations=5, size_t total_types=0) {
  entry tmp1 = count(a); // Because zip_ref takes entry&, not const entry&,
  entry tmp2 = count(b); // I have to provide real l-values to zip_ref
  sample ab = zip_ref(tmp1, tmp2); // that's what Bjarne Stroustrup said anyway
  double tokens_a = a.size();
  double tokens_b = b.size();
  double tokens = tokens_a + tokens_b;
  double types = tmp1.size() + tmp2.size();
  double ci, fa, fb, f;
  for(int i = 0; i < iterations; i++) {
    for(sample::iterator i = ab.begin(); i!=ab.end(); i++) {
      ci = i->second.first + i->second.second;
      fa = i->second.first / len_a;
      fb = i->second.second / len_b;
      f = fa + fb;
      i->second.first = (ci * fa / f) * 2 * types / tokens;
      i->second.second = (ci * fb / f) * 2 * types / tokens;
    }
  }
  return ab;
}

Normalise tries to undo some common skewing of the data so that the really interesting features are not drowned out by a few uninteresting-but-common features. It normalises for two properties.

The first is for sentence length: when you sample 1000 sentences, some sentences are longer than others. The corpus with longer sentences will have higher features counts just because it has more words per sentence. To get around this, each feature is scaled by its total corpus size:

// 0. Find total count
ci = i->second.first + i->second.second;
// 1. Convert to frequency
fa = i->second.first / tokens_a;
fb = i->second.second / tokens_b;
// 2. Find total frequency
f = fa + fb
// 3. Scale by total count and total frequency
cia = fa * ci / f
cib = fb * ci / f

For example, say you have a feature f that happens 8 times in corpus A and 10 times in corpus B. "Oh ho!" you say (if you are pretending to be overbearingly English), "Oh ho! f is 0.8 times as important in A as it is in B." (Being careful to italicise the f, which is difficult in speech but barely possible if you try hard enough).

But wait! What if you knew that corpus A only had 40 words while corpus B had 100? Now that 8 looks a lot more important. So it's better to use frequencies: 8 / 40 = 0.2 and 10 / 100 = 0.1. It turns out that f is twice as important in A as it is in B. But frequencies will mess up later steps of the normalisation, so let's share the original ci = 10 + 8 occurences between them in proportion to their important. So cia = (0.2 / 0.3) * 18 = 12 while cib = (0.1 / 0.3) * 18 = 6.

The other normalisation is for corpus variety. If a corpus has a lot of variety, it will have a lot of features that only occur a few times. On the other hand, if a corpus is fairly boring, it will have a few features that occur a lot. This normalisation pushes up the rare-but-interesting features from the variety corpus so they don't get drowned out by the common features from the boring corpus.

To do this, it calculates three values:

// 1. The number of features
double tokens = tokens_a + tokens_b;
// 2. The number of types of features in both A and B
double types = tmp1.size() + tmp2.size();
// 3. The total scaled count
cia *= 2 * types / tokens;
cib *= 2 * types / tokens;

The best way to think about this is that we are finding the average count of tokens per type, then scaling each token to find out how many times average it occurs. This scaling should make distances between sites depend more on the interesting features: the ones that occur more than average, instead of just the ones that occur more.

Going back to the previous example, imagine that corpus A (from the previous example) has only 5 types of features for its 40 total features. A pretty boring corpus—probably somebody talking about the weather for minutes on end. In contrast, let corpus B have 30 types for its 100 total features. This is an articulate interviewee, capable of using all types of crazy grammatical constructions in a short time.

This means together, corpus A and B have a combined variety of 2 tokens per type. (2 = (100 + 40) / ((30+5) * 2) = 140 / 35 * 2 = 140 / 70). The extra division by 2 comes in because 100+40 and 30+5 are the averages for both corpora at once. To get the average for just one, divide by two.

So, since the combined average count is 2 tokens per type, a feature type that happens 12 times is 6 times more important than average. So, in corpus A, f will be 6 = 12 / 2 = 12 / ((100+40) / ((30+5) * 2)) = ... Likewise, f becomes 3 in corpus B.


Anyway, math explanation aside, the structure of normalise is pretty bad—it handles all the conversion of raw features into fully counted maps AND does all the normalisation. The problem is that the normalisation needs lots of information, like the number of features, the number of feature types, etc. So normalise has way to much responsibility because it needs to know too much. (And I don't want to make a separate struct to hold a bunch of info for passing around.)

Now for the specific distance measures. They're pretty simple. Like, so simple that I've copy-pasted new ones from the old ones and got them to work without a single compile error. I know. That never happens, especially in C++. The reason that they're so simple is that once the sample is created and normalised, all that's really left to do is sum in some fancy way. (The simplest, R is just sum . (map (abs . (-))).)

double r(const sample& c) {
  double total = 0.0;
  for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
    total += abs(i->second.first - i->second.second);
  }
  return total;
}
inline double sqr(double d) { // or std::pow in cmath
  return d*d;
}
double r_sq(const sample& c) {
  double total = 0.0;
  for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
    total += sqr(i->second.first - i->second.second);
  }
  return total;
}
// this is actually KL symmetric divergence because it measures
// KL(P||Q) + KL(Q||P)
double kl(const sample& c) {
  double total = 0.0;
  for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
    if(i->second.first != 0 && i->second.second != 0) {
      total += i->second.first * log(i->second.first / i->second.second);
      total += i->second.second * log(i->second.second / i->second.first);
    }
  }
  return total;
}

// this is Jensen-Shannon divergence, which is based on KL divergence
double js(const sample& c) {
  double total = 0.0;
  for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
    if(i->second.first != 0 && i->second.second != 0) {
      double middle = (i->second.first + i->second.second) / 2;
      total += i->second.first * log(i->second.first / middle) / 2;
      total += i->second.second * log(i->second.second / middle) / 2;
    }
  }
  return total;
}
double cos(const sample& c) {
  double a_sq = 0.0;
  double b_sq = 0.0;
  double total = 0.0;
  for(sample::const_iterator i=c.begin(); i!=c.end(); i++) {
    a_sq += i->second.first * i->second.first;
    b_sq += i->second.second * i->second.second;
    total += i->second.first * i->second.second;
  }
  return 1 - total / (sqrt(a_sq) * sqrt(b_sq));
}

At one point I had Caml-equivalance comments before all the functions, but they made me feel sad so I took them out. If you want to know about Kullbeck-Leibler divergence or Jensen-Shannon divergence, Wikipedia has you covered. Cosine (dis)similarity is also pretty well understood.

bool comparepermutation(const dialect& a, const dialect& b) {
  dialect both_ab(a);
  both_ab.insert(both_ab.end(), b.begin(), b.end());
  int gt = 0;
  for(int i=0; i < ITERATIONS; i++) {
    sample total = normalise(permutation(a), permutation(b), 5);
    double total_r = R_MEASURE(total);
    size_t total_types = total.size();
    double perm_r =
      R_MEASURE(normalise(permutation(both_ab), permutation(both_ab), 5));
    if(perm_r > total_r) {
      cout << '-' << flush;
      gt++;
      if(gt > SIGNIFICANCE) return false;
    } else
      cout << '.' << flush;
  }
  cout << endl;
  return true;
}

OK, one last processing function: a permutation test. A permutation test tests for statistical significance. Here, the null hypothesis is that there the distance produced by r could be produced by ANY two random sites. To test the null hypothesis, I combine corpora A and B to both_ab, then I sample twice from both_ab. If the null hypothesis is false, then this distance between these two will be less than the distance between A and B.

Of course to make sure this is significant for p < 0.05, I have to run it at least 20 times. That's what ITERATIONS is (and SIGNIFICANCE is ITERATIONS / 20). R_MEASURE actually varies over all the distance measures I showed above, so I guess it should be just called MEASURE. Oh well.

Anyway, the loop can break as soon as more than 5% of the random mix distances are larger the original distance—p is no longer less than 0.05.

Yeah, so the last few functions just read in files and kick off the process. Nothing too complicated here, just a nested loop to read in sentences, one word per line, separated by ***.

pair<string, vector<vector<string> > > readfile(const char* filename) {
  ifstream f(filename);
  if(!f) {
    cout << filename << " is not found." << endl;
    throw filename;
  }
  string lang;
  getline(f,lang);
  vector<vector<string> > sss;
  vector<string> ss;
  string s;
  while(f) {
    getline(f,s);
    if(s=="***") {
      sss.push_back(ss);
      ss = vector<string>();
    } else {
      ss.push_back(s);
    }
  }
  sss.push_back(ss);
  return make_pair(lang, sss);
}

Finally, in icesig.cpp:

#include "icecore.h"
int main(int argc, char** argv) {
  if(argc==2) {
    pair<string, vector<vector<string > > > l1 = readfile(argv[1]);
    cout << "Control: " << l1.first << endl;
    cout << comparepermutation(l1.second, l1.second) << endl;
  } else if (argc==3) {
    pair<string, vector<vector<string > > > l1 = readfile(argv[1]);
    pair<string, vector<vector<string > > > l2 = readfile(argv[2]);
    cout << l1.first << endl;
    cout << l2.first << endl;
    cout << comparepermutation(l1.second, l2.second) << endl;
 } else {
    cout << "Not enough arguments: " << argc << endl;
    }
  return 0;
}

This program calls readfile and comparepermutation to find out whether the distance between the two corpora is significant. Other versions return the distances or the top features. They're very similar. This is the simplest so that's what I went with.


Anyway, if you have comments or questions, I will try to answer them. I am always happy to find bugs in this code.

1 comment

Comment from: Jack [Visitor]
This information is probably too late to be helpful, but FYI: C++0x has std::unordered_map (including a preset specialization of the hash function for std::string) and a full set of random number generators for various distributions (uniform, normal, lognormal, bionomal and several others).

Support for both of these is included in the latest versions of GCC (along with lambdas and some simple type inference support which makes life a little nicer if you're coming from an ML/Haskell/Scala sort of worldview).
24/02/10 @ 14:13

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by b2evolution free blog software