| « typed module for Python | Wii Classic Controller on Mac » |
I’ve thought math notation was a bad way to express a lot ideas for a while, but this is the first time that I’ve had a good example and comparison to code.
Sussman has talked about this for some time, and I’m sure others have as well. (Skip to 5:00 in the video, although the whole thing is worth watching)
Here is an example from Manning and Schutze’s standard textbook, Foundations of Statistical Natural Language Processing. The book (locally, at least) has the reputation for being a dense reference rather than a good teaching book. If you’re looking for a first introduction to NLP/computational linguistics, check out Jurafsky and Martin’s Speech and Language Processing, although if you’re going into the field, you really need both; Manning and Schutze cover more topics more completely, but at the expense of comprehensibility.
The chapter on Hidden Markov Models is a good example. The equations below are the formal part of the explanation of efficient prediction of a state sequence, given some output. For example, if you’re given “The dog barks” and an HMM whose states are parts of speech, you can use this algorithm to efficiently calculate that DET-N-V is the most likely state sequence.

Here we see the list of definitions. This is not so bad, except that Pi, A and B are not really mnemonic, at least not for me. Guess S was already used, though. K is sort of mnemonic, but I can’t remember why. Somehow I remember the standard letter for alphabet as A (oops, already used). The worst part for me is that I don’t naturally pay attention to the i,j subscripts, even though they are critical in telling you how the various parts of the data structure are connected.

This is a standard recursive definition, with some minor annoyances like the fact that the second clause is alpha_j, not alpha_i. And its parameter is t+1, not t, so you have to remember that t in the equation means ‘previous row’. The real problem is the second half of the second clause, which is filled with tiny little indices and followed by a mysterious 1 ≤ t ≤ T, 1 ≤ j ≤ N that seems to indicate that the thing is iterated multiple times…somehow. It’s probably T*N times anyway. The total is fairly simple, so much so that it’s a better starting point for understanding how the whole thing works.
(These diagrams were regenerated manually, so any typos are my own. The Latex code behind them is probably quite close to the originals, though.)
Here’s a fairly straightforward translation to code. It depends on the data structure Model, shown above. No surprises there, except that most of the documentation has disappeared into the name and type of the fields.
state = str
edge = (state,state)
Model = struct('Model',
alphabet=[str],
states=[state],
initials=[float],
transitions={edge:float},
outputs={edge:{str:float}})
The mathematical definition also relies on memoisation to make the problem look recursive and yet remain efficient. To mimic this, I manually record each call to the function in a table, although a language like Haskell would do this automatically, or I could have written a memoisation decorator.
def prob(outputs, mu):
f = {}
def forward(time, dst):
if time==0:
f[time,dst] = mu.initials[mu.states.index(dst)]
else:
f[time,dst] = sum(f[time-1,src]
* mu.transitions[src,dst]
* mu.outputs[src,dst][outputs[time-1]]
for src in mu.states)
return f[time,dst]
return sum([[forward(time, state) for state in mu.states]
for time in range(len(outputs)+1)][-1])
Tellingly, the only bug in the first version of the code was in the indexing. I forgot that both instances of time in the sum have one subtracted from them. When I got an oobounds error from inside forward, I just assumed it arose from the range(len(outputs)+1) expression, because I never know whether to add 1 when translating from 1-based examples to 0-based. So I took out the +1 and wondered why my answer was still buggy. Finally I noticed that I had f[time-1,src] but mu.outputs[src][dst][outputs[time]].
Moral of the story: indirection is hard to grok and the language can’t help you because you’re just throwing many small ints, all named i.
The straightforward translation is a good compromise between simple code and similarity to the formula. There’s nothing too surprising here for imperative programmers. More importantly, the algorithm is now completely formal–a machine following only the ’simple’ rules of a Python interpreter can produce the answer just as well as a Real Mathematician. In addition, a lot of connections implicit either in the semantics or syntax of mathematical notation show up. Properties of the model are textually associated with it, and memoisation is visible as well (though it need not be). The 1 ≤ t ≤ T, 1 ≤ j ≤ N is better connected to the rest of the process instead of floating to the side, waiting to be figured out. And the syntax doesn’t allow you to pass parameters as subscripts to the function name! (That drives me crazy)
But it’s possible to do a lot better. There is still a lot of explicit indexing, which means that probably we aren’t fully taking advantage of the shape of the problem.
If you read functional code and don’t care about similarity to the formula, which was hard to understand in the first place, the following code is more efficient, shorter and easier to read. Memoisation is gone: only two rows of the table are kept–the current one and the previous one. This is fine because the previous version never looked further back than time-1. In fact, it makes the correspondence explicit between each state and its entry in the previous row because they are explicitly zipped together. The only indices left are edges–transition and emission probabilities.
def prob(outputs, mu):
def forward(prev, output):
return [sum(prob*mu.transitions[src,dst]*mu.outputs[src,dst][output]
for prob, src in zip(prev,mu.states))
for dst in mu.states]
return sum(reduce(forward, outputs, mu.initials))
This version is idiomatic functional code that adequately and formally expresses the algorithm even in an explicitly optimised form. Tellingly, this version had no bugs. Well, almost. I forgot that foldl is called reduce in Python and has its arguments oorder.
Here’s the example from chapter 9. The answer given in the book is 0.315, not 0.361. As far as I can tell by inspecting the memoisation table, it’s a rounding problem; the early numbers match up at first and fall out of sync. Python isn’t very good at math, it’s true.
>>> mu = Model(states=['CP', "IP"],
initials=[1.0, 0.0],
alphabet=['ice-t', 'lemon', 'coke'],
transitions={("CP","CP"):0.7, ("CP","IP"): 0.3,
("IP","IP"):0.5, ("IP","CP"): 0.5},
outputs={("CP","CP"):{'ice-t':0.1, 'lemon':0.3, 'coke':0.7},
("CP","IP"):{'ice-t':0.1, 'lemon':0.3, 'coke':0.7},
("IP","IP"):{'ice-t':0.7, 'lemon':0.2, 'coke':0.1},
("IP","CP"):{'ice-t':0.7, 'lemon':0.2, 'coke':0.1}})
>>> forward(['lemon', 'ice-t', 'coke'], mu)
0.036119999999992
I just noticed that transitions and outputs could be modified to be a matrix rather than a lookup table, allowing me to loop over them directly. I’ll let you know how it turns out.
Here is struct in case you wondered:
def struct(name='struct', super=object, **typeargs):
def leftovers(passed, defaults):
return dct.extract(defaults, *set(defaults) - set(passed)).items()
class K(super): pass
@check(K, None, **typeargs)
def __init__(self, **kwargs):
super.__init__(self)
for attr, v in kwargs.items():
setattr(self, attr, v)
for attr, v in leftovers(kwargs, typeargs):
setattr(self, attr, default(v))
K.__init__ = __init__
K.__name__ = name
return K
Using it should be very familiar if you ever used Common Lisp, except that you can’t specify the default value when a parameter is missing. Also the constructor is type-checked. I may make an non-checked version too. (This function is in the typed module along with typed.check.)
I am totally getting into this typed programming in Python, guys. It’s almost as good as I had imagined untyped programming in Haskell would be.
# Massive use of indices. This creates a typeless environment, in which it’s hard to connect one equation to the next, because everything is connected by indices…And these indices are invariably named i or j.