« typed module for PythonWii Classic Controller on Mac »

Math Notation Is Terrible

06/06/08

Permalink 11:51:03 am, 1532 words
Categories: Code, Linguistics

Math Notation Is Terrible

I’ve thought math notation was a bad way to express a lot ideas for a while, but this is the first time that I’ve had a good example and comparison to code.

Reasons

  1. Massive use of indices. This creates a typeless environment, in which it’s hard to connect one equation to the next, because everything is connected by indices…And these indices are invariably named i or j.
  2. That’s right, you’re only allowed to use single-letter variable names. To avoid running out, people use capitals, greek letters and fancy italics in various incompatible ways.
  3. Math notation often collapses multiple items into what would be a single object in a computer program, but only specifies how they relate in a separate figure. The rest of the time, the relation is implicit.
  4. Unlimited syntax!
  5. Yet, because some syntaxes are popular, they get re-used a lot to mean different things. Juxtaposition is the worst, because it’s implicit and thus doesn’t require writing anything.
  6. Most importantly, equations do not run on a computer. That means that something supposed to be fully formal, math, isn’t. It’s just like code examples in books–unless they are automatically tested every time the book is compiled, then you have to assume they’re wrong. And you can’t automatically test equations.

Sussman has talked about this for some time, and I’m sure others have as well. (Skip to 5:00 in the video, although the whole thing is worth watching)

Math notation

Here is an example from Manning and Schutze’s standard textbook, Foundations of Statistical Natural Language Processing. The book (locally, at least) has the reputation for being a dense reference rather than a good teaching book. If you’re looking for a first introduction to NLP/computational linguistics, check out Jurafsky and Martin’s Speech and Language Processing, although if you’re going into the field, you really need both; Manning and Schutze cover more topics more completely, but at the expense of comprehensibility.

The chapter on Hidden Markov Models is a good example. The equations below are the formal part of the explanation of efficient prediction of a state sequence, given some output. For example, if you’re given “The dog barks” and an HMM whose states are parts of speech, you can use this algorithm to efficiently calculate that DET-N-V is the most likely state sequence.

Here we see the list of definitions. This is not so bad, except that Pi, A and B are not really mnemonic, at least not for me. Guess S was already used, though. K is sort of mnemonic, but I can’t remember why. Somehow I remember the standard letter for alphabet as A (oops, already used). The worst part for me is that I don’t naturally pay attention to the i,j subscripts, even though they are critical in telling you how the various parts of the data structure are connected.

This is a standard recursive definition, with some minor annoyances like the fact that the second clause is alpha_j, not alpha_i. And its parameter is t+1, not t, so you have to remember that t in the equation means ‘previous row’. The real problem is the second half of the second clause, which is filled with tiny little indices and followed by a mysterious 1 ≤ t ≤ T, 1 ≤ j ≤ N that seems to indicate that the thing is iterated multiple times…somehow. It’s probably T*N times anyway. The total is fairly simple, so much so that it’s a better starting point for understanding how the whole thing works.

(These diagrams were regenerated manually, so any typos are my own. The Latex code behind them is probably quite close to the originals, though.)

Straightforward translation to code

Here’s a fairly straightforward translation to code. It depends on the data structure Model, shown above. No surprises there, except that most of the documentation has disappeared into the name and type of the fields.

state = str
edge = (state,state)
Model = struct('Model',
               alphabet=[str],
               states=[state],
               initials=[float],
               transitions={edge:float},
               outputs={edge:{str:float}})

The mathematical definition also relies on memoisation to make the problem look recursive and yet remain efficient. To mimic this, I manually record each call to the function in a table, although a language like Haskell would do this automatically, or I could have written a memoisation decorator.

def prob(outputs, mu):
    f = {}
    def forward(time, dst):
        if time==0:
            f[time,dst] = mu.initials[mu.states.index(dst)]
        else:
            f[time,dst] = sum(f[time-1,src]
                                * mu.transitions[src,dst]
                                * mu.outputs[src,dst][outputs[time-1]]
                                for src in mu.states)
        return f[time,dst]
    return sum([[forward(time, state) for state in mu.states]
                for time in range(len(outputs)+1)][-1])

Tellingly, the only bug in the first version of the code was in the indexing. I forgot that both instances of time in the sum have one subtracted from them. When I got an oobounds error from inside forward, I just assumed it arose from the range(len(outputs)+1) expression, because I never know whether to add 1 when translating from 1-based examples to 0-based. So I took out the +1 and wondered why my answer was still buggy. Finally I noticed that I had f[time-1,src] but mu.outputs[src][dst][outputs[time]].

Moral of the story: indirection is hard to grok and the language can’t help you because you’re just throwing many small ints, all named i.

The straightforward translation is a good compromise between simple code and similarity to the formula. There’s nothing too surprising here for imperative programmers. More importantly, the algorithm is now completely formal–a machine following only the ’simple’ rules of a Python interpreter can produce the answer just as well as a Real Mathematician. In addition, a lot of connections implicit either in the semantics or syntax of mathematical notation show up. Properties of the model are textually associated with it, and memoisation is visible as well (though it need not be). The 1 ≤ t ≤ T, 1 ≤ j ≤ N is better connected to the rest of the process instead of floating to the side, waiting to be figured out. And the syntax doesn’t allow you to pass parameters as subscripts to the function name! (That drives me crazy)

But it’s possible to do a lot better. There is still a lot of explicit indexing, which means that probably we aren’t fully taking advantage of the shape of the problem.

Optimised, functional version

If you read functional code and don’t care about similarity to the formula, which was hard to understand in the first place, the following code is more efficient, shorter and easier to read. Memoisation is gone: only two rows of the table are kept–the current one and the previous one. This is fine because the previous version never looked further back than time-1. In fact, it makes the correspondence explicit between each state and its entry in the previous row because they are explicitly zipped together. The only indices left are edges–transition and emission probabilities.

def prob(outputs, mu):
    def forward(prev, output):
        return [sum(prob*mu.transitions[src,dst]*mu.outputs[src,dst][output]
                    for prob, src in zip(prev,mu.states))
                for dst in mu.states]
    return sum(reduce(forward, outputs, mu.initials))

This version is idiomatic functional code that adequately and formally expresses the algorithm even in an explicitly optimised form. Tellingly, this version had no bugs. Well, almost. I forgot that foldl is called reduce in Python and has its arguments oorder.

Example run

Here’s the example from chapter 9. The answer given in the book is 0.315, not 0.361. As far as I can tell by inspecting the memoisation table, it’s a rounding problem; the early numbers match up at first and fall out of sync. Python isn’t very good at math, it’s true.

>>> mu = Model(states=['CP', "IP"],
           initials=[1.0, 0.0],
           alphabet=['ice-t', 'lemon', 'coke'],
           transitions={("CP","CP"):0.7, ("CP","IP"): 0.3,
                        ("IP","IP"):0.5, ("IP","CP"): 0.5},
           outputs={("CP","CP"):{'ice-t':0.1, 'lemon':0.3, 'coke':0.7},
                    ("CP","IP"):{'ice-t':0.1, 'lemon':0.3, 'coke':0.7},
                    ("IP","IP"):{'ice-t':0.7, 'lemon':0.2, 'coke':0.1},
                    ("IP","CP"):{'ice-t':0.7, 'lemon':0.2, 'coke':0.1}})
>>> forward(['lemon', 'ice-t', 'coke'], mu)
0.036119999999992

I just noticed that transitions and outputs could be modified to be a matrix rather than a lookup table, allowing me to loop over them directly. I’ll let you know how it turns out.

Struct

Here is struct in case you wondered:

def struct(name='struct', super=object, **typeargs):
    def leftovers(passed, defaults):
        return dct.extract(defaults, *set(defaults) - set(passed)).items()
    class K(super): pass
    @check(K, None, **typeargs)
    def __init__(self, **kwargs):
        super.__init__(self)
        for attr, v in kwargs.items():
            setattr(self, attr, v)
        for attr, v in leftovers(kwargs, typeargs):
            setattr(self, attr, default(v))
    K.__init__ = __init__
    K.__name__ = name
    return K

Using it should be very familiar if you ever used Common Lisp, except that you can’t specify the default value when a parameter is missing. Also the constructor is type-checked. I may make an non-checked version too. (This function is in the typed module along with typed.check.)

I am totally getting into this typed programming in Python, guys. It’s almost as good as I had imagined untyped programming in Haskell would be.

28 comments

Comment from: Jonathan Ville [Visitor] Email
Some of us are extremely good programmers but find mathematical notation extremely useful. You must just suck at maths.

Do you know what mathematical notation is for? I can think of countless algorithms where the things that happen in the algorithm are meaningless without an invariant. You can't write an invariant in the code; you need mathematical notation.

Mathematical notation can be used for specifying what an algorithm does; code or can be used to show how it does it. There is a difference between these things.

And there are thousands of concepts (some of which helped advance algorithm design) that can only be expressed using sets.

Your code is awful too.
06/06/08 @ 14:02
Comment from: Carl Lumma [Visitor] Email · http://lumma.org/microwave
Great post. To play devil's advocate, one thing about math notation is that it's extremely compact, which might have been good when pen and ink cost money (and writer's cramp was an issue).
06/06/08 @ 14:28
Comment from: jt [Visitor] Email
I'm sorry, I don't agree. I think code is way too verbose and particularly in imperative programming paradigms where you must "walk the code" in your head as if you were a CPU (as opposed to functional programming).

My suggestion is that you try harder at Math notation.
06/06/08 @ 14:45
Comment from: tom [Visitor] Email
ZOMG! Math notation is hard because you need to think when you read it!
06/06/08 @ 14:52
Comment from: Saurabh [Visitor] Email
Dude, anybody with a decent education will find that your code sucks balls at readability compared to your math. Do the community a favour - stay far away from math, and STFU about it.
06/06/08 @ 15:43
Comment from: pozorvlak [Visitor] Email · http://pozorvlak.livejournal.com
I agree, and FWIW, I'm a mathematics grad student. Mathematical notation is in general awful.
06/06/08 @ 15:55
Comment from: Matthew [Visitor] Email
Here's someone with a master's degree in maths who agrees (albeit reservedly).

Those taking the piss out of him (ZOMG you suck at maths) are missing what's quite a good point.

The point being: the fields of computer science and software engineering have spent a lot of time thinking hard about how to express precise formal systems in a way which is clear and quick to understand.

Why - when you're writing software, it really pays to design formal systems which result in expressions that are easy-to-read, clear and self-documenting - because if they're not, you'll find it prohibitively expensive to maintain the code.

In mathematics, on the other hand, the incentives to make things comprehensible are considerably less. Mathematics doesn't need to be 'maintained' in the same way as code - it's more just like marking a flag in the ground saying "I proved it; here's the bare minimum I need to convince my referees". Sure, one needs to teach people mathematics and real effort does go into this, but as an academic ones career motivations and incentives tend to come more from research than teaching.

Granted, in mathematics the underlying concepts are often deeper and require more intuition and hard work to grasp. But don't underestimate the role of syntax in making something understandable, and have a little humility in learning lessons from related fields which have worked hard at these problems from a slightly different perspective.

To the guy who thinks mathematics isn't algorithmic - read up about the Curry-Howard isomorphism - a constructive mathematical proof of a proposition IS an algorithm. And what's more this can be used to great effect to help those with a programming background understand mathematics. More work, IMO, should go into this approach to teaching.
06/06/08 @ 16:13
Comment from: Jeff [Visitor] · http://jeffbergman.com
I agree with your general point. I have found that mathematical notation is generally not very intuitive. Once you have invested time to learn it and understand the concepts it represents it provides a concise way to write certain things.

But, I often find it much easier to understand the meaning of some equation by rewriting it in some form of pseudo-code. I also think the notation can be a barrier of entry for people who don't already know it and that even different people use different mathematical notation for similar concepts.
06/06/08 @ 16:31
Comment from: Simon [Visitor] Email · http://www.simonbelak.com
@Jonathan: wouldn't functional programming yield invariants? Let's not forget compilers are quite apt at identifying invariants; even in procedural code.

Considering how programs are equivalent to mathematical proofs I find it rather hard to believe mathematical notation has an unsurpassable advantage in terms of expressiveness.

Your implying that sets cannot be modeled in a programming language makes me to wonder whether you were being self-ironic with "[s]ome of us are extremely good programmers".
06/06/08 @ 16:46
Comment from: sandersn [Member]
I conflated two problems here. I gave the impression that I think that math notation should be replaced by code. I don't. I posted this because I had to implement an HMM, and I used Manning and Schutze as the textbook source. So I had some code already, and liked it better than the formal explanation.

My point, though, is that my Manning and Schutze example exposes two problems. First is that math notation is awful. It needs to be replaced or updated or something, but I can't really say more than that--I'm going on authority from smarter people than me (see the Sussman video). I am not a mathematician and given the opportunity I will to continue to avoid math that I don't need for my linguistic work.

Which leads to problem two: the textbook is on *computational* linguistics. Mathematical linguistics is something else entirely. We're not proving anything--we're interested in doing something. More code (pseudo or otherwise) would have been appropriate since space is not the issue that it is with a journal article. Unfortunately, Manning and Schutze is harder to read than other textbooks in computational linguistics, I think mostly because they do squeeze in more topics and have to settle for less comprehensible writing.
06/06/08 @ 17:00
Comment from: Ugly American [Visitor] Email
Notation and varible names are routinely recycled in many different fields.

Anyone who claims math notation is always unambiguous hasn't studied very widely.
06/06/08 @ 17:53
Comment from: Deinst [Visitor] Email
I do not think that Sussman supports your argument though. See for example http://mitpress.mit.edu/SICM/book-Z-H-74.html#%_sec_6.3 (a random page from SICM). There is a whole lot of stuff that looks a whole lot like what you are complaining about.

What Sussman is saying is that physicists have been being sloppy with their math for generations. This isn't exactly news.

In programming terms math is designed to be very easily refactored. Usually one needs to use some formula in a slightly different situation than the one for which it was derived.
06/06/08 @ 18:34
Comment from: lux [Visitor] Email · http://luxonstuff.wordpress.com/
in general i also agree, so heres the beginnings of a soultion. basicaly, how about a symbol to prefix a symbol representing a constant not explained earlier in the text, to indicate it is the most well known use of that symbol. π works well for 3.14.., but e blends in so easily with the author’s own arbitrary letters that 2.718.. can take a few readings to get.
06/06/08 @ 18:36
Comment from: Goedel [Visitor] Email
Couldn't read through this... math wasn't found for lazy CS students, so don't bother trying, mkay?
06/06/08 @ 18:57
Comment from: Galo [Visitor] Email
So...this is where the "Why do I need a CS degree" arguments begin?
You need it so you don't look lame to the whole programming community. Math is beautiful. Programs are ugly.
06/06/08 @ 19:23
Comment from: ercd [Visitor] Email
Even though I agree with most of your point, the choice of the letters in the HMM algorithm are not so bad. Most of them come from what is usually used to describe Markov chains. I hope this will give a better insight of the algorithm:

  • pi are for the initial probabilities of the Markov chain.

  • A is very similar to the Adjacency matrix of the directed graph between the states (http://en.wikipedia.org/wiki/Adjacency_matrix): that's why it is very common to use it for state transition probabilities.

  • X_t is the state at the time t. It is very common to use markov chains to model a succession of states in a period of time: that explains the "t". The X is due to the fact that X_t is a random variable and random variable are usually denoted with a capital "X".

  • In a directed graph, the index i is usually used for the origin of an edge and j for the end of an edge.


I believe that mathematical notation is a very powerful and concise way to describe an algorithm. It can be obscure at first but when you learn what's behind (in that case: Markov chains, which are, imho, one of the coolest object in math) all this notation usually makes sense.

06/06/08 @ 19:28
Comment from: sucker!!! [Visitor]
your code sucks and your post sucks too.
06/06/08 @ 20:20
Comment from: macroencephalic eloi [Visitor] Email
Where is module 'typed'? Is it your own?
06/06/08 @ 21:13
Comment from: Timmy Jose [Visitor] Email · http://z0ltan.wordpress.com
I am just flabbergasted at your insinuations. I have studied math upto the undergraduate level but I am a programmer primarily, I find those mathematical notations so wonderfully lucid and clear that I begin to wonder if this blog was a deliberate ploy!

I have said this before and here I repeat it again - a great programmer is basically a good mathematician. Case rested.
06/06/08 @ 22:34
Comment from: Florian Jung [Visitor]
I think you missed the point what mathematics really is about. Of course as a programmer you may be most interested in the end results you can use for numerical calculations, but the main goal of mathematics is to derive such formulas, not use them!

This goal manifests in the use of notation that is optimized for easy manipulation of formulas, not for concrete calculations. For example single character variables are just so much faster to write by hand than whole words. You may argue that one can use computers to manipulate formulas, but until now I haven't found any way to do this efficiently since even the most advanced computer algebra systems (like Mathematica or Maple) quickly fail when you want to do abstract calculations for example in category theory or differential geometry.
07/06/08 @ 01:40
Comment from: Farley Knight [Visitor] Email · http://blog.farleyknight.com

# Massive use of indices. This creates a typeless environment, in which it’s hard to connect one equation to the next, because everything is connected by indices…And these indices are invariably named i or j.


I remember writing a proof for my discrete math class and using tons of indices.. It felt like such a waste of time because I would never write so many indices if it were Python/Ruby..
07/06/08 @ 07:25
Comment from: Daerin the Cluer [Visitor] Email
Jung: God forbid that anyone might actually use mathematics, instead of just playing around with it.

Ercd: Another deplorable thing in maths is the use of variables that have implicit meanings. Programmers have learned not only to not do this (a random variable should be called something like random_number, not captial X), but to document everything that isn't spelled out in the names of the functions and variables and their relationships. If we didn't, you would find this in the Linux Programmer's Manual:


NAME

g

SYNOPSIS

g(p, f, E, 学);

DESCRIPTION

This function does what functions named "g" typically do. If you know anything about programming, you should already know which parameters have what meaning, and you should also know their types.

RETURN VALUE

The type and meaning of the return value should be obvious to anyone who has a decent education.

SEE ALSO

Graduate-Level Pattern-Matching, 6th Edition, Addison-Wesley, ISBN 978-2848-4372-38


Markov chains may indeed be very cool, but that is independent from the notation used to represent them.
08/06/08 @ 08:34
Comment from: Daerin the Cluer [Visitor] Email
Jose: To someone who has never seen the outside of a classroom, math notation can seem very natural. To everybody else, it looks like Egyptian hieroglyphs.
08/06/08 @ 09:02
Comment from: Daerin the Cluer [Visitor] Email
I just looked at that Sussman video. He correctly observes that math notation is a natural language, which evolves in unpredictable ways, and leaves much open to interpretation. He mentions a book he wrote which supposedly describes physics using the Scheme programming language instead of math notation (or that's what he said in the video).

The book is available for free online: http://mitpress.mit.edu/SICM/book.html

As it turns out, what Sussman advocates in the book is not using Scheme instead of math notation, but using a math notation he calls "functional math notation", which can supposedly be translated into Scheme if you use his scmutils library. Even though one of his complaints about math notation is that it's a natural language, his "functional math notation" is also a natural language, and it is certain to eventually diverge from the Scheme equivalent, until exact translation is no longer possible.

The notation he proposes is a typical obscurantist scheme. Here is an example:

http://mitpress.mit.edu/SICM/chap1-Z-G-2.gif

A rough translation into the allowed HTML looks like this:

F[ϒ] = LT[ϒ]

The blurb "F[ϒ]" is described as being a "function", even though the appendix in the book that describes this notation says that function-calls are f(x), while the function considered by itself is f.

The Greek letter is another function. It is not clear whether it is being passed as an argument to F, or if F is actually a data structure that contains many functions, and "F[ϒ]" selects one of those functions. Or maybe "F[ϒ]" is a composition of the two functions, and the result is a curried function.

The blurb sort of looks like an array reference in most programming languages, but then again, math notation sometimes uses brackets for multiplication. It is explicitly stated in English that the blurb
"F[ϒ]" represents the one function, but T is a function on its own, while "[ϒ]" modifies it somehow.

The blurb is never given in its Scheme equivalent, but Sussman introduces some of that in the next chapter. When he does, he makes sure not to disappoint the mathematicians' taste for obscurity.

He eventually introduces the concept of function composition. All that means is that ((compose foo bar) baz) == (foo (bar baz)). It is actually used above: That's what the ∘ character means in math notation.

But at this point, he wants to introduce what he claims to be a separate idea, that is the composition of "operators". This idea is supposedly so "separate", that a different operator is used to describe it. What operator? The multiplication operator.

That's right. The scmutils library overloads Scheme's * function so that, when a bunch of "operators" are passed to it, it returns the composition:

(* foo bar baz) ==> (lambda (args) (foo (bar (apply baz args))))

The "*" function does something subtly different if you pass functions that are not deemed to be "operators":

(* foo bar baz) ==> (lambda (args) (apply * (map (lambda (x) (apply x args)) (list foo bar baz))))

For numbers, the * function still multiplies them.

So this means that the normal * function is replaced by a function that does at least 3 totally unrelated things, depending on the types of the arguments given to it. Next, it should be extended so that if it's called with a list of strings, it starts a home-brewed Lisp interpreter with an integrated, bloated text editor and interprets the strings as the names of files to be edited.

BTW, the +, -, and / operators are actually functions, not operators, according to the * function. "D" OTOH, is an operator. They are treated totally differently.

Sussman brings mathematical obscurity to programming, while claiming to do the exact opposite.
08/06/08 @ 21:51
Comment from: Jon Hangman [Visitor] Email · http://playhangmangames.com/
Yeah i agree with you really a nice post :)

http://playhangmangames.com/
02/05/09 @ 15:23
Comment from: Denis [Visitor] · http://thissidein.org
I guess incomprehensibility of mathematical notation comes with a general attitude towards non-mathematicians.

Most programmers are "lazy" and try to eliminate complexities that are not inherent in a certain task or concept. If an algorithm can only be understood by a handful of experts, can it be made simpler? If the code is hard to read, can it be made more readable? If a program requires weeks of studying, can it be made easier, even to the point where untrained users can understand it? I think that's what many modern languages and programming paradigms were designed for - to eliminate complexities that are not inherent to programming itself.

Mathematical field, on the other hand, seems much more elitist. If you can't even handle the notation, you have no business doing mathematics. I personally am willing to agree that if you don't understand mathematical notation, it's probably because you're a bad mathematician. But that's kind of the point here, isn't it - you have to be a good mathematician to understand mathematical notation, while you can be a pretty horrible programmer and have no difficulties with syntax itself.
13/10/09 @ 21:56
Comment from: Patrick C [Visitor]
Mathematical language is just more flexible than programming language. This necessarily means that programming languages are more uniformly used and hence understandable, whereas the precision and clarity of mathematical language depends largely on the person using it.

You can write mathematical language as clearly as you want. Programming languages are mathematical in nature, because they are used to make systems with provable characteristics. The post shouldn't draw a distinction between mathematical and programming languages, but between uses of mathematical language so precise and unambiguous a computer can understand it, and uses of mathematical language so abstract and obscure you need three PhDs to get past the first line.

All this being said, a lot of CS people don't see it that way which, IMHO, is why so much software is trash. Also, most math people don't like formal representation systems either, which is generally why they are such trashy programmers. What we need is CS people to understand is that programs are proofs written in a kind of unambiguous mathematical language, and math people to understand that programming is an activity of no less weight than constructing a valid proof with impeccably precise syntax.
12/06/10 @ 18:26
Comment from: junioridi calzzani [Visitor]
Why should we spend so much time learning how to describe a solution instead of spending the same amount of time trying to reach it?
24/06/10 @ 16:47

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)
powered by free blog software