When you search a list for something and don’t find it, you have three options that work generally:
Typed languages* always choose the third option because they have an easy wrapping system built in: type definition. Plus typed languages believe in safety, and this way error checking is required. Untyped languages* vacillate between the three. The dirtier the language (eg Arc, Perl or C), the more likely that option 1 is the one used, because it allows you to write puns in which the ambiguous failure value is used as if it were valid. Unfortunately, it means that error checking is not required and does not take any single, recognisable form. The majority of languages seem to have compromised on option 2: you don’t have to check for errors, but when you do, you always use try/catch syntax.
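To make the trade-offs concrete, here is a minimal Python sketch of a list search written each way, going by how the options are described in this post: an ambiguous failure value, an exception, and a wrapped result. The function names and the `Found` wrapper are made up for illustration and are not from any library.

```python
from typing import Callable, List, NamedTuple, Optional


class Found(NamedTuple):
    """Hypothetical wrapper: a search returns Found(value) or None."""
    value: object


def find_or_none(pred: Callable, items: List):
    # Option 1: return an ambiguous failure value.
    # A matching item that is itself empty/falsy is easy to mistake for failure.
    for x in items:
        if pred(x):
            return x
    return None


def find_or_raise(pred: Callable, items: List):
    # Option 2: signal failure with an exception (checked with try/except).
    for x in items:
        if pred(x):
            return x
    raise LookupError("no matching item")


def find_wrapped(pred: Callable, items: List) -> Optional[Found]:
    # Option 3: wrap the success case so it cannot be confused with failure.
    for x in items:
        if pred(x):
            return Found(x)
    return None


lists = [[1, 2, 3, 4], [], [5, 6, 7]]
shorter_than_3 = lambda l: len(l) < 3

print(find_or_none(shorter_than_3, lists))  # []
print(find_wrapped(shorter_than_3, lists))  # Found(value=[])
```

In the first call the search succeeds, but the result is an empty list, which is falsy just like a failure would be. That is the kind of pun option 1 allows, and the kind of ambiguity option 3 rules out.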
I like all three options, actually. They have different trade-offs, and which you use depends on how you intend the code to be written and read. This brings us to Arc.
This suggestion on the Arc forum wants to make option 3 standard in Arc. To me, this obviously goes against the spirit of Arc. It’s very likely that if you’re searching a list of lists for one shorter than 3, then nil (the empty list) as the return value is fine–you’re going to add something to the returned list anyway. Furthermore, if you *do* want to wrap the return value, you don’t need to define a new type for it. The One Data Type philosophy of Lisp says that you already have a wrapper called cons.
But of course you’re free to use option 3 in your own code and make it pretty with some macros to help you define data constructors. Why not?
But this is precisely why I became disillusioned with Scheme. I really hate duplication, and in the Scheme world, every programmer has extended Scheme with identical, but slightly varying, language extensions. It’s like string classes in C++. The promise of Arc as a P-language** is precisely that this fragmentation won’t happen. But it probably will, because this ability seems to be the main attraction of macros. So I’ll continue to use Python which has this ability deliberately left out. Right now I wish I had never taken Friedman’s Programming Languages course. Ignorance is bliss.
Anyway, I am writing this here because I don’t want to write something so bitter and off-topic on a public forum. I restrict myself to complaining to myself in public here on this blog.
* Typed language is academicspeak for “statically typed language”. The language definition itself is typed, as opposed to the runtime, which is where the type-checking happens in a “dynamically typed language” (called ‘untyped’ in academicspeak). Some languages mix the models, like Java, and have crappy type checking both in the compiler AND in the runtime. I like languages that shift as much of the work as possible one direction or the other.
**One of Perl, Python, Ruby, PHP, Javascript, etc.
Post-script: About the only happy Lispers before Arc were the Common Lispers who didn’t get into arguments with Schemers or C++ programmers. But those CLers are delusional, consisting of people who compare their ‘superior’ language to languages of the early 90s, or kids who just learned Java and then skipped straight to Common Lisp. The fact is that there are now many languages that are a *lot* like Lisp, and their practical advantages usually outweigh the sheer power the baroque facilities of CL give you.
I was listening to an old favourite radio station of mine, Deep Mix Moscow, and I realised that I was reading the headings even though they are in Russian. That’s because I took a survey course of the Central Eurasian languages last year which required me to learn the Cyrillic alphabet. And Russian shares a lot of words with English, like “Programma”, “Forum” and “statistika”.
Fortunately, you can do the same. The Cyrillic alphabet is super easy if you already know the Latin one. It just shuffles the letters around a bit and adds a few letters for ‘ch’ and the French ‘g’ sound in ‘espionage’. You can pick it up in a couple of days. Here’s a good place to get started: http://www.geocities.com/Colosseum/Track/7635/alphabet.html. Yes, I know it’s geocities. You could also check Wikipedia.
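For a feel of how far the letter shuffle gets you, here is a toy Python transliterator that covers only the letters needed for the three words mentioned above. The table is a small hand-picked subset chosen for this example, not a complete or standard romanisation scheme.

```python
# Partial Cyrillic-to-Latin table: just enough letters for the examples below.
CYRILLIC_TO_LATIN = {
    "п": "p", "р": "r", "о": "o", "г": "g", "а": "a", "м": "m",
    "ф": "f", "у": "u", "с": "s", "т": "t", "и": "i", "к": "k",
}


def transliterate(word: str) -> str:
    # Letters missing from the table are passed through unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in word.lower())


for word in ["Программа", "Форум", "статистика"]:
    print(word, "->", transliterate(word))
```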
Note: Hand-written Cyrillic is harder to understand than type-written. Volya, a Russian linguistics student in the same Central Eurasian class, wrote some stuff on the board which made no sense even after learning the type-written font. Fortunately handwriting is somewhat rare these days.
There’s a jargon term for blog spam, but I can’t remember what it is. Anyway, Josh Rose last year bought a Wal-Mart Special flat-screen TV which was *obviously* manufactured by a company that was the absolute lowest bidder and probably didn’t exist before 2005. It broke a year later, right after the warranty expired. Shockingly, Wal-Mart did the same thing in 2001 for DVD players. All my friends came back to college with Wal-Mart Specials that broke in two weeks. Or never worked. This is what happens when you have to carry the season’s hot item AND have the lowest price. Oh well. Josh is a loyal Wal-Mart employee I guess, or maybe was trying to drive up the stock price. He had the right response after trying to get it fixed, though. He bought a Sony TV and made a web site to warn others by tweaking Viore’s home page.
While you’re still reading, let me draw your attention to the request for beta testers for Our Word!, a program designed to help people who have never touched a computer before start translating the Bible from a related language to their native language. They need people to play around with it and file bug reports.
If you have an installation of Windows lying around, install it and see if you can get it to crash. It’s mostly written in .NYET, so it only *really* crashes when you manage to crash the underlying C++ edit control written by an SIL team to handle weird writing systems. Also, the example translation is from Hawaiian Pidgin to English, which is fun. The Hawaiian Pidgin name for the Bible is “Da Jesus Book”, for example.
Oh, and if you’re *still* reading, here are some more questions I’ve never got answered. Why do all the undergraduates form lines to get on the bus? What’s wrong with a big crowd? Does it really matter what order you get on the bus? Why do all the girls thank the bus driver when they get off? He’s paid for his job. He probably has to say “You’re Welcome” about a thousand times a day. Why are half the undergraduate girls on the bus hoarse when they talk? Is it the socially acceptable voice for unmarried women in Indiana? Or have they all been out drinking the night before? What are the 80% of kids with iPods on the bus *listening* to? It’s got to be pretty low quality because of the noise from the bus engine.
Writing a summary of Noah’s thesis work made me realise that I have never summarised my own work very well and that probably most of my friends don’t have a good idea of what I do. So I’m going to make a trio of posts and see how confusing they are. Once I’ve got the bugs worked out I’ll probably solidify the explanation on my home page.
The linguistics I do falls into three areas: work, school and fun. The work area is for my research assistantship with Steve Chin, studying cochlear implant speech. We study the linguistic development of the early implanted kids–the children’s hospital at Indy is a centre for very early implants. The school area is work that I hope will lead to a completed dissertation at some point. Most of it has been computational dialectometry–using a computer to measure difference between varieties of language. The last area is a ‘hobby’, computational Optimality Theory. This is a hobby because computational linguists don’t really believe OT is a good mental model and because phonologists don’t like thinking like a computer. So it’s too unpopular to really succeed–there are only a couple of people who work on it. But it’s a lot of fun because it leads to very tricky code.
I’ll start with my work research because that was the first real research I did; I was still taking classes and working for Steve forced me to start original research. Phonological research on cochlear implant users proceeds in two directions: the first direction is research in limits of what kinds of language humans will learn, even with degraded, weird, electronic input*, and whether an algorithm can be made to imitate that kind of learning. The second direction is getting a computer to classify CI users’ errors so that particular pathologies can be identified and the kids can be taught to improve using that classification. Using a computer is good for this because it means you don’t have to have a resident linguist at every hospital. This second direction is what helps convince the NIH to fund the project.
During my part in the project, I’ve worked on measuring development with distance measures. I’ve measured distance between adult English (“standard” English) and implant users and between implant users themselves. That sounds really impressive, but the code behind it is really simple. It’s just Levenshtein distance (string edit distance) and another distance measure that uses a naive Bayes classifier. I mean, here’s the code, basically. (I leave off some initialisation code and use some Python features that only exist in my imagination)**
def levenshtein(s1,s2,(ins,dele,sub)): # 'del' is a reserved word, even in imaginary python
    for i in len(s1): # this is an imaginary python in which numbers provide iterators
        for j in len(s2): # wouldn't it be lovely?
            lev[i][j] = min(lev[i-1][j] + ins(s1[i]),
                            lev[i][j-1] + dele(s2[j]),
                            lev[i-1][j-1] + sub(s1[i],s2[j]))
    return lev[-1][-1]
# if you are short on imagination, here are simple ins/del/sub implementations
def simple_ins(c):
    return 1
def simple_del(c):
    return 1
def simple_sub(c1,c2):
    return 0 if c1==c2 else 2 # that is totally real python, can you believe it?
def maxlikelihood(s1,s2):
    likelihoods = dct.count(lst.window(s1,2))
    return sum(likelihoods.get(bigram, 0) for bigram in lst.window(s2,2))
# if you are short on imagination, here is dct.count and lst.window
def dct.count(l): # more imaginary python
    d = {}
    for x in l:
        d[x] = d.get(x,0) + 1
    return d
def lst.window(l, n):
    return [l[i:i+n] for i in len(l)-n+1] # +1 so the last window is included
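For anyone who wants to run it, here is a sketch of the same two measures in Python that exists outside the imagination. The initialisation of the table, the fixed per-operation costs and the name `bigram_score` are assumptions added for this example. The costs follow the simple ins/del/sub functions (1, 1 and 2), and the bigram measure mirrors `maxlikelihood` in that it counts the bigrams of the first string and sums those counts over the bigrams of the second.

```python
from collections import Counter


def levenshtein(s1, s2, ins_cost=1, del_cost=1, sub_cost=2):
    # lev[i][j] = cost of turning the first i symbols of s1 into the first j of s2.
    lev = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        lev[i][0] = lev[i - 1][0] + del_cost
    for j in range(1, len(s2) + 1):
        lev[0][j] = lev[0][j - 1] + ins_cost
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            same = s1[i - 1] == s2[j - 1]
            lev[i][j] = min(
                lev[i - 1][j] + del_cost,
                lev[i][j - 1] + ins_cost,
                lev[i - 1][j - 1] + (0 if same else sub_cost),
            )
    return lev[-1][-1]


def window(seq, n):
    # All contiguous length-n slices of seq (empty if seq is shorter than n).
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def bigram_score(s1, s2):
    # Count bigrams in s1, then add up those counts for every bigram of s2.
    counts = Counter(window(s1, 2))
    return sum(counts.get(bigram, 0) for bigram in window(s2, 2))


print(levenshtein("kitten", "sitting"))  # 5
print(bigram_score("banana", "ana"))     # 4
```

Note that the bigram score as written grows when the two strings share more material, so it behaves like a similarity rather than a distance.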
I had to do some human experiments to verify that these distances correlate with human judgements. That was a pain and I hope not to have to work with humans again. However, nobody else has done this, so it was useful as a check to make sure that Levenshtein and max likelihood distances are measuring the right thing.
Recently I have used the distances between speakers to create clusters. This is a standard algorithm, and it’s pretty common in dialect work. The algorithm is more or less as follows: (I haven’t written this as many times as the previous code because I started using Matlab (x_x) because it also generates graphs)
# group_distance must already be defined as some
# clever measurement of distance between a pair of groups (x,y)
def hierarchical_clustering(speakers):
    groups = [mklist(s) for s in speakers] # a list, because lists can't go in a set
    while len(groups) > 1:
        x,y = min_by(group_distance, all_pairs(groups))
        groups.remove(x)
        groups.remove(y)
        groups.append([x,y])
    return groups
# again for those of little imagination...
def mklist(x): return [x]
def min_by(f, l):
    it = iter(l)
    best = next(it)
    bestval = f(best)
    for x in it:
        xval = f(x)
        if xval < bestval:
            bestval = xval
            best = x
    return best
def all_pairs(l):
    acc = []
    for i in range(len(l)):
        for j in range(i+1, len(l)):
            acc.append((l[i],l[j]))
    return acc
This is a completely standard algorithm. I am writing it from memory so if you check Cormen et al you will probably find some errors. Anyway, the cool part about this clustering is that you can analyse the similarities between the speakers that cluster together to see what features they share. The algorithms extract features like “substitutes ‘w’ for ‘r’” for one cluster.
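Here is a runnable sketch of the same clustering loop. The post does not say which group distance was used, so this assumes single linkage (the distance between the two closest members), and the speaker names and distances are made up for the example.

```python
from itertools import combinations


def flatten(group):
    # A group is either a single speaker (a string) or a nested pair of groups.
    if isinstance(group, str):
        return [group]
    left, right = group
    return flatten(left) + flatten(right)


def single_linkage(g1, g2, dist):
    # Assumed linkage: distance between the closest members of the two groups.
    return min(dist[frozenset((a, b))] for a in flatten(g1) for b in flatten(g2))


def hierarchical_clustering(speakers, dist):
    groups = list(speakers)
    while len(groups) > 1:
        # Merge the closest pair of groups into one nested pair.
        x, y = min(combinations(groups, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], dist))
        groups.remove(x)
        groups.remove(y)
        groups.append((x, y))
    return groups[0]


# Made-up distances between four hypothetical speakers.
dist = {
    frozenset(("a", "b")): 1, frozenset(("a", "c")): 6, frozenset(("a", "d")): 7,
    frozenset(("b", "c")): 5, frozenset(("b", "d")): 8, frozenset(("c", "d")): 2,
}
print(hierarchical_clustering(["a", "b", "c", "d"], dist))
# (('a', 'b'), ('c', 'd'))
```

The nested tuples are the cluster tree: the innermost pairs merged first, so they are the most similar speakers.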
So that’s what I’ve done so far. In the future I’m interested in trying some machine learning algorithms to see if the machine can actually learn some phonological properties, like what allophone bundles map to which phonemes. That is, which low-level sounds do implant users perceive as the same? For example, in American English we perceive aspirated ‘t’ and flapped ‘t’ to be pretty much the same. The ‘t’ in “pretty” is usually a flap; it sounds a little affected to say it with extra aspiration, only fit for saying “come here, my pretty -cackle-”.
In contrast, it’s pretty clear that a lot of implant users have trouble learning this. A lot of them put this extra burst of air at the end of words like “boot(h)", for example, which an adult normal-hearing English speaker would never do.
*I like to listen to Kraftwerk and C64 tunes to see what sort of musical taste forms with degraded, weird, electronic input.
**
This was going to be a post recommending the PhD thesis of Noah Silbert, who is a phonetics student at IU. But he hasn’t written up his dissertation yet, as far as I can see. So I’ll post a link when he’s done, I guess. His hypothesis is that our perception of segments* can be modelled by a few binary features that are independent in a few important ways. Then he measures how much error the hypothesis has, and which model fits the data the best. The upshot is that his hypothesis is mostly correct and that [b] is perceived as minimally distinguished from [p] and [d] by one binary feature each. ([voice] for [p/b] and [place] for [b/d].)
This is exciting because (1) it proves that Bob Port’s Doomsday scenario is not true (“it’s all mush! these features are a figment of your imagination! the end is nigh!”) (2) it suggests that Bob Port’s calmer model of exemplar models working for dimensional reduction might actually make sense. (Though Bob’s primary contribution to all this is *still* breathless panic that causes people to sit up and do some experiments to see if it’s justified.)
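To picture the binary-feature result, here is a toy Python encoding in which each segment is a bundle of two binary features and two segments are compared by listing the features on which they differ. The feature values, the extra segment [t] and the comparison itself are assumptions for illustration, not Noah’s actual model.

```python
# Toy feature bundles. Place is simplified to labial=0, alveolar=1.
SEGMENTS = {
    "p": {"voice": 0, "place": 0},
    "b": {"voice": 1, "place": 0},
    "t": {"voice": 0, "place": 1},
    "d": {"voice": 1, "place": 1},
}


def differing_features(a, b):
    # Names of the binary features on which segments a and b disagree.
    fa, fb = SEGMENTS[a], SEGMENTS[b]
    return [name for name in fa if fa[name] != fb[name]]


print(differing_features("b", "p"))  # ['voice']
print(differing_features("b", "d"))  # ['place']
print(differing_features("p", "d"))  # ['voice', 'place']
```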
Oh, right. Windows font smoothing. I’m reading a DOC on Windows and since I normally only see Windows’ font smoothing on screen shots, I keep expecting to see JPEG artifacts around the corners of the letters. Needless to say that hasn’t happened yet.
*A phonological segment is, more or less, equivalent to a letter in the spelling. Of a language with a decent writing system, ie NOT English.