« Monadic parsing in Python, part 2Monadic parsing in Python, part 1 »

Python's iterators are a bad implementation of laziness : with concrete example

29/06/09

Permalink 11:38:25 am, 460 words
Categories: Code, Linguistics

Python's iterators are a bad implementation of laziness : with concrete example

Every time I tell myself I should lighten up and learn to enjoy Python’s lightweight imperative implementation of laziness, a non-local-dependency bug bites me. Here’s the most recent one:

In a script, I changed the following fix-up code from

def fix(node):
    return '-'.join(map(str.strip, reversed(''.join(node).split(','))))

to

def fix(node):
    snode = ''.join(node)
    if snode==')':
        return 'RRB'
    elif snode=='(':
        return 'LRB'
    else:
        return '-'.join(map(str.strip, reversed(''.join(node).split(','))))

Can you see the bug? No? Well, I do a little work twice, joining a list of chars into a string: snode = ‘’.join[node]. I wasn’t sure that this ad-hoc checking for parentheses would work so I didn’t want to refactor the original expression, which also contains ‘’.join(node).

The problem is that node is not a list of chars, but a generator of chars. So it’s use-once. And I used it twice, but the second time, it was unfortunately empty. Oops. Even less fortunately, my program didn’t crash when fix returned an empty string for most inputs, so I didn’t know about this until my script had run for about 5 minutes.

To diagnose this bug, I had to jump up two functions to the main code, then back down two functions into another module. This code is quite lazy, with yield and generators all over the place, so it doesn’t actually read the individual nodes until much later, when fix is called. I thought laziness was the New Python 3.0 Way, but apparently the New Python Way is susceptible to hairy non-local bugs. This is the essence of laziness, to collapse all your nicely-separated code into one giant executing mess, but running into bugs based on the non-locality of the implementation means that the implementation has holes in its abstraction layer.

The use-case is: I create a generator somewhere, then return, pass it around a lot, and finally use the generator twice somewhere else entirely. Remember, this is Python, so I am not constantly reminding myself about the type of something, and, from previous versions of Python and other languages, I expect to be able to use variables more than once.

For example, C# prevents this “use-once” bug by wrapping an additional layer of abstraction around its enumerators. When you create a generator, you don’t get an Enumerator, you get an Enumerable. Using the Enumerable creates a new Enumerator each time, so my above program would have become twice as slow, but at least it wouldn’t have become incorrect.

This is Bad and Wrong and I would suspect that Python 3.4 or whatever will adopt the C# model, *except* that it’s such a glaring problem that somebody surely noticed it during development of 2.4-3.0. It could be that Guido doesn’t like the use-case I present here.

13 comments

Comment from: alexandru [Visitor]
iter(object) gives you a new iterator each time is called. Enumerable is just an extra useless layer.
29/06/09 @ 13:48
Comment from: tom [Visitor] Email
Hey,

this behaviour is indeed annoying, but can be fixed by "copying" the iterator:

return '-'.join(map(str.strip, reversed(''.join(iter(node)).split(','))))

29/06/09 @ 13:50
Comment from: John [Visitor] · http://johnandkaren.com
Yes, the non-repeating nature of generators is a little unintuitive to start with, especially when you are familiar with the way lists work. But, they are no more unintuitive that floating point rounding errors (1.1 + 2.2 != 3.3). Simply learn the type and move on.

And if you need to repeat the generator, use itertools.cycle.
29/06/09 @ 13:54
Comment from: David W [Visitor] Email
The problem is not the iterator behaviour, it's the fact your function includes no docstring nor assertions to verify its inputs, thus providing no clue either to the programmer or the system that something is wrong.

import operator

def fix(node):
"""
Perform some node fix-up.

@param[in] node Sequence of stuff.
@returns Sequence of fixed stuff.
"""

assert operator.isSequenceType(node)
return ''.join(do_stuff())
29/06/09 @ 14:40
Comment from: Derek [Visitor] Email
The problem isn't that Python's implementation of iterators are bad. The problem is that you are misusing iterators. Reading iterators is destructive. This is the nature of iterators.

Are you also surprised when you open a file, read the whole thing, and then find that the file handle now points to the end of the file instead of the beginning?
29/06/09 @ 14:58
Comment from: Uh Clem [Visitor]
You either don't know, or disregarded the semantics of iterators and now you're claiming that it's a problem with Python. How embarrassing.
29/06/09 @ 15:20
Comment from: Dave [Visitor] Email · http://www.example.com
If you use snode (that you already calculated in line 1 of the function) in your else clause instead of ''.join(node), your problem goes away.

Also, it'll be slightly faster and use less memory.

Minus one point for bad refactoring.
29/06/09 @ 15:21
Comment from: sandersn [Member]
@Dave, @David W:
This is a script that I wrote to convert a corpus to a different format. The duplicate use of node was a quick addition I wasn't sure would work. Yes, it's badly factored and documented. I'm never going to look at this code again once it works. (Which it does, finally: snode = ''.join(node).replace('(', "LRB").replace(')', "RRB"); return '-'.join(map(str.strip, reversed(snode.split(',')))).)

Nonlocal dependencies wreck scripting use-cases, because you (1) have to know your entire program—libraries aren't black boxes anymore—or (2) you have to follow Best Practises like "wrap every duplicate use of a parameter with iter" or "match every malloc with a free" or "HTML escape every string before it's presented to the user". Never-screw-up practises like that are acceptable in C, but not in Python, especially when throwing together a quick script.

@Derek, @John:
Yes, I expect math to work and files to behave transparently like strings in memory (or list of string if that's what I request). Compounding my surprise in this case is that I used Python 2 for most of a decade, ignoring iterators in the later versions. I was not expecting the giant warts added starting in 2.3 (2.2?) to be left unfixed in Python 3, and in fact made a central part of the language.

Because of the nonlocality of the bugs, you can't say "learn the type and move on" any more than you can say it with the pointer type in C. Experienced C programmers still write buffer overrun bugs even when they *know* how pointers work. Basically, I started using Python in part because at one time you didn't get any results when googling for "Python pitfalls". This is definitely a Python pitfall.

@Uh Clem:
I *know* the semantics of iterators. But I can't *remember* them when writing code. The iterator abstraction leaks through scopes as badly as the pointer abstraction. Python should at least default to using iterable and giving out a new iterator every time the iterable is used.

@alexandru, @tom:
x = iter('abc'); y = iter(x); print(list(x)); print(list(y)) gives ['a', 'b', 'c'] then [].
29/06/09 @ 17:53
Comment from: Derek [Visitor] Email
> Yes, I expect math to work

Then you totally picked the wrong profession. Integer values can overflow, underflow, truncate during division, etc. Floating point values are approximations and direct comparisons can fail on theoretically equal numbers. Neither behave like true mathematical integers or reals.

> and files to behave transparently like strings in memory (or list of string if that's what I request).

Why would files behave like that? Sure, you could read the file into memory, but then you aren't dealing with the file anymore. You can also read your iterator into memory and avoid the problem you ran into. Is there a way in Python to simply open a file and pretend it's a string (or array of strings)?

> Compounding my surprise in this case is that I used Python 2 for most of a decade, ignoring iterators in the later versions. I was not expecting the giant warts added starting in 2.3 (2.2?) to be left unfixed in Python 3, and in fact made a central part of the language.

I don't think your surprise makes iterators warts. I understand that you could be surprised by their semantics, but I don't see how this is a failing in Python. Lots of languages include iterators with these same basic semantics.

> Because of the nonlocality of the bugs, you can't say "learn the type and move on" any more than you can say it with the pointer type in C.

I don't agree that the bug was nonlocal. You had to dig through code to find the source, but that's because you weren't familiar with iterator semantics. If I didn't understand how hashes worked, I might have to go digging to find out why elements disappear when I insert them twice. That doesn't mean that the bug is nonlocal. I inserted the same element, causing the previous to be overwritten. You tried to read from an empty iterator. In both cases, the source of the bug is local to the manifestation of the bug.

> Basically, I started using Python in part because at one time you didn't get any results when googling for "Python pitfalls". This is definitely a Python pitfall.

It's not a Python pitfall. It's a lack of understanding pitfall. The semantics are implemented correctly. The only thing wrong is that you didn't understand (or forgot) the semantics. If you misuse a hash table by filling it with objects that all hash to the same value, that's a problem with what you did, not with the language/library.

> Python should at least default to using iterable and giving out a new iterator every time the iterable is used.

So get a new iterator every time you need it. There's no reason you have to use the same iterator, except that you wrote your function to expect one. Accept an iterable and call iter on it. That's the use case you want, so make it happen instead of complaining about Python. (By the way, you can also pass an iterable to join instead of an iterator.)

> x = iter('abc'); y = iter(x); print(list(x)); print(list(y)) gives ['a', 'b', 'c'] then [].

Yes, because you are calling iter on the iterator itself. What you need to do is call iter on the original iterable (or create two generators).

I think this really says it all:
"The problem is that node is not a list of chars, but a generator of chars. So it’s use-once. And I used it twice"

You tried to use an iterator as a lazy list, and it didn't work. It's not supposed to work, because an iterator is not a list. As long as you keep thinking of iterators as lists, you will have problems with them.
29/06/09 @ 19:03
Comment from: Reid [Visitor] Email
Your proposed solution to the problem doesn't actually work. How long do you think it will be before someone write a three-liner convenience generator function that takes an iterator and makes it an iterable? There's really no point. I think it's really annoying how in Java iterators are not iterable, since it makes it hard to define multiple kinds of iterators for a collection.

Also, making iterables "reusable" by default totally blows away the memory savings that you might otherwise achieve by using them over lists.

I agree it's slightly confusing, and I think programmers who don't understand how iterators work in Python should probably just use lists instead.

Perhaps a better solution would be to "fail fast" by having the iterator raise an exception if you call next() twice past the end of the input, but then it's hard to pass a partially exhausted iterable and then know if it is exhausted.
29/06/09 @ 19:14
Comment from: enjoydoingitwrong [Visitor] Email
> I *know* the semantics of iterators. But I can't *remember* them when writing code.

Sorry, don't blame Python's laziness. Blame _your_ laziness. Document more you code, use meaningful function names, that will help you to remember what happens in your functions.

fix is not even a half decent function name, it's not documented, your one-liner is terrible to read... don't get surprised if you get bugs that are hard to diagnose, you're asking for it.

> For example, C# prevents this “use-once” bug by wrapping an additional layer of abstraction around its enumerators [...] so my above program would have become twice as slow, but at least it wouldn’t have become incorrect.

It's one of the characteristics of Python. If the programmer wants to do stupid things, he is allowed to do so. Do you want to have someone guiding you during your whole life, or do you prefer having a language forcing you to be responsible? :)
29/06/09 @ 19:54
Comment from: jae [Visitor] Email
The OP is referring to this issue http://bugs.python.org/issue5973, where it is suggested that a generator function return an iterable instead of an iterator.

This is a good suggestion, and I'll post it on python-ideas, as suggested by David Murray in the above link.

*That* said,

==================================


>> The problem is that node is not a list of chars, but a generator of chars.

no, node is whatever you pass in.

>> This is the essence of laziness, to collapse all your nicely-separated code into one giant executing mess

good luck.

>> I create a generator somewhere, then return, pass it around a lot, and finally use the generator twice somewhere else entirely.

problem identified!

===================================

I updated the above link with a workaround function decorator. It's inefficient but should work for some needs.
29/06/09 @ 21:16
Comment from: jae [Visitor] Email
On second thought, I think generator functions should indeed return iterators and not iterables.

1. It is much easier to wrap a generator function into an a function that returns iterables, than to change the behavior of Python

2. There are gotchas in either case, as some people may mistakenly assume that generators return iterators (if it actually returned iterables).

For example, if generator functions returned iterables, one could easily write code like the following:

def foods():
""" returns some food """
for i in range(10):
yield "cookie%i" % i

def eat_some(foods):
""" eat some food """
for food in foods:
print "eating %s" % str(food)
if True: # or any condition of being full
break

my_foods = foods()
eat_some(my_foods) # eats from cookie0
eat_some(my_foods) # regurgitation?!
eat_some(my_foods) # regurgitation?!


I say, it would be more confusing for generator functions to return iterables, since conceptually, it seems like I only called the procedure in "def foods" once!


29/06/09 @ 21:51

Comments are closed for this post.

Nathan Sanders : Journal

powered by b2evolution free blog software