PyCon US 2017 finished more than a month ago. Thanks to the miracle of technology, everyone who wasn’t able to attend can watch all the talks conveniently at home. So I did, starting with a list of talks recommended in one of the recent episodes of the Talk Python To Me podcast.
At this point it’s easy to guess that this post will be about PyCon, and most probably about one of the talks I enjoyed. The talk I’d like to focus on is the Instagram keynote delivered by Lisa Guo and Hui Ding. Really good performance and really good slides; in my opinion, top-notch delivery.
Both speakers talked about the migration of Instagram from Legacy Python (2.7.x) to Modern Python (>3.5). Massive user base and codebase, outdated third-party packages without Python 3 support, outdated Django (super old), and Legacy Python. Sounds like a typical software endeavour, except maybe for the massive user base. When speaking of challenges, Lisa mentioned something that really surprised me: an issue with iterators and the builtin function any. Here is the slide she used to illustrate the problem, and below is the code in question.
CYTHON_SOURCES = ["a.pyx", "b.pyx", "c.pyx"]
builds = map(BuildProcess, CYTHON_SOURCES)
while any(not build.done() for build in builds):
    pending = [build for build in builds if not build.started()]
    # <do some work>
This piece of code works fine under Legacy Python, but when run under Modern Python the first element of builds is lost in the while loop: by the time the loop body runs, builds is missing an element. The cause is the change in the return value of the map function, from a list in Legacy Python to an iterator in Modern Python. This was new to me and I really had to know why the first value is lost.
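To see the lost element without Instagram's real build system, here is a minimal sketch. BuildProcess, its run method, and the run_builds helper are hypothetical stand-ins I made up for illustration; only the while loop follows the slide. It also shows one possible fix, assuming the builds fit comfortably in memory: wrapping map in list.

```python
class BuildProcess:
    """Hypothetical stand-in for the class on the slide; not Instagram's code."""

    def __init__(self, source):
        self.source = source
        self._started = False
        self._done = False

    def started(self):
        return self._started

    def done(self):
        return self._done

    def run(self):
        # Pretend the build starts and finishes immediately.
        self._started = True
        self._done = True


CYTHON_SOURCES = ["a.pyx", "b.pyx", "c.pyx"]


def run_builds(builds):
    """Run the loop from the slide and return the sources that got built."""
    built = []
    while any(not build.done() for build in builds):
        pending = [build for build in builds if not build.started()]
        for build in pending:
            build.run()
            built.append(build.source)
    return built


# Python 3: map() returns a one-shot iterator, so any() swallows the first build.
lazy_result = run_builds(map(BuildProcess, CYTHON_SOURCES))

# Materialising the iterator into a list restores the Python 2 behaviour.
list_result = run_builds(list(map(BuildProcess, CYTHON_SOURCES)))

print(lazy_result)  # ['b.pyx', 'c.pyx']
print(list_result)  # ['a.pyx', 'b.pyx', 'c.pyx']
```

With the bare map object, a.pyx is never built: any pulls it out of the iterator to evaluate the while condition, and the list comprehension only sees what is left.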
Why does it happen?
It is much easier to reason about this with a simpler code example.
Python 3.5.3
>>> m = map(str, [1, 2])
>>> any(m)
True
>>> print(list(m))
['2']
Puzzling. I was wondering if the same thing would happen with Legacy Python. To reproduce it there, the result of map has to be converted into an iterator.
Python 2.7.13
>>> m = iter(map(str, [1, 2]))
>>> any(m)
True
>>> print(list(m))
['2']
This behaviour is shared among Python versions and is most likely intended, since it follows from how iterators work, although I guess it is not spelled out in the documentation of any.
But why does this happen? Answering this question, as usual, requires us to look into how Python is implemented. The implementation of the any function needs to be examined. So pull out the Python/bltinmodule.c file; it can be from any version of Python. Below is the function in question, taken from the source of Python 3.5.1, as it’s the most recent source I have.
static PyObject *
builtin_any(PyModuleDef *module, PyObject *iterable)
{
    PyObject *it, *item;
    PyObject *(*iternext)(PyObject *);
    int cmp;

    it = PyObject_GetIter(iterable);
    if (it == NULL)
        return NULL;
    iternext = *Py_TYPE(it)->tp_iternext;

    for (;;) {
        item = iternext(it);
        if (item == NULL)
            break;
        cmp = PyObject_IsTrue(item);
        Py_DECREF(item);
        if (cmp < 0) {
            Py_DECREF(it);
            return NULL;
        }
        if (cmp == 1) {
            Py_DECREF(it);
            Py_RETURN_TRUE;
        }
    }
    Py_DECREF(it);
    if (PyErr_Occurred()) {
        if (PyErr_ExceptionMatches(PyExc_StopIteration))
            PyErr_Clear();
        else
            return NULL;
    }
    Py_RETURN_FALSE;
}
Looking at the body of any, we can tell that it makes no difference whether you call it with an iterator or with any other iterable. Both can be iterated over; the difference is in what PyObject_GetIter returns for a list versus an iterator.
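As we know, iterators work tirelessly until exhausted, but a list is not an iterator. A list is an iterable. Each time a list is iterated it hands out a fresh iterator, so the list itself is never exhausted. Each for loop operating on a list gets a fresh iterator with all the values. An iterator, by contrast, returns itself, so every consumer shares the same gradually shrinking stream of values.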
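The distinction can be checked from Python itself with the builtin iter, which is the Python-level counterpart of PyObject_GetIter. This is a small sketch for Python 3; the variable names are mine.

```python
numbers = [1, 2]

# A list is an iterable: each iter() call builds a new, independent iterator.
first = iter(numbers)
second = iter(numbers)
fresh_each_time = first is not second

# A map object (Python 3) is already an iterator: iter() returns the same object.
mapped = map(str, numbers)
same_object = iter(mapped) is mapped

# Consuming one list iterator does not affect the other.
next(first)
remaining_first = list(first)
remaining_second = list(second)

print(fresh_each_time, same_object)       # True True
print(remaining_first, remaining_second)  # [2] [1, 2]
```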
In our case important lines are these:
it = PyObject_GetIter(iterable); // This retrieves an iterator
...
item = iternext(it); // This consumes an element
any returns True if any element is true. When running on Modern Python, where map returns an iterator, any will consume every element up to and including the first one that is true. The cause is the loop and the iternext call, which pulls one element after another out of the iterator. This simple example illustrates the behaviour.
Python 3.5.3
>>> m = map(bool, [False, False, True, False])
>>> any(m)
True
>>> print(list(m))
[False]
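For readers who prefer Python to C, the loop in builtin_any can be approximated as below. This is only an illustrative sketch (any_sketch is a name I made up), not the real implementation, and it skips the error handling of the C version.

```python
def any_sketch(iterable):
    """Pure-Python approximation of builtin any(), for illustration only."""
    it = iter(iterable)   # corresponds to PyObject_GetIter
    for item in it:       # corresponds to the repeated iternext(it) calls
        if item:          # corresponds to PyObject_IsTrue
            return True   # stop early; later items stay in the iterator
    return False


m = map(bool, [False, False, True, False])
result = any_sketch(m)
leftover = list(m)

print(result)    # True
print(leftover)  # [False]
```

The early return is what leaves the trailing False in the iterator: everything before it, including the first true element, has already been pulled out.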
One other function came to my mind. How would all behave? Its mechanics seem to be the same as with any. The answer is pretty simple when we look at the source.
static PyObject *
builtin_all(PyModuleDef *module, PyObject *iterable)
{
    PyObject *it, *item;
    PyObject *(*iternext)(PyObject *);
    int cmp;

    it = PyObject_GetIter(iterable);
    if (it == NULL)
        return NULL;
    iternext = *Py_TYPE(it)->tp_iternext;

    for (;;) {
        item = iternext(it);
        if (item == NULL)
            break;
        cmp = PyObject_IsTrue(item);
        Py_DECREF(item);
        if (cmp < 0) {
            Py_DECREF(it);
            return NULL;
        }
        if (cmp == 0) {
            Py_DECREF(it);
            Py_RETURN_FALSE;
        }
    }
    Py_DECREF(it);
    if (PyErr_Occurred()) {
        if (PyErr_ExceptionMatches(PyExc_StopIteration))
            PyErr_Clear();
        else
            return NULL;
    }
    Py_RETURN_TRUE;
}
As we can see, this function also loops over the iterable using an iterator. The condition, however, is slightly different, as it needs to verify that all elements are true. This causes all to consume every element up to and including the first false one. Again, a simple example illustrates this best.
>>> m = map(bool, [True, True, True, False, True])
>>> all(m)
False
>>> print(list(m))
[True]
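To make the number of consumed elements visible for both functions, one option is a small counting wrapper. CountingIterator is a hypothetical helper written for this post, not something from the standard library.

```python
class CountingIterator:
    """Hypothetical helper that counts how many items were pulled from it."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.consumed = 0

    def __iter__(self):
        # An iterator returns itself, just like a map object does.
        return self

    def __next__(self):
        item = next(self._it)  # StopIteration propagates when exhausted
        self.consumed += 1
        return item


values = [True, True, True, False, True]

for_all = CountingIterator(values)
all_result = all(for_all)
all_consumed = for_all.consumed

for_any = CountingIterator(values)
any_result = any(for_any)
any_consumed = for_any.consumed

print(all_result, all_consumed)  # False 4
print(any_result, any_consumed)  # True 1
```

On the same data, all stops after the first false element (four items pulled) while any stops after the first true one (a single item pulled).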
This has been bugging me since I watched the video, and now I have my closure. I learned a bit while investigating this and I’m happy to share it.
P.S. Yes, it is Legacy Python.