Length hints in Python

I was grubbing about trying to understand some library code the other day (I forget which one) and I came across a double-underscore method I’d never seen before:

    def __length_hint__(self):
        return ... # Some internals here.

Since this is a double-underscore method, it can’t be (or shouldn’t be) something the library invented. It must have significance to the Python runtime. A quick search on the manual shows that this is the protocol by which a class can implement the operator.length_hint operator (duh). But of course I’ve not seen that operator either.

length_hint is slightly unusual. Most of the functions in operator correspond to built-in functions or operators accessed via special syntax (like +, == etc.) These are always available regardless of what namespaces you have imported, but are also available as properly namespaced named functions in operator, which can be useful if you want to store them in variables and use them as function arguments. However, length_hint is relatively new and thus doesn’t have any corresponding built-in function.

Anyway, length_hint is defined to return “an estimated length for the object”. It’s actually defined in PEP-424, which makes it clear that the return value of length_hint may be wrong, and may give up and return a default value (the __length_hint__ method is allowed to raise NotImplemented). What’s the use of a function that doesn’t return the right value, or even any value at all?

A simple example

This is just a toy example. Nothing actually works the way I’m going to describe it, but hopefully it will give you the idea of what’s going on.

Let’s play the role of the Python runtime for a bit. Assume we have some empty memory:

The blocks in this diagram could be bytes or multi-byte blocks or memory pages or whatever, it doesn’t really matter.

We want to create a list of prime numbers that we’ve read in from a file. We create the list called primes and load the first three numbers from the file:

Just then, we take a break from this process and need to create another object. Perhaps it’s a log message? One of those kinds of things that happens when you’re in the middle of doing something else.

Having done that, we go back to filling our list with prime numbers from the file. But you can see the problem we run into: if there’s more than 1 more prime number in the list, we end up crashing into the other block of memory.

As you can imagine, this kind of thing happens all the time. Software would get nowhere without a way to deal with this. One of the objects will have to be moved, and that means copying memory around, which wastes time.

It would be so much better if we could tell ahead of time how big our list needed to be. If we did that, we could pre-allocate enough space for the whole list of prime numbers (we’re reading it in from a file, so it must be finite).

This is where it could be useful to know the approximate length of what we’re dealing with. For a file, we can get the file size (in bytes) quickly without having to read every byte of the file. This doesn’t tell us exactly how many numbers are in the file, because different numbers may take up different amounts of space. But if we assume 3 bytes per number (2 characters plus a newline for each one) we can divide the length by 3 and make a guess.

Given our guess, we can pre-allocate enough space for our list and put the log_message item in memory somewhere where we won’t end up having to move it.

Of course, the crude method of dividing the file length by 3 might fail, and we end up not leaving enough space. But that leaves us no worse than if we hadn’t used the length hint and just dropped the log message wherever we felt like it.

Again, none of this is how things actually work, but it’s an idea of why this might be useful.

This worked because we knew we were using a file, and we know things about files (they have a length known without reading the whole file, they include line ending characters etc.). But as good programmers, particularly good Python programmers, we don’t want to have our code making too many assumptions. We want a way to get a length hint from anything: a file, a TCP socket, a list, a generator, or even a custom class that hadn’t even been written when you first wrote your code.

How length hints are actually used

It’s easy to miss the use of length hints in the Python runtime code. Length hints aren’t used in the parts of the code written in Python, only by the C code. If you simply grep the source for __length_hint__ you’ll find it being mentioned a lot, but virtually all the references are the various implementations for built-in types (list, tuple etc.) and don’t show it in use.

Where length hints are used by cpython runtime C code, it is called via the wrapper function PyObject_LengthHint (in Objects/abstract.c). This includes the following code:

Py_ssize_t
PyObject_LengthHint(PyObject *o, Py_ssize_t defaultvalue)
{
    PyObject *hint, *result;
    Py_ssize_t res;
    _Py_IDENTIFIER(__length_hint__);
    if (_PyObject_HasLen(o)) {
        res = PyObject_Length(o);
        if (res < 0 && PyErr_Occurred()) {
            if (!PyErr_ExceptionMatches(PyExc_TypeError)) {
                return -1;
            }
            PyErr_Clear();
        }
        else {
            return res;
        }
    }
    hint = _PyObject_LookupSpecial(o, &PyId___length_hint__);
    if (hint == NULL) {
        if (PyErr_Occurred()) {
            return -1;
        }
        return defaultvalue;
    }
    result = PyObject_CallFunctionObjArgs(hint, NULL);
    Py_DECREF(hint);
    if (result == NULL) {
        if (PyErr_ExceptionMatches(PyExc_TypeError)) {
            PyErr_Clear();
            return defaultvalue;
        }
        return -1;
    }
    else if (result == Py_NotImplemented) {
        Py_DECREF(result);
        return defaultvalue;
    }
// ...

First it checks whether the object in question is of a standard sequence type, in which case it uses the standard method of getting the length directly. This is the check of _PyObject_HasLen. A real length will always be better than a length hint, if it is available.

Then it looks for a __length_hint__ member, bailing out if it doesn’t find one:

    hint = _PyObject_LookupSpecial(o, &PyId___length_hint__);
    if (hint == NULL) {
        if (PyErr_Occurred()) {
            return -1;
        }
        return defaultvalue;
    }

Assuming a length hint method is found, it calls the method and returns it.

This length hint function is used for example in listextend in Objects/listobject.c, which is the implementation of the built-in list type. listextend reads through an iterator, adding the elements to a list. This is obviously used to implement the list.extend() method, but it’s also used when concatenating lists or when instantiating a new list (the empty list is constructed, then the extend code is called).

The length hint is used like this (some details omitted):

    /* Guess a result list size. */
    n = PyObject_LengthHint(b, 8);
    if (n < 0) {/* Deal with bad result */}
    m = Py_SIZE(self);
    if (m > PY_SSIZE_T_MAX - n) {
        /* Overflow. Ignore this. */
    }
    else {
        mn = m + n;
        /* Make room. */
        if (list_resize(self, mn) < 0)
            goto error;
        /* Make the list sane again. */
        Py_SIZE(self) = m;
    }

The key point is the call to list_resize. This resizes the list to be big enough to include its existing contents, plus the hinted size of the iterator. Assuming the hint is correct, there will be no need to resize the list as it fills up. Even if it’s slightly off, it should still require fewer resizings, saving time overall.

Should you bother with this?

The above shows that this really is used, and if you implement a __length_hint__ on your iterators it will save some unnecessary work at runtime. But bear in mind that this is quite an edge case. Saving 10 or 20 allocations won’t make a measurable difference. It will only matter if you’re hitting this code thousands, maybe even millions of times in the hot loop of your algorithm.

My guess is that this is worth doing for any iterator where it’s easy to write code (say 2 or 3 lines) to generate a decent hint. If it’s more than that, wait until you’ve proven that this is something your code is hitting enough times to justify the effort.

An exception to this might be if you’re writing library code to be used by lots of different people. In that case, you can’t know how your code might be used and the payoff of your efforts could be multiplied by the number of users who end up using it. But the more generic your code is, the harder it is likely to be to write a decent length hint method.

Leave a Reply

Your email address will not be published. Required fields are marked *