Dictionaries preserve ordering

I came across something unexpected when preparing my post about the basics of dictionaries in Python. For as long as I’ve been using Python, it’s been a fundamental truth that dictionaries don’t have a reliable ordering. It’s mentioned all over StackOverflow. The standard library even has an OrderedDict container for cases where you want your dictionary to have order.

Turns out it isn’t so.

I was trying to create an example showing that, fundamentally, dictionaries return elements in an order that (apparently) has nothing to do with the keys or the order in which you inserted them. Something like this:

fruits = {}
fruits["apple"] = 100
fruits["banana"] = 200
fruits["cherry"] = 300

I printed the dictionary. The fruits came back in alphabetical order. Dammit.

I guess I had bad luck and the keys just happened to be in the naive order. It’s a 1 in 6 chance, right?

fruits["damson"] = 400

Still in insertion order. Maybe something is up.
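For reference, here is the whole experiment in one place. On CPython 3.7 or later (where, as we'll see, the behaviour is guaranteed) the printed output should match the comment exactly:

fruits = {}
fruits["apple"] = 100
fruits["banana"] = 200
fruits["cherry"] = 300
fruits["damson"] = 400

# Keys come back in insertion order, not in some hash-determined order:
print(fruits)  # {'apple': 100, 'banana': 200, 'cherry': 300, 'damson': 400}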

This actually isn’t anything mysterious: it’s right there in the documentation:

Changed in version 3.7: Dictionary order is guaranteed to be insertion order. This behavior was an implementation detail of CPython from 3.6.

I’m on Ubuntu, so I’m getting Python 3.6 as my default Python 3. Mystery solved.

I don’t know about anyone else, but I was really surprised to see this change. I was assuming that making things more user-friendly like this would only come at the cost of making things slower, and there’s no way that Python would make things slower across the board.

Why wasn’t it always this way?

To recap some of the basics: Dictionaries don’t preserve ordering because they’re implemented using hash tables. A hash table is a data structure that gives very fast lookup regardless of the number of items in the container. It works like this:

Start with a so-called “hash function” that maps strings (or whatever key value you’re using—I’ll keep it simple here) into integers. To insert a value, apply the function to the key and then store the value in an array at that numbered position. Retrieving it again works the same way. The hash function is quick, and fetching an item from a specified position in an array is trivial. The great thing about it is that it’s equally fast no matter how many other elements are stored in the hash table (I’m skipping over a few details here, to keep things simple).
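In Python itself the built-in hash() plays this role. A minimal illustration (note that CPython randomizes string hashes per process, so the slot you get will vary from run to run):

# Map a key to one of 8 slots in the array.
slot = hash("apple") % 8
print(slot)  # some number between 0 and 7; it differs between runs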

To be concrete, suppose each of our three fruit keys hashes to a different small number between 0 and 7. We can then store the three entries in an 8-element array, each one sitting in the slot its hash points to, with the other five slots left empty.

We can now easily look up any of the values from its key: hash the key, look up the corresponding position in the array, check that the key matches what we expect and then return the value.
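Here is a toy version of the whole idea in plain Python: a fixed 8-slot array, with no handling of collisions or resizing (a sketch to show the shape of the thing, not how CPython really does it):

SLOTS = 8
table = [None] * SLOTS  # mostly empty; each filled slot holds (key, value)

def put(key, value):
    index = hash(key) % SLOTS
    table[index] = (key, value)  # ignoring collisions to keep things simple

def get(key):
    index = hash(key) % SLOTS
    entry = table[index]
    if entry is not None and entry[0] == key:  # check the key matches
        return entry[1]
    raise KeyError(key)

put("apple", 100)
put("banana", 200)
put("cherry", 300)
print(get("banana"))  # 200 (assuming no collisions among these three keys)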

But note that there's no way to produce the keys in alphabetical order without scanning the whole array to collect the keys and then sorting them (a costly operation). Nor can we get the keys in insertion order, because nothing about the structure records the order they were inserted; the layout above would look the same whatever order you inserted the three values.

So why did it change?

One of the downsides of a hash table can be seen in the layout above: generally there's a bunch of empty space in it. This empty space isn't zero cost: this is an array, meaning that it's all one block of memory, and the empty slots are blank memory just taking up space. There are two ways to limit this overhead: either pick a clever hash function that maps the keys as densely as possible into as small a space as possible, or limit the size of each element in the array. In Python 3.6, a change was made to achieve the latter.

Looking a little deeper at how Python actually uses hash tables: firstly, the hash function isn't as simple as I suggested above. The hash function produces a 64-bit integer, which is then used to identify an entry in the array (by using 3 bits of the integer, if we've got an 8-entry array). This is done because if enough data is inserted into the hash table it will have to be resized, and it's convenient to reuse the same hash values in the new, bigger table. For the same reason, the hash values are stored in the array rather than discarded as I suggested above. This means that each slot of our hash table ends up holding three things: the hash, the key and the value.
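Sketched in Python terms (this is just an illustration of the layout, not CPython's actual C structs, and the slot positions are made up):

# Each slot is either empty (None) or a (hash, key, value) triple.
# On a 64-bit build each field is pointer-sized, i.e. 8 bytes.
table = [None] * 8
table[1] = (hash("apple"), "apple", 100)
table[4] = (hash("banana"), "banana", 200)
table[6] = (hash("cherry"), "cherry", 300)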

There are 3 items in each array entry, each of them 8 bytes on a 64-bit build, so our wasted space is actually 24 bytes for each empty array entry. Can we improve on this?

Yes we can. Instead of having each slot of the array hold the hash value, key and value, make it store a single index into another, separate array.
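In Python-flavoured pseudocode the new arrangement looks something like this (again a sketch of the idea, ignoring collisions and resizing, rather than CPython's real implementation):

# Sparse table: each slot holds either None or an index into `entries`.
indices = [None] * 8

# Dense table: one (hash, key, value) triple per item, in insertion order.
entries = []

def insert(key, value):
    entries.append((hash(key), key, value))   # appending preserves insertion order
    indices[hash(key) % 8] = len(entries) - 1

insert("apple", 100)
insert("banana", 200)
insert("cherry", 300)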

It might look like we’ve just moved the data around, but note that this new array is dense: there is no wasted space in it between elements. Each newly inserted element gets added on to the end of this array. Now we only have 8 bytes of space used per empty entry in the hash table, rather than 24. On the other hand, we’ve added 8 bytes of space per filled entry in the hash table. But this is probably a win overall for size.

We still have all the desirable hash table properties, and the only cost is an extra level of indirection to find the value we want. But on a modern CPU following a pointer is cheap provided the data it points to is in cache, and data is much more likely to be in cache if it is small. So making the data structure smaller, at the cost of one more lookup, is likely to make things faster as well as smaller.

Wait, what does this have to do with preserving order?

Usually when we make code faster we end up making it less useful, or at least less flexible. But in this case something nice falls out for free. Note that the new, dense array the data is stored in is automatically in insertion order. If we want to iterate over the elements of the dict, we don't need to look at the sparse index table at all: we can just walk the dense array and produce the keys in insertion order.
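Continuing the sketch from above, iteration never touches the sparse index table:

def iter_keys():
    # The dense entries array is already in insertion order.
    for _hash, key, _value in entries:
        yield key

print(list(iter_keys()))  # ['apple', 'banana', 'cherry']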

For more information see this thread from python-dev way back in 2012 when the change was first proposed.

Does this mean that collections.OrderedDict is now obsolete?

Good question. Actually, no. The thing about an OrderedDict is that the ordering is assumed to have meaning over and above the values stored. The most obvious way this comes into play is in comparisons: two OrderedDict instances only compare equal if the keys and values are the same and the insertion order is the same. For a regular dict we don't want this to be the case: two dicts with the same keys and values compare equal regardless of insertion order. Even if we did want to change this we couldn't, because it would break existing code.
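A quick demonstration of the difference (this is documented behaviour, not an implementation detail):

from collections import OrderedDict

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}  # same keys and values, different insertion order
print(a == b)         # True: plain dict equality ignores order

oa = OrderedDict([("x", 1), ("y", 2)])
ob = OrderedDict([("y", 2), ("x", 1)])
print(oa == ob)       # False: OrderedDict equality is order-sensitive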

It might be that dict and OrderedDict can share all their implementation internals and just have a flag set for “ordering is significant for equality test” or whatever. I’ll admit I haven’t looked into this.

Can I rely on this behaviour?

There’s a big difference between behaviour that happens on a certain version of Python and behaviour you can rely on in your code. As the docs quoted above say, as of Python 3.7 this is an official specification and not just an implementation detail. For more information about why this is specified, see this post from Guido.
