If you’re writing serious Python you’re hopefully testing it, and if you’re doing any serious testing you are hopefully measuring how much code coverage you have.

But how does code coverage measurement actually work? If I execute a test function and get back a result, how can I possibly tell which lines of code were executed to get that result? And can this be done in pure Python, or is compiler magic involved?

Our goal is to end up with something like this:

This tells us that this code has been pretty well tested, but not on Python 3. In order to generate this sort of thing, there are actually two steps:

  • The tests run, and while they run they dump out information about which lines are being executed. The aim here is to dump out the necessary information as quickly as possible, without making the program run any slower than it has to.
  • After the tests finish, the output from the first step is filtered and collated into a format that is understandable by human beings.

The first step is the only bit I’m interested in here. The second step has its own challenges, but doesn’t need any particular access to the Python internals to do its job.

Really what we need is our code to do something like:

  1. gadget = widget.load(); report_line_hit('mymodule.py', 1)
  2. if gadget.fancy:
  3.     gadget.render_decorations(); report_line_hit('mymodule.py', 3)
  4. else:
  5.     pass; report_line_hit('mymodule.py', 5)

We have report_line_hit dump these events out to a file, and then we can go through the file later and figure out which lines were hit.
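To make the two steps concrete, here is a minimal sketch of what report_line_hit and the later collation pass might look like. The log path, the `filename:lineno` record format, and the collate_hits and run_instrumented helpers are all assumptions made for illustration, and a plain boolean stands in for gadget.fancy.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical log location, chosen only for this sketch.
LOG_PATH = os.path.join(tempfile.mkdtemp(), "line_hits.log")


def report_line_hit(filename, lineno):
    """Step one: append a 'filename:lineno' record to the log file."""
    with open(LOG_PATH, "a") as log:
        log.write(f"{filename}:{lineno}\n")


def collate_hits(log_path):
    """Step two: reduce the raw log to the set of lines hit per file."""
    hits = defaultdict(set)
    with open(log_path) as log:
        for record in log:
            filename, _, lineno = record.strip().rpartition(":")
            hits[filename].add(int(lineno))
    return hits


def run_instrumented(fancy):
    """Hand-instrumented stand-in for mymodule.py."""
    open(LOG_PATH, "w").close()  # start each run with an empty log
    decorations = []; report_line_hit("mymodule.py", 1)
    if fancy:
        decorations.append("border"); report_line_hit("mymodule.py", 3)
    else:
        pass; report_line_hit("mymodule.py", 5)
    return sorted(collate_hits(LOG_PATH)["mymodule.py"])


print(run_instrumented(True))   # [1, 3]
print(run_instrumented(False))  # [1, 5]
```

Reopening the file on every hit keeps the sketch short, but it is far too slow for real use; a real tool would buffer the records in memory.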

In C and other compiled languages we have to achieve this by compiling the code in a special mode. This produces a different version of the executable that performs the same operations but also logs out information as it goes along. There are at least two downsides to this:

  • The code has to be compiled specially. This means the build scripts need to know about coverage testing, and either the build takes twice as long because everything is built twice, or you need to do an extra coverage build on demand, which developers rarely bother to do, so coverage doesn’t get measured often enough.
  • The code being tested is not identical to the normal production code. This usually doesn’t matter, but it sometimes introduces subtle problems.

Python doesn’t have an ahead-of-time compile step like C or Java, but each file is still compiled to bytecode on demand as it gets loaded. If you’ve read my previous series on assertion rewriting, you might be wondering whether we can do this by rewriting the code on the fly. The answer is that you probably can, but that’s not how it’s done in practice. Even if we had a library that rewrote Python as it was loaded to add coverage annotations, we’d solve the first of the problems above but not the second: the code being tested would still differ from the production code.

Modifying code on the fly is actually a really bad idea, because of how much work it takes and how difficult it is to get everything right. Pytest only rewrites assert statements, and only in the test code. If we wanted to do coverage the same way we’d have to modify every line of the program, and we’d be modifying product code as well as test code.
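Purely to illustrate the rewriting idea, and not how any real coverage tool works, here is a rough sketch using the standard ast module. The AddLineReports class and run_rewritten helper are hypothetical names. Unlike the hand-annotated listing earlier, it reports each line just before the statement runs rather than after, and it ignores complications such as docstrings and `from __future__` imports, which hints at how much there is to get right.

```python
import ast

hits = []


def report_line_hit(filename, lineno):
    hits.append((filename, lineno))


class AddLineReports(ast.NodeTransformer):
    """Insert a report_line_hit(...) call in front of every statement."""

    def __init__(self, filename):
        self.filename = filename

    def _report_call(self, stmt):
        call = ast.Expr(value=ast.Call(
            func=ast.Name(id="report_line_hit", ctx=ast.Load()),
            args=[ast.Constant(value=self.filename),
                  ast.Constant(value=stmt.lineno)],
            keywords=[],
        ))
        return ast.copy_location(call, stmt)

    def generic_visit(self, node):
        # Rewrite the children first, then instrument this node's own
        # statement lists (expression-valued 'body' fields are skipped).
        super().generic_visit(node)
        for field in ("body", "orelse", "finalbody"):
            statements = getattr(node, field, None)
            if isinstance(statements, list):
                instrumented = []
                for stmt in statements:
                    instrumented.extend([self._report_call(stmt), stmt])
                setattr(node, field, instrumented)
        return node


def run_rewritten(source, filename):
    """Parse, instrument, compile and run source; return the recorded hits."""
    hits.clear()
    tree = AddLineReports(filename).visit(ast.parse(source, filename))
    ast.fix_missing_locations(tree)
    exec(compile(tree, filename, "exec"), {"report_line_hit": report_line_hit})
    return list(hits)


SOURCE = """\
size = 5
if size > 3:
    label = "big"
else:
    label = "small"
"""

print(run_rewritten(SOURCE, "demo.py"))
# [('demo.py', 1), ('demo.py', 2), ('demo.py', 3)]
```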

Luckily we don’t need to modify the code, because Python exposes a hook that does exactly what we want: sys.settrace. Calling this gives the Python interpreter a function to call whenever a new function call begins; that function can in turn return a second function, which the interpreter calls each time a new line of code is about to be executed in that call. The interface is actually a little fiddly to get right, but this is the major problem sorted.
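As a rough sketch of the shape of that interface: the function passed to sys.settrace receives a 'call' event for each new frame, and whatever it returns is then used as the local trace function for that frame, receiving 'line' and 'return' events. The choose function and trace_call helper below are made up for the demonstration, and line numbers are recorded as offsets from the def line so the output does not depend on where the code sits in a file.

```python
import sys

events = []


def local_trace(frame, event, arg):
    # Called for events inside one traced frame ('line', 'return', ...).
    offset = frame.f_lineno - frame.f_code.co_firstlineno
    events.append((event, offset))
    return local_trace


def global_trace(frame, event, arg):
    # Called with a 'call' event whenever a new frame is entered.
    # Returning None tells the interpreter not to trace that frame.
    if frame.f_code is choose.__code__:
        offset = frame.f_lineno - frame.f_code.co_firstlineno
        events.append((event, offset))
        return local_trace
    return None


def choose(fancy):
    if fancy:
        result = "decorated"
    else:
        result = "plain"
    return result


def trace_call(func, *args):
    """Run func(*args) with tracing switched on; return the recorded events."""
    events.clear()
    previous = sys.gettrace()
    sys.settrace(global_trace)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever was there before
    return list(events)


recorded = trace_call(choose, True)
print([event for event, _ in recorded])
# ['call', 'line', 'line', 'line', 'return']
print([offset for event, offset in recorded if event == "line"])
# [1, 2, 5]
```

Calling trace_call(choose, False) instead gives line offsets [1, 4, 5], so the untaken branch shows up as a gap in the recorded lines.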

So can we write our own pure-Python coverage tester in a handful of lines of code? The answer is yes, although it will be very slow. You also need to be careful that whatever code you write for logging coverage doesn’t end up triggering a load of further function calls that trigger even more calls to the coverage tester. I might attempt to do this in a future blog post.
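For a feel of what "a handful of lines" and "very slow" might mean, here is a small sketch, assuming we only care about one function at a time. The run_with_line_log and compare_timings helpers are hypothetical names. The trace callback does nothing except add to a set, and the timings are simply measured rather than promised, since they will vary by machine and Python version.

```python
import sys
import time


def run_with_line_log(func, *args):
    """Run func(*args) under a pure-Python tracer; return (result, lines hit)."""
    target = func.__code__
    lines_hit = set()

    def local_trace(frame, event, arg):
        # Keep the callback tiny: it runs once per executed line.
        if event == "line":
            lines_hit.add(frame.f_lineno - target.co_firstlineno)
        return local_trace

    def global_trace(frame, event, arg):
        # Only trace the function we were asked about.
        return local_trace if frame.f_code is target else None

    previous = sys.gettrace()
    sys.settrace(global_trace)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return result, lines_hit


def sum_of_squares(limit):
    total = 0
    for number in range(limit):
        total += number * number
    return total


def compare_timings(limit=200_000):
    """Time one plain run and one traced run of sum_of_squares."""
    start = time.perf_counter()
    plain_result = sum_of_squares(limit)
    plain_seconds = time.perf_counter() - start

    start = time.perf_counter()
    traced_result, lines_hit = run_with_line_log(sum_of_squares, limit)
    traced_seconds = time.perf_counter() - start

    same = plain_result == traced_result
    return same, plain_seconds, traced_seconds, sorted(lines_hit)


same, plain_seconds, traced_seconds, lines = compare_timings()
print(f"same result: {same}, lines hit (offsets from def): {lines}")
print(f"plain: {plain_seconds:.4f}s, traced: {traced_seconds:.4f}s")
```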

Real-world coverage tools such as coverage.py use a trace callback written in C, which makes them much faster.