Underhanded Python: giving the debugger the wrong line numbers

In my last post I showed how it’s easy for Python code to detect the presence of the debugger and change its behaviour accordingly. But this did nothing to deal with single-stepping through the code in the debugger: the debugger would show execution hitting the debugger-detection code and changing its behaviour. You could probably disguise this code a bit, but it’s hard to avoid giving the impression that something suspicious is going on.

Luckily we have another tool at our disposal: we can lie to the runtime about our line numbers.

How single-stepping works in the Python debugger

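Executing Python is a 3-stage process:

  • Parsing: the source code is parsed into an abstract syntax tree (AST)
  • Compiling: the AST is compiled into bytecode
  • Executing: the interpreter runs the bytecode

The first stage is the only one that works with the source code, while debugging takes place at the execution stage. The line numbers therefore need to be preserved at each stage so they can be passed on to the next.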

You can see this when you parse the code. Code like this:

a = 1 + 1

gets converted to an AST: an Assign node whose target is the Name a and whose value is a BinOp node that adds two Constant nodes.

You can inspect this AST and see the line numbers:

>>> import ast
>>> code = "a = 1 + 1"
>>> tree = ast.parse(code)
>>> print(tree.body[0].lineno)
1
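Every node carries this information, not just the first statement. As a small sketch (the three-line snippet here is made up for illustration), you can read the line number off each statement and off a nested expression:

```python
import ast

# A made-up three-line snippet, purely for illustration.
source = "a = 1\nb = 2\nc = a + b\n"
tree = ast.parse(source)

# Collect (node type, line number) for every top-level statement.
statement_lines = [(type(node).__name__, node.lineno) for node in tree.body]
print(statement_lines)

# Nested expression nodes carry a line number as well.
binop = tree.body[2].value
print(type(binop).__name__, binop.lineno)
```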

When this AST is compiled into bytecode, the line numbers get attached to the bytecode. The details are a bit complicated because Python tries to make the bytecode as compact as possible, but conceptually there’s a mapping from each bytecode instruction to the corresponding line of the source code it originally came from.

Of course, each line of the source code will typically generate many bytecode instructions, because an individual bytecode instruction is so much less powerful than a line of Python (that’s kind of the point of it). So you end up with a run of bytecode instructions that all correspond to the same line number. The disassembler shows you this:

import dis

source = """
def add(a, b):
    print("Adding")
    return a + b
"""

code = compile(source, 'foo.py', 'exec')

dis.dis(code)

produces:

  2           0 LOAD_CONST               0 (<code object add at 0x7f6ceef89810, file "foo.py", line 2>)
              2 LOAD_CONST               1 ('add')
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (add)
              8 LOAD_CONST               2 (None)
             10 RETURN_VALUE

Disassembly of <code object add at 0x7f6ceef89810, file "foo.py", line 2>:
  3           0 LOAD_GLOBAL              0 (print)
              2 LOAD_CONST               1 ('Adding')
              4 CALL_FUNCTION            1
              6 POP_TOP

  4           8 LOAD_FAST                0 (a)
             10 LOAD_FAST                1 (b)
             12 BINARY_ADD
             14 RETURN_VALUE

where the left-hand numbers are the source code lines.
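If you only want the mapping rather than the full disassembly, dis.findlinestarts gives the bytecode offsets at which the line number changes. This is a rough sketch using the same foo.py source as above. The exact offsets (and whether the def line itself shows up) vary between Python versions, so it only prints the set of lines it finds:

```python
import dis
import types

source = """
def add(a, b):
    print("Adding")
    return a + b
"""

module_code = compile(source, "foo.py", "exec")

# The function body is a nested code object stored among the module's constants.
add_code = next(c for c in module_code.co_consts if isinstance(c, types.CodeType))

# findlinestarts yields (bytecode offset, source line) wherever the line changes.
line_starts = list(dis.findlinestarts(add_code))

# Some Python versions can report None for instructions with no line; skip those.
lines_seen = sorted({line for _, line in line_starts if line is not None})
print(lines_seen)
```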

This brings us back to sys.settrace. When you register a trace function in Python, you can arrange for your trace function to be called on various events. For our purposes, we care about:

  • call: entering a function or code block
  • line: a new line of code is about to be executed, i.e. the next bytecode instruction is associated with a different line number than the previous one.

The practicalities are a bit fiddly, but essentially this gives us what we need for a single-step debugger: a function gets called on each new line, and that function can suspend execution and allow the system to be inspected.
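To make that concrete, here is a minimal sketch of a trace function that just records events instead of pausing. The add function and the events list are made up for this example, and a real debugger does far more bookkeeping than this:

```python
import sys

def add(a, b):
    total = a + b
    return total

events = []

def tracer(frame, event, arg):
    # Only record events that come from the function we care about.
    if frame.f_code is add.__code__:
        # Store line numbers relative to the line the function starts on.
        events.append((event, frame.f_lineno - frame.f_code.co_firstlineno))
    # Returning the tracer asks Python to keep sending events for this frame.
    return tracer

sys.settrace(tracer)
try:
    result = add(1, 2)
finally:
    # Always switch tracing off again, even if the traced call fails.
    sys.settrace(None)

print(result, events)
```

The line numbers reported here come from frame.f_lineno, which Python derives from the mapping attached to the bytecode. The tracer never looks at the source file itself.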

How we can lie to the debugger

The debugger is reliant on the information in the code object attached to the function. That information is readable like any other Python object property:

def add(a, b):
    print("adding")
    return a + b

print(add.__code__.co_firstlineno)
print(add.__code__.co_lnotab)

The co_firstlineno is simply the source line of the start of the function, and the co_lnotab is a table mapping bytecode positions to source code positions. In our case, we get:

1
b'\x00\x01\x08\x01'

The second line is a sequence of byte pairs, each holding a bytecode offset increment followed by a line number increment:

  • Bytecode offset 0 (00) corresponds to 1 line after co_firstlineno (01)
  • The bytecode 8 bytes further on (08) corresponds to 1 line after the previous line (01)
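The decoding can be sketched in a few lines. The decode_lnotab helper below is my own illustration rather than anything in the standard library, and it follows the co_lnotab format as I understand it for Python 3.8 (later versions store the table differently):

```python
def decode_lnotab(lnotab, first_line):
    """Expand a co_lnotab byte string into (bytecode offset, source line) pairs."""
    pairs = []
    last_line = None
    offset, line = 0, first_line
    # The table is a flat sequence of (offset increment, line increment) bytes.
    for offset_step, line_step in zip(lnotab[0::2], lnotab[1::2]):
        if offset_step:
            # The offset is about to move on, so the current line is final here.
            if line != last_line:
                pairs.append((offset, line))
                last_line = line
            offset += offset_step
        # Line increments are signed bytes, so lines can also go backwards.
        if line_step >= 0x80:
            line_step -= 0x100
        line += line_step
    if line != last_line:
        pairs.append((offset, line))
    return pairs

print(decode_lnotab(b"\x00\x01\x08\x01", first_line=1))
# [(0, 2), (8, 3)]
```

So the bytecode starting at offset 0 belongs to line 2 (the print call) and the bytecode starting at offset 8 belongs to line 3 (the return).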

Let’s suppose we have two functions: our malicious function, and a decoy function that we want the user to see when they single-step through the code:

import sys

def add(a, b):
    return a + b

def bad_add(a, b):
    if sys.gettrace() is None:
        print("Doing something malicious")

    return a + b

You might hope that you could just overwrite bad_add’s line number information:

bad_add.__code__.co_lnotab = add.__code__.co_lnotab

But this doesn’t work, because this is a read-only field in Python. However, what you can do is create a new code object that has all the same properties as the old one except for the ones we want to rewrite. You can create a new code object with types.CodeType, but the interface is a bit fiddly. As of Python 3.8 you can do this much more easily by calling the replace method on the old code object, which creates a modified copy:

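# co_lnotab is the field name on Python 3.8 and 3.9;
# from Python 3.10 the table is stored as co_linetable instead.
add.__code__ = bad_add.__code__.replace(
    co_lnotab=add.__code__.co_lnotab,
    co_firstlineno=add.__code__.co_firstlineno,
)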

This replaces the entire add code object with a newly created one. It only replaces the code of add; it doesn’t change its other properties (such as the __name__). In other words, add will continue to look like the old add but it will walk and quack like bad_add.

But the clever bit is that the new code object is a hybrid: it has bad_add’s behaviour, but add’s line number information. If you try to step through it in the debugger, the debugger will show you (line by line) the source code of add even as it executes bad_add.
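You can see the effect without a debugger by comparing what a code object claims before and after the copy. This sketch is a deliberately tamer variation: it only shifts co_firstlineno by an arbitrary 100 lines, rather than borrowing another function's table, because that field has kept its name across Python versions. The line_numbers helper is made up for the example:

```python
import dis

def add(a, b):
    return a + b

def line_numbers(code):
    """Source lines that a code object claims its instructions came from."""
    return [line for _, line in dis.findlinestarts(code) if line is not None]

original = add.__code__

# Copy the code object, claiming that the function starts 100 lines later.
shifted = original.replace(co_firstlineno=original.co_firstlineno + 100)

before = line_numbers(original)
after = line_numbers(shifted)

# Swapping the code object in changes what tools report, not what the function does.
add.__code__ = shifted
print(before, after, add(1, 2))
```

Every reported line moves by 100 while the function still returns the same result, which is the same trick in miniature: the behaviour and the claimed location are stored separately, and nothing checks that they agree.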

This was pretty trivial to do, but to be fair it doesn’t actually work all that well. The source code mappings for add aren’t really a good substitute for those for bad_add, because they have different bytecode. In our case we get away with it because add is pretty trivial, but in real code this would probably lead to odd behaviour: single-stepping would cause the debugger to leap around the function in odd ways that didn’t correspond to the code, and it would be pretty obvious that something was wrong.

However, for a motivated attacker it wouldn’t be that hard to create a more accurate fake source code mapping, which would make this much harder to track.

Of course, this code is still pretty obviously malicious if someone looks at the right piece of source code. The malicious piece can be hidden somewhere well away from the code that someone would be likely to inspect (in a different module, even) but once it’s spotted it’s pretty obvious that it’s doing something underhanded. I have a few thoughts about how this could be addressed, which I’ll talk about next time.
