Exploring Python dis

As you might know, CPython, the most commonly used implementation of Python, uses a stack based virtual machine to run Python scripts. This means that Python script ASTs are never directly executed. Instead, CPython compiles .py script files into byte code files first (.pyc files). This allows for some optimizations during the byte code compilation step and shipping of byte code files to end customers instead of raw scripts. This is useful to make reverse engineering hard. Ableton Live includes a CPython interpreter for its MIDI hardware integration. It does not let users look at the actual hardware integration scripts though. Instead, all hardware integration scripts can only be found in their compiled .pyc form.

So what exactly can be found in CPython byte code? Byte code contains the stack instructions for CPython’s virtual machine. Enter the Python disassembly library dis. It is included in the CPython standard library and allows developers to view and understand the internal bytecode representation of code objects in CPython.

Other implementations of Python typically do not use an internal byte code representation. Therefore, the following is highly CPython specific.

First, let’s get familiar with code objects. Code objects can be created from any executable series of Python statements. For example, if we enter the following in a REPL

def show_weather():
    print("The weather is great!")

we not only define a function show_weather, but a CPython code object as well. But first, we need to check the output of our newly defined function.

The weather is great!

Now that the we know that the weather is great, we should take a look at the code object. A CPython code object can be accessed through the __code__ attribute of a function.

Before even using the dis library, we can actually peek at the byte code of our show_weather function.

code_object = show_weather.__code__ # __code__ contains the function internals
code = code_object.co_code

That does not look very helpful. Thankfully, the dis module has an opname list, that maps byte code instructions (opcodes) to human readable names, similar to assembler instruction mnemonics such as MOV, ADD, etc.

We need to keep in mind though, that we cannot just run all bytes through the opname list, as some instructions can require attributes. Attributes usually are numbers that refer to local and global names. Therefore, not all bytes in a byte code object represent byte code instructions.

Let’s try it out. We will see if there is a problem.

from dis import opname
for byte in code:
    print('Byte {} -> opcode {}'.format(byte, opname[byte]))
Byte 116 -> opcode LOAD_GLOBAL
Byte 0 -> opcode <0>
Byte 0 -> opcode <0>
Byte 100 -> opcode LOAD_CONST
Byte 1 -> opcode POP_TOP
Byte 0 -> opcode <0>
Byte 131 -> opcode CALL_FUNCTION
Byte 1 -> opcode POP_TOP
Byte 0 -> opcode <0>
Byte 1 -> opcode POP_TOP
Byte 100 -> opcode LOAD_CONST
Byte 0 -> opcode <0>
Byte 0 -> opcode <0>
Byte 83 -> opcode RETURN_VALUE

That looks interesting. We immediately notice the occurrence of <0> opnames. Since a lot of the CPython opcodes require arguments to be specified, not all bytes in our bytecode represent opcodes, but references to values. In this case <0> is part of a numeric reference to a constant or global value. POP_TOP is part of the reference number as well, but we naively interpret it as an opcode.

If we want to know which values our function is referring to, we have to dig into the code object even more. In this particular example we are interested in constants and name references. You will see why.

def pformat(tpl):
    return " \n".join("{}: {}".format(n, val) for n, val in enumerate(tpl))

print("Constant values:")

Constant values:
0: None
1: The weather is great!
0: print

Excellent. We were able to retrieve the reference to print and the string constant The weather is great!. Let’s see whether we’re able to match the bytecode references with our references.

Luckily, the dis module has a few lists of byte codes called hasname, hasconst, hasnargs and similar, that contain all the opcodes that require name, constant or other arguments. Since our code only has constants, global values, and function calls, we can go on without caring about other types of arguments. Function calls require a 2 byte argument that specifies the amount of keyword parameters (significant byte) and the number of positional parameters (least significant byte) that have been placed on the stack.

from dis import hasname, hasconst, hasnargs

code_listing = list(code) # copy the code as we are going to pop bytes off
while code_listing:
    byte = code_listing.pop(0)
    addendum = ''
    if byte in hasname:
        index = code_listing.pop(0) << 1 + code_listing.pop(0)
        addendum = '(name {})'.format(repr(code_object.co_names[index]))
    elif byte in hasconst:
        index = code_listing.pop(0)
        addendum = '(const {})'.format(repr(code_object.co_consts[index]))
    elif byte in hasnargs:
        nargs = code_listing.pop(0)
        nkwargs = code_listing.pop(0)
        addendum = '({} kw params, {} positional params)'.format(
            nkwargs, nargs)
    print('{} \t{}'.format(opname[byte], addendum))
LOAD_GLOBAL 	(name 'print')
LOAD_CONST 	(const 'The weather is great!')
CALL_FUNCTION 	(0 kw params, 1 positional params)
LOAD_CONST 	(const None)

Holy cow! A working CPython disassembler! I doubt that the CPython developers are be able to come up with something like that!

from dis import dis
2           0 LOAD_GLOBAL              0 (print)
            3 LOAD_CONST               1 ('The weather is great!')
            6 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
            9 POP_TOP
           10 LOAD_CONST               0 (None)
           13 RETURN_VALUE

Never mind. There already is a disassembly method.

Back to work.

Date created:
17 Oct 2015

You are more than welcome to share your thoughts via email