Intro

As part of my collection of Brainfuck interpreters (which includes, among others, an interpreter written in pure ASM, and others written in C++), I wanted to create one able to generate machine code on the fly, and execute it. Some sort of “handmade” JIT compiler, without relying on any library to achieve it.

Brainfuck is a simple language that contains just a few operators: to read/write a single byte, move the data pointer, increment/decrement and loop. It’s simple enough to easily create machine code equivalents for each, glue them together and execute the final result.

The goal is to convert any Brainfuck program such as the following one into executable x86_64 code, in runtime, in C++:

+++ [> ++++ <-]

(This “multiplies” 3 by 4, and produces a “0” in the first “cell” and a “12” in the second one)

Following the choice made for the C++ version, the memory will be composed of 32 bits cells.

At a high level, this is what’s needed:

  • parse the BF source and obtain the “expressions” tree
  • transform the tree into opcodes/operands in memory
  • make the memory where the opcodes reside ‘executable’ (otherwise, attempting to execute the code will fail, as regular programs shouldn’t be executing code outside of the code sections of the process)
  • jump to the generated code

The full source is here. It works on MacOS, and was tested with Clang 11.0.3 and GCC 10.1.0.

Parsing the BF source

For this part, we can just reuse the existing C++ parser, which includes an extra benefit: “compressing” contiguous identical operators.

In the case of the sample BF program (+++ [> ++++ <-]), the resulting tree looks like this:

[
    Increment(3),
    loop [
        Forward(1),
        Increment(4),
        Backward(1),
        Decrement(1)
    ]
]

This simple optimization avoids executing the same operation N times, executing it just once using N as an ‘operand’.

In a style inspired by the “Visitor” pattern, we’ll traverse the tree and emit opcodes for each.

Generating x86_64 opcodes

All the state we need to keep is the pointer to the current position in the main “memory” that BF programs use. The memory will be a uint32_t buffer, and we’ll use the RDI register for this purpose.

This means that for each BF operator, we need to generate opcodes that update it/use it accordingly.

As a reference, this is the C-equivalent for each BF operator, where ptr is the pointer to the program memory -what we’ll keep in RDI- (adapted from the BF to C translator)

    '>' = "++ptr;"            # forward
    '<' = "--ptr;"            # backward
    '+' = "++(*ptr);"         # increment
    '-' = "--(*ptr);"         # decrement
    '.' = "putchar(*ptr);"    # input
    ',' = "*ptr = getchar();" # output
    '[' = "while(*ptr) {"     # loop start
    ']' = "}"                 # loop end

Moving the pointer forward/backward

These are the simplest ones: we need to increment/decrement the RDI register by N. We can use the RAX register as auxiliary:

  movq    N, %rax
  addq    %rax, %rdi # or subq

Notice we’re using full quad-words, since we’re working with pointers and of course memory addressing utilizes all 64 bits.

The equivalent opcodes (for addq) are:

      48 c7 c0 01 00 00 00         	movq	$1, %rax
      48 01 c7                     	addq	%rax, %rdi

In this case we’re setting 1 to RAX (notice this is x86, so all values all little endian)

We can generate the opcodes on the fly like this:

    virtual void visit(const Forward& fwd) {
        buffer_.writes((uint8_t*)"\x48\xc7\xc0", 3);
        buffer_.writel(fwd.offset()*4);
        buffer_.writes((uint8_t*)"\x48\x01\xc7", 3);
    }

buffer_’s class provides methods to write single bytes, long words (32 bits) or arbitrary-length byte sequences.

Also notice we move the pointer N places by 4, because each cell is 4-bytes long.

Increment/decrement the data pointed-to

These are similar to the previous ones, but dereferencing the pointer. This is what we want to do:

  movq    N, %rax
  addl    %eax, (%rdi) # or subl

Here, since we’re operating on the data, the pointer is still 64 bits, but the operand is 32 (EAX). Opcodes + generation:

    virtual void visit(const Increment& inc) {
        // 0000000000000000 increment:
        //        0: 48 c7 c0 01 00 00 00          movq    $1, %rax
        //        7: 01 07                         addl    %eax, (%rdi)
        buffer_.writes((uint8_t*)"\x48\xc7\xc0", 3);
        buffer_.writel(inc.offset());
        buffer_.writes((uint8_t*)"\x01\x07", 2);        
    }

Input/output

Moving the pointer around and modifying data in memory are fine, but we need to interact with the outside world as well!

Now it starts to get interesting, as we’ll need to execute syscalls to read/write from/to stdin/stdout. This is in fact easier than using the libc for example, as we’ll talk directly with the kernel.

Using the original asm interpreter as reference, this is how the read/write syscalls can be invoked in MacOS:

 # ssize_t read(int fildes, void *buf, size_t nbyte);
 #              rdi         rsi        rdx
 movq $0, %rdi
 movq $1, %rsi
 movq $2, %rdx
 movq $$0x2000003, %rax # SYS_read (3)
 syscall

 # ssize_t write(int fildes, const void *buf, size_t nbyte);
 #               rdi         rsi              rdx
 movq $0, %rdi
 movq $1, %rsi
 movq $2, %rdx
 movq $$0x2000004, %rax # SYS_open (4)
 syscall

The file descriptor goes into RDI, the input/output buffer goes in RSI, the count in RDX and finally the syscall descriptor in RAX. Then we execute the syscall instruction (what was done with int 0x80 in the past)

The problem is that we need to use RDI, but we use it to hold the data pointer. We can easily solve this by pushing RDI to the stack and restoring it after we’re done.

To write a single byte to stdout, this is what we need: (read is similar, only the syscall descriptor -0x2000003- and the fd -0, stdout- change):

  pushq   %rdi              # save RDI
  movq    $0x02000004, %rax # SYS_write
  movq    %rdi, %rsi        # set RDI as the 'buf'
  movq    $1, %rdi          # fd: stdout
  movq    $1, %rdx          # len: 1
  syscall                   # execute the call
  popq    %rdi              # restore RDI

Opcodes + generation:

    virtual void visit(const Output&) {
        // 0000000000000042 write:
        //       42: 57                            pushq   %rdi
        //       43: 48 c7 c0 04 00 00 02          movq    $33554436, %rax
        //       4a: 48 89 fe                      movq    %rdi, %rsi
        //       4d: 48 c7 c7 01 00 00 00          movq    $1, %rdi
        //       54: 48 c7 c2 01 00 00 00          movq    $1, %rdx
        //       5b: 0f 05                         syscall
        //       5d: 5f                            popq    %rdi

        buffer_.writeb(0x57);
        buffer_.writes((uint8_t*)"\x48\xc7\xc0", 3);
        buffer_.writel(0x02000004);
        buffer_.writes((uint8_t*)"\x48\x89\xfe", 3);
        buffer_.writes((uint8_t*)"\x48\xc7\xc7\x01\x00\x00\x00", 7);
        buffer_.writes((uint8_t*)"\x48\xc7\xc2\x01\x00\x00\x00", 7);
        buffer_.writes((uint8_t*)"\x0f\x05", 2);
        buffer_.writeb(0x5f);
    }

Loops

Finally, loops brings a bit more of a challenge. Continuing with the same example, we need to replicate these semantics:

1. +++  = *ptr+=3;
2. [    = while(*ptr) {
3. >    =   ptr++;
4. ++++ =   *ptr+=4;
5. <    =   ptr--;
6. -    =   *ptr-=1;
7. ]    = }
8.      = (END)

In line 2 we want to enter the loop only if the cell pointed by ptr is > 0, otherwise we want to skip it and jump to line 8.

In line 7, we have two options:

  • unconditionally jump back to 2 (which, in case the condition is not true anymore we’ll perform two jumps: 7->2, 2->8)
  • check the condition and continue in case it’s not true, or jump back to 3 (to enter the loop again) in case it is

Option 2 is a bit more efficient, since it avoids some jumps. To compare the pointed data and jump, we need to:

loop_start:
  cmpl    $0, (%rdi)
  je      N          # if (%rdi) == 0, jump out of the loop

loop_end:
  cmpl    $0, (%rdi)
  jne     N          # if (%rdi) != 0, jump back to the 1st instruction

The challenge is to set N in each case, the appropriate jump distance. Mainly for the first jump, as we don’t know where the loop end will be.

The technique is to put a placeholder, push the current position to a stack, continue with the generation until we reach the end of the loop, calculate the distances and fill both jumps.

1. +++  = *ptr+=3;      = movq 3, rax; addl eax, (rdi);
2. [    = while(*ptr) { = cmp 0, (rdi); je [TBD1];
3. >    =   ptr++;      = movq 4, rax; addq rax, rdi;
4. ++++ =   *ptr+=4;    = movq 4, rax; addl eax, (rdi);
5. <    =   ptr--;      = movq 4, rax; subq rax, rdi;
6. -    =   *ptr-=1;    = movq 1, rax; subl eax, (rdi);
7. ]    = }             = cmp 0, (rdi); jne [TBD2];
8.      = (END)         = ret

We want [TBD1] to become ‘8’ (or, in fact, the distance between 8 and the end of 2, since jumps in x86_64 can only be relative, not absolute), and we want [TBD2] to be the distance between the end of 7 and 3.

This is the full solution:

    virtual void visit(const Loop& loop) {
        // 000000000000005e loop_start:
        //       5e: 83 3f 00                      cmpl    $0, (%rdi)
        //       61: 0f 84 00 00 00 00             je  0
        buffer_.writes((uint8_t*)"\x83\x3f\x00", 3);
        buffer_.writes((uint8_t*)"\x0f\x84", 2);
        buffer_.writel(0); // reserve 4 bytes

        // Save current position:
        uint8_t *after_loop_start = buffer_.get_ptr();

        // Recurse into subexpressions:
        for(const auto &child: loop.children()) {
            child->accept(*this);
        }

        // 0000000000000067 loop_end:
        //       67: 83 3f 00                      cmpl    $0, (%rdi)
        //       6a: 0f 85 00 00 00 00             jne 0
        buffer_.writes((uint8_t*)"\x83\x3f\x00", 3);
        buffer_.writes((uint8_t*)"\x0f\x85", 2);
        // Calculate how much to jump back (consider the 4 bytes
        // of the operand itself):
        uint32_t jump_back = after_loop_start - buffer_.get_ptr() - 4;
        // Append the distance to the jump:
        buffer_.writel(jump_back);

        // Now calculate how much to jump forward in case we want
        // to skip the loop, to fill the pending jump:
        uint8_t *after_loop_end = buffer_.get_ptr();
        uint32_t jump_fwd = after_loop_end - after_loop_start;

        // Go back to the original position (-4), fill the pending
        // jump distance and get back to the the current pos:
        buffer_.set_ptr(after_loop_start - 4);
        buffer_.writel(jump_fwd);
        buffer_.set_ptr(after_loop_end);
    }

Marking the memory as executable

For security reasons, dynamic memory is not executable by default. That would prevent our awesome generated code from running! To avoid this problem, we need to mark the pages where the code resides as executable.

We can either allocate some pages and mark them as executable right away, or even better, allocate them and update them later, after the code has been generated. This a good practice, as it guarantees that the pages are not writable and executable at the same time at any moment.

Since we can’t mark any position as executable, we need to mark full pages, we still need to use mmap() to allocate, since the memory allocated will be on a page boundary.

Here we initialize the buffer that will contain the generated code in read/write mode:

    ExecutableBuffer(size_t size) {
        size_ = size;
        buf_ = (uint8_t*) mmap(
            0,
            size_,
            PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS,
            -1,
            0
        );
        if (buf_ == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        ptr_ = buf_;
        memset(ptr_, 0x00, size);
    }

And when we’re done, we switch to read/exec:

    void make_executable() {
        if (mprotect(buf_, size_, PROT_READ | PROT_EXEC) == -1) {
            perror("mprotect");
            exit(1);
        }
    }

Jump to the generated code

Once we have all generated code in a memory region suitable for execution, we need to jump from our C++ program to it. The first step is to initialize the RDI register, so that it points to a valid uint32_t buffer. We can do that with a bit of inline assembly. Immediately after that, we need to treat the pointer to the generated code as a pointer to a function, and invoke it. That’s it. That will make the processor start executing the opcodes we generated on the fly:

    void run() {
    	// define buffer
        uint32_t memory[30000];

        // set its address in RDI
        asm("movq %0, %%rdi\n":
            : "r"(&memory)
            : "%rdi");

        // enter our code!
        ((void (*)()) program)();
    }

Using LLDB to inspect the memory and the generated code:

Memory before execution:

$ lldb -s lldb-commands.txt ./brainfuck-jit -- test.bf
[...]
(lldb) b JITProgram::run
Breakpoint 1: where = brainfuck-jit`main + 1142 [inlined] JITProgram::run() at brainfuck-jit.cpp:288, address = 0x00000001000021f6
(lldb) r
[...]
(lldb) p &memory
(uint32_t (*)[30000]) $0 = 0x00007ffeefbe2230
(lldb) x $0
0x7ffeefbe2230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
0x7ffeefbe2240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

Generated code:

(lldb) p buf_.buf_
(uint8_t *) $1 = 0x0000000100500000 "H��\x03"
(lldb) x -c 80 $1
0x100500000: 48 c7 c0 03 00 00 00 01 07 83 3f 00 0f 84 2f 00  H��.......?.../.
0x100500010: 00 00 48 c7 c0 04 00 00 00 48 01 c7 48 c7 c0 04  ..H��....H.�H��.
0x100500020: 00 00 00 01 07 48 c7 c0 04 00 00 00 48 29 c7 48  .....H��....H)�H
0x100500030: c7 c0 01 00 00 00 29 07 83 3f 00 0f 85 d1 ff ff  ��....)..?...���
0x100500040: ff c3 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ��..............
(lldb) disas -b -s $1 -c 40
    0x100500000: 48 c7 c0 03 00 00 00  movq   $0x3, %rax
    0x100500007: 01 07                 addl   %eax, (%rdi)
    0x100500009: 83 3f 00              cmpl   $0x0, (%rdi)
    0x10050000c: 0f 84 2f 00 00 00     je     0x100500041
    0x100500012: 48 c7 c0 04 00 00 00  movq   $0x4, %rax
    0x100500019: 48 01 c7              addq   %rax, %rdi
    0x10050001c: 48 c7 c0 04 00 00 00  movq   $0x4, %rax
    0x100500023: 01 07                 addl   %eax, (%rdi)
    0x100500025: 48 c7 c0 04 00 00 00  movq   $0x4, %rax
    0x10050002c: 48 29 c7              subq   %rax, %rdi
    0x10050002f: 48 c7 c0 01 00 00 00  movq   $0x1, %rax
    0x100500036: 29 07                 subl   %eax, (%rdi)
    0x100500038: 83 3f 00              cmpl   $0x0, (%rdi)
    0x10050003b: 0f 85 d1 ff ff ff     jne    0x100500012
    0x100500041: c3                    retq

Memory AFTER execution:

(lldb) x $0
0x7ffeefbe2230: 00 00 00 00 0c 00 00 00 00 00 00 00 00 00 00 00  ................
0x7ffeefbe2240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

The first 4 bytes (the 1st cell) are 00 00 00 00, and the next 4 (2nd cell) are 0c 00 00 00, 0x0c in little endian (12), just as expected.

Benchmark

Heavy BF programs like the Mandelbrot Set generator, or the prime numbers generator run much faster compared to the optimized interpreters:

  Primes up to 200 Mandelbrot Set
C 3878ms 14619ms
Rust 4123ms 7270ms
C++ 2827ms 5272ms
C++-JIT 644ms 1157ms