Dynamic Translation for Fast Interpreters

CS 641 Lecture, Dr. Lawlor

If there's one single operation that can totally derail the modern microprocessor, it's the conditional branch instruction. You can keep pipelines full if, and only if, you can predict exactly where every branch is going to land.

That's hard.

It's especially hard when code keeps going in different directions. For example, in the core of any interpreter, you have to look at what the script is telling you to do, and do it:

	switch (the_next_instruction[instruction_counter++]) {
	case instruction_1: ... do instruction 1 ... break;
	case instruction_2: ... and so on ...
	}

How can the branch predictor do its job, when the code itself doesn't know what instruction is coming next? It can't, and so the pipeline has to be flushed and re-flushed, and interpreter performance is quite bad.

Enter dynamic binary translation: the interpreter reads the script and writes the corresponding machine code at runtime. Generally you write the code once, then execute it repeatedly. Because the CPU is seeing ordinary machine code, performance is much better--often very competitive with a compiled language, if you leave aside things like memory allocation and garbage collection.

The Java VM dynamically translates Java bytecode. They call this "Just-In-Time compilation", or JIT.
Microsoft does it to Common Language Runtime (CLR) bytecode, used for .NET.
Google does it in their fast JavaScript "V8" engine (see below).
Haskell does it in the compiled version of that language.
Lots of companies do this to run code for one processor on another processor. Apple's done this twice: during the transition from 68000 to PowerPC (the "68k emulator"), and now during the transition to x86 ("Rosetta").

An Example

Here's a little bit of C++ that calls some assembly to do work on floats:

extern "C" float do_work(float x,float y);

int foo(void) {
	float r=do_work(1.7,1000.0);
	std::cout<<"Return value = "<<r<<"\n";
	return 0;
}

(Try this in NetRun now!)

Here's the corresponding assembly:

global do_work
do_work:

addss xmm0,xmm1
ret

(Try this in NetRun now!)

That assembly code is translated into these 5 bytes of machine code:

0000000000000000 <do_work>:
   0:	f3 0f 58 c1          	addss  xmm0,xmm1
   4:	c3                   	ret

Here's an idea: let's skip the assembler, and set up those bytes ourselves!

typedef float (*work_fn)(float x,float y); // function pointer prototype
const char work_mc[]={
	0xf3,0x0f,0x58,0xc1, //	addss  xmm0,xmm1
	0xc3                 //	ret  
};

int foo(void) {
	work_fn f=(work_fn)work_mc; // make machine code array into callable function
	
	float r=f(1.7,1000.0);
	std::cout<<"Return value = "<<r<<"\n";
	return 0;
}

(Try this in NetRun now!)

Note that "work_mc" starts out as just an array of bytes, but we typecast it into a function pointer, and then run those bytes as machine code!

Generalize Across Operations

It's pretty easy to figure out the machine code for all the operations you want. A disassembler makes this easy:

0f 58 c0             	addps  xmm0,xmm0
0f 59 c0             	mulps  xmm0,xmm0
0f 5c c0             	subps  xmm0,xmm0
0f 5e c0             	divps  xmm0,xmm0
0f 5d c0             	minps  xmm0,xmm0
0f 5f c0             	maxps  xmm0,xmm0
0f 51 c0             	sqrtps xmm0,xmm0
0f 52 c0             	rsqrtps xmm0,xmm0
0f 53 c0             	rcpps  xmm0,xmm0
0f 54 c0             	andps  xmm0,xmm0
0f 55 c0             	andnps xmm0,xmm0
0f 56 c0             	orps   xmm0,xmm0
0f 57 c0             	xorps  xmm0,xmm0
0f 28 c0             	movaps xmm0,xmm0

0f 58 c0             	addps  xmm0,xmm0
0f 58 c1             	addps  xmm0,xmm1
0f 58 c2             	addps  xmm0,xmm2
0f 58 c3             	addps  xmm0,xmm3

0f 58 c0             	addps  xmm0,xmm0
0f 58 c8             	addps  xmm1,xmm0
0f 58 d0             	addps  xmm2,xmm0
0f 58 d8             	addps  xmm3,xmm0

f3 0f 58 c0          	addss  xmm0,xmm0
0f 58 c0             	addps  xmm0,xmm0
f2 0f 58 c0          	addsd  xmm0,xmm0
66 0f 58 c0          	addpd  xmm0,xmm0

(Try this in NetRun now!)

This is a pretty simple table, so we can even put together a totally new SSE function from these instructions:

#include <sys/mman.h>

typedef float (*do_work_fn)(float x,float y); 
/* Make this region of memory executable, using "mprotect" */
void executable_region(void *start_ptr,void *end_ptr) {
	long start=(long)start_ptr, end=(long)end_ptr;
	start&=~0xfff; end+=0xfff; end&=~0xfff; /* round to page boundaries */
	if (0!=mprotect((void *)start,end-start,PROT_READ|PROT_WRITE|PROT_EXEC))
		std::cout<<"Unable to mprotect between pointers "<<start<<" and "<<end<<".\n";
}

// Lists x86 SSE instructions' machine code.  Format is F3 0F <this byte> <modrm>
typedef enum {
	addss=0x58, subss=0x5C,
	mulss=0x59, divss=0x5E,
	minss=0x5D, maxss=0x5F,
	sqrtss=0x51, rsqrtss=0x52, rcpss=0x53
} sse_instruction_type;

int foo(void) {
	std::vector<char> mc;
	std::string line;
	while (std::getline(std::cin,line)) {
		int D=line[0]-'0', S=line[3]-'0';
		mc.push_back(0xf3);  // one float (ss)
		mc.push_back(0x0f); // SSE
		switch (line[1]) {
		case '+': mc.push_back(addss); break;
		case '-': mc.push_back(subss); break;
		case '*': mc.push_back(mulss); break;
		case 'm': mc.push_back(minss); break;
		case 'M': mc.push_back(maxss); break;
		default: std::cout<<"OHNO!!!  Unrecognized line "<<line<<"\n";
		}
		mc.push_back(0xC0+(D<<3)+S);   // xmmD,xmmS
	}

	mc.push_back(0xc3); // return

	executable_region(&mc[0],&mc[mc.size()-1]);

	do_work_fn f=(do_work_fn)&mc[0];
	float r=f(1.7,10.0);
	std::cout<<"Return value = "<<r<<"\n";
	return 0;
}

(Try this in NetRun now!)

function_maker class

Here's a more object-oriented example program:

typedef float (*work_fn)(float x,float y); // function pointer prototype

// Lists x86 SSE instructions' machine code.  Format is F3 0F <this byte> <modrm>
typedef enum {
	addss=0x58, subss=0x5C,
	mulss=0x59, divss=0x5E,
	minss=0x5D, maxss=0x5F,
	sqrtss=0x51, rsqrtss=0x52, rcpss=0x53
} sse_instruction_type;

// Creates a new machine code function at runtime, instruction by instruction
class function_maker {
public:
	function_maker() {has_ret=false;}

	// Add this instruction to our function.  src and dest are register numbers
	void instruction(int dest,sse_instruction_type opcode,int src) {
		mc.push_back(0xF3); mc.push_back(0x0F); // SSE-float prefix bytes
		mc.push_back(opcode);
		int modrm=(3<<6) | (dest<<3) | (src);
		mc.push_back(modrm);
	}
	
	// Run the new function
	float run(float x,float y) {
		if (!has_ret) { // we need to add a return instruction before running
			has_ret=true;
			mc.push_back(0xc3); /* "ret" instruction */
		}
		unsigned char *mc_data=&mc[0]; // get a pointer to machine code
		work_fn f=(work_fn)mc_data; // make executable
		return f(x,y); // run the new function
	}
private:
	std::vector<unsigned char> mc; /* new function's machine code */
	bool has_ret; /* did we add the "ret" instruction yet? */
};

int foo(void) {
	function_maker f;
	f.instruction(0,addss,1);
	float x=f.run(1.7, 1000.0);
	f.print();
	std::cout<<"Function returned "<<x<<"\n";
	return 0;
}

(Try this in NetRun now!)

Depending on the machine's configuration, this may crash instead of running your new function, because std::vector's heap memory is not marked as being executable (the "No eXecute" or NX bit is set). We can change the heap's memory permission on a per-page basis using "mprotect", a function that annoyingly only works on page aligned pointers. We'll talk about the page table stuff later.

#include <sys/mman.h> /* Needed to work around "No eXecute" (NX) heap memory */

...
		/* Work around NX bit, by marking entire page as executable */
			long mc_start=(long)&mc[0], mc_end=(long)&mc[mc.size()-1];
			mc_start&=~0xfff; mc_end=(mc_end+0xfff)&~0xfff; /* round to page size */
			mprotect((void *)mc_start,mc_end-mc_start, 
				PROT_READ+PROT_WRITE+PROT_EXEC);
...

(Try this in NetRun now!)

Here's a way to use function_maker to translate a simple line-by-line script into machine code:

int foo(void) {
	function_maker f;
	std::string line;
	while (std::getline(std::cin,line)) 
	{ /* Format of lines:   
			rD?=rS; character
			0123456 array index
	*/
		sse_instruction_type op; /* which operation to perform */
		switch (line[2]) {
		case '+': op=addss; break;
		case '*': op=mulss; break;
		case 'v': op=sqrtss; break;
		default: std::cout<<"Unknown operation "<<line<<"!\n"; return 0;
		}
		f.instruction(line[1]-'0', op, line[5]-'0');
	}
	f.print();
	std::cout<<"Running newly created function...\n";
	float x=f.run(1.7, 10000.0);
	std::cout<<"Function returned "<<x<<"\n";
	return 0;
}

(Try this in NetRun now!)

The bottom line is that creating a function's machine code at runtime isn't actually that hard!

x86 Details: ModR/M byte

This byte is just a bit field giving a selector called "mod" (which indicates whether r/m is treated as a plain register or a memory address), one register called "reg/opcode" (which is usually the destination register, and determines the column in the ModR/M table), and a final register called "r/m" (usually the source register, which selects the row of the ModR/M table). These are stored in a single byte as shown below.

mod	reg/opcode	r/m
2 bits, selects memory or register access mode: 0: memory at register r/m 1: memory at register r/m+byte offset 2: memory at register r/m + 32-bit offset 3: register r/m itself (not memory)	3 bits, usually a register number Usually the destination (but for dyslexic instructions, it's the source register; or occasionally even an extra opcode)	3 bits, register number Usually defines the data source (except for dyslexic instructions) Treated as a pointer for mod!=3, treated as an ordinary register for mod==3. If r/m==4, indicates the real memory source is a SIB byte.

ModR/M byte encoding. See "ModR/M" table for meaning of values. ModR/M is much easier to write in octal, not hex, since the 3-bit fields match exactly with octal digits. You can write octal in C/C++ with a leading 0, so "0301" is octal for 011 000 001 (binary) or 0xC1 (hex).

x86 register numbering is about as bizarre as you've come to expect:

Register	eax	ecx	edx	ebx	esp	ebp	esi	edi
Number	0	1	2	3	4	5	6	7

For example, a ModR/M byte like:

ModR/M==0005 (in octal) is composed of mod==0 (so accessing memory), reg/opcode==0 (meaning the normal-destination register is eax, register 0), and r/m==5 (meaning the source is memory pointed to by ebp, register 5).
ModR/M==0312 (octal again) means mod==3 (so we're only touching registers, not memory), reg/opcode==1 (meaning destination is ecx, register 1), and r/m==2 (meaning the normal-source register is edx, register 2)

The saddest part of the ModR/M byte is that there are sometimes two forms of a given instruction. The "normal" versions read from r/m and write to reg/opcode. The "dyslexic" versions do the exact opposite--they read from reg/opcode, and write to r/m. Normal and dyslexic are my own terminology; you won't find these names elsewhere. Normal versions can hence read from memory or a register, but can only write their results to a register. Dyslexic versions are the other way around: they can't read from memory (only registers), but can write to registers or memory.

For example, there are two flavors of 32-bit add: "0x03" is normal, "0x01" is dyslexic. Hence:

For "0x03 0312", we add the "r/m"-listed register 2 (edx) to the "reg/opcode"-listed register 1 (ecx).
For "0x01 0312", we add the "reg/opcode"-listed register 1 (ecx) to the "r/m"-listed register 2 (edx).

This means "0x03 0312" and "0x01 0321" are two different but totally equivalent ways to write "add ecx,edx".

Machine Code	Assembly	C
0x03 0312	add ecx,edx	c+=d;
0x03 0012	add ecx,[edx]	c+=*d;
0x01 0012	add [edx],ecx	*d+=c;
0x01 0312	add edx,ecx	d+=c;

Normal and dyslexic versions of exist for all the ancient instructions: ADD (0x03 and 0x01), SUB (0x2B and 0x29), MOV (0x8B and 0x89), CMP (0x3B and 0x39), AND (0x23 and 0x21), etc.

Only normal versions exist for the SSE instructions used above (0xF3 0x0F 0x50-0x5F). For example,

0xf3, 0x0f, 0x58, 0301

"0xf3 0x0f 0x58" is the "add SSE scalar single" opcode, "addss". "0xc1" is the ModR/M byte, which from the "ModR/M" table gives mod==3 (a register-register operation), reg/opcode==0 (the destination register is xmm0), and R/M==1 (the source register is xmm1). So all together this is "addss xmm0,xmm1".

Production Code: Dynamic Translation inside Google's V8

Google's dynamic translation javascript engine is called V8. You can grab the code and build it on a UNIX machine like this:

svn co http://v8.googlecode.com/svn/trunk/ v8
cd v8
svn co http://gyp.googlecode.com/svn/trunk build/gyp
make

A typical chunk of dynamic translation generation code is in src/ia32/full-codegen-ia32.cc:

    Label ok;
    __ test(ecx, ecx);
    __ j(zero, &ok, Label::kNear);
    // +1 for return address.
    int receiver_offset = (info->scope()->num_parameters() + 1) * kPointerSize;
    __ mov(ecx, Operand(esp, receiver_offset));
    __ JumpIfSmi(ecx, &ok);
    __ CmpObjectType(ecx, JS_GLOBAL_PROXY_TYPE, ecx);
    __ j(not_equal, &ok, Label::kNear);
    __ mov(Operand(esp, receiver_offset),
           Immediate(isolate()->factory()->undefined_value()));
    __ bind(&ok);

There are several interesting things happening here:

Label is a C++ object, here used with their pseudo-operation "bind" to easily jump to a forward reference.
"__" is a macro, expanding to access methods from an Assembler object ("__" becomes "masm->")
The Assembler object (src/ia32/assembler-ia32.h) has methods for each x86 machine code instruction.
ecx is yet another object, here a Register object initialized with the ModR/M code for that register.

The nice part about this is most of the outer conditionals get folded out of the generated machine code, resulting in much better branch performance.

There are tricky things about this, however. Typically machine code has full access to the machine, introducing the possibility of a JavaScript stack smashing attack, for example. For untrusted code, or to retain the fail-soft nature of most interpreters, you need to either insert runtime array indexing checks (which makes things run slower), or use a hardware trick like guard pages.