Instruction Encoding & Frequency

From the Literature

William Dally has a good presentation on power usage.
Conservation Cores: G. Venkatesh et. al., "Conservation Cores: Reducing the Energy of Mature Computations," Hot Chips, September 2010. (See also gizmag popular press version.)

CISC vs RISC vs VLIW

Back in the 1970's, when bits were expensive, the typical CPU encoding used exactly as many bytes as each instruction needed and no more. For example, a "return" instruction might use one byte (0xc3), while a "load a 32-bit constant" instruction might use five bytes (0xb8 <32-bit constant>). These variable-sized instructions are (retroactively) called Complex Instruction Set Computing (CISC), and x86 is basically the last surviving CISC machine.

During the 1980's, folks realized they could build simpler CPU decoders if all the instructions took the same number of bytes, usually four bytes. This idea is called Reduced Instruction Set Computing (RISC), and was built into MIPS, PowerPC, SPARC, DEC Alpha, and other commercial CPUs. Here's a good but long retrospective article on the RISC-vs-CISC war, which got pretty intense during the 1990's. Nowadays, RISC machines might compress their instructions (like CISC), while CISC machines usually decode their instructions into fixed-size blocks (like RISC), so the war ended in the best possible way--both sides have basically joined forces!

During the late 1980's and early 1990's, several companies created even longer instruction machines, called Very Long Instruction Word (VLIW), where basically each part of the CPU has corresponding bits in every instruction. This makes for very simple decoding, and allows some interesting parallel tricks, but each instruction might be a hundred bytes or more! Modern graphics processors are typically VLIW internally, and there are several strange digital signal processor chips that are VLIW, but the concept hasn't really caught on for the main CPU. And any company that produced a successful VLIW chip would have a big problem building an improved processor, since each instruction specifically describes what should happen on each part of the old chip.

Assembly Basics

So here's some assembly language:

        Machine Code:           Assembly Code:
Address                    Instruction Operands
   0:	55                   	push   ebp
   1:	89 e5                	mov    ebp,esp
   3:	b8 07 00 00 00       	mov    eax,0x7
   8:	5d                   	pop    ebp
   9:	c3                   	ret

Address. A byte count, indicating where you are in a big list of bytes. The first byte has address zero. An address can be thought of as an array index into a big array of chars.
Machine Code. A set of bytes that the CPU can treat as instructions and execute. For example, the byte "0xC3" tells the CPU to execute the instruction "ret" (return from function). Human beings can write machine code (just wait for HW3!), but usually humans write Assembly Code instead.
Assembly Code. The human-readable counterpart to machine code. Assembly code is line-oriented human readable text that lots of people write by hand. You use an "assembler" to turn assembly code into executable machine code. You can use a "disassembler" to turn executable code into assembly code (try the NetRun "Disassemble" checkbox for the code above!).
Instruction. One command for the CPU to execute. Can be thought of as: one line of assembly code, or one set of machine code bytes. Depending on the context, sometimes "instruction" includes the operands ("the 'push ebp' instruction"), sometimes it doesn't ("the 'push' instruction, with operand 'ebp'.").
Operands. The parameters taken by various instructions. These parameters can be constants, like the "0x7" above, addresses in memory, or registers.
Registers. Registers are little storage locations built into the CPU. They're used like variables in assembly language--you spend most of your time putting values into registers, doing arithmetic on register values, and moving values around between registers. For example, "ebp", "esp", and "eax" above are all registers. Unlike variables, there are a fixed number of registers (built into the design of the CPU), and the registers have fixed names (built into the assembler).

Here's a typical line of assembly code. It's one CPU instruction, with a comment:

	mov eax,1234 ;  I'm returning 1234, like the homework says...

(executable NetRun link

x86

Here are the commonly-used x86 registers:

eax. This is the register that stores a function's return value.
eax, ecx, edx, esi, edi. "Scratch" registers you can always overwrite with any value.
rdi, rsi, rdx, rcx, ... In 64-bit mode, these registers contain function arguments, in left-to-right order.
esp, ebp. Registers used to run the stack. Be careful with these!

There are some other older or newer and much more rarely-used x86 registers:

Size	Register names	Meaning (note: not the official meanings!)	Introduced in
8-bit	al,ah, bl,bh, cl,ch, dl,dh	"Low" and "High" parts of bigger registers	1972, Intel 8008
16-bit	ax, bx, cx, dx, si, di, sp, bp	"eXtended" versions of the original 8-bit registers	1978, Intel 8086/8088
32-bit	eax, ebx, ecx, edx, esi, edi, esp, ebp	"Extended eXtended" registers	1985, Intel 80386
64-bit	rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8, r9, r10, r11, r12, r13, r14, r15	"Really eXtended" registers	2003, AMD Opteron / Athlon64 2004, Intel EM64T CPUs

x86 is rather unique in that all the smaller registers from bygone eras are still right there as *part* of the new, longer registers. So for example, this code returns 0x0000AB00, because 0xAB is put into the next-to-lowest byte of eax:

	mov eax,0 ; Clear eax
	mov ah,0xAB ;  Move "0xAB" into the next-to-the-last byte of eax

(executable NetRun link)

Despite the weird names, the x86 registers are just encoded numerically in a 3-bit field for the eight options for 32-bit registers, or a 4-bit field for the 16 registers in 64-bit mode.

PowerPC

All of the above is for ordinary x86 machines (Intel, AMD, etc.) What about for PowerPC machines, like old Macs, the Xbox360 or the PlayStation 3? Well, the assembly code is very different in the gory details, but in the abstract it is absolutely identical:

        Machine Code:   Assembly Code:
Address             Instruction Operands
   0:	38 60 00 07 	li	r3,7
   4:	4e 80 00 20 	blr

Like x86, PowerPC machine code consists of bytes, with addresses, that represent assembly instructions and operands. PowerPC machine code also spends most of its time manipulating values in registers.

Unlike x86, there are 32 PowerPC registers, but the registers have uninteresting names (they're called r0 through r31). The names of the instructions are different; "li" in PowerPC (Load Immediate) is about like a "mov" in x86; "blr" (Branch to Link Register) serves the same purpose as "ret" in x86. With 32 registers, you need 5 bits to indicate the register number.

MIPS

The return value register (MIPS equivalent of eax) is "$2". To return from a function, jump to the link register $31. One instruction after a branch always executes, even if the branch is taken; this is the "branch delay slot".

li $2,0xb0b
jr $31

(Try this in NetRun now!)

MIPS is a 4-byte per instruction RISC machine almost identical to PowerPC.

CISC: Two Operand Operations

Binary operators are the most common: + - * / & | ^ << >>. Hence on many CISC machines such as x86, most instructions take just two operands, and the left operand is reused as the destination register:

mov eax,7
mov ecx,2
add eax,ecx
ret

(Try this in NetRun now!)

Here, the "add" instruction is really "a+=b".

One advantage of two-operand instructions is that you have only two operands, which saves bits. You can now use those saved bits to add new funky addressing modes! For example, x86 can encode all sorts of weird operand locations via the ModR/M byte. The ModR/M byte is what allows the same add instruction above to be used like "add eax,[ecx+edx*4+0x1979]", which accesses memory at base address ecx plus four times edx (like an "int" array) plus a constant offset (like a struct).

RISC: Three+ Operand Operations

RISC machines like PowerPC or MIPS often use three-operand operations: both source registers and the destination register are specified. If everything comes from a register, this doesn't actually take too many bits--three operands at five bits each is just fifteen bits, which in a 32-bit instruction leaves plenty of room for the opcode, any constants, padding, future expansion, etc.

For example, a MIPS add looks like this:

li $5,7
li $6,2
add $2,$5,$6
jr $31
nop

(Try this in NetRun now!)

The "add" instruction is really "a = b + c".

A RISC multiply-add instruction (a = b*c + d) is actually a four-operand operation! Both PowerPC and Itanium have multiply-adds.

Instruction Encoding Survey

It's informative to look at the disassembly of this code on several CPUs:

return 0xdeadbeef;

(Try this in NetRun now!)

On x86, there's a one-byte load prefix, followed by the 4-byte little-endian constant:

   0:	b8 ef be ad de       	mov    eax,0xdeadbeef
   5:	c3                   	ret

On PowerPC, because instructions are just 32 bits, you've got to split the 4-byte constant across two instructions, "load immediate shifted" the high 16 bits, then "or immediate" pastes in the low 16 bits. PowerPC is big-endian.

   0:	3c 60 de ad 	lis	r3,-8531
   4:	60 63 be ef 	ori	r3,r3,48879
   8:	4e 80 00 20 	blr

On MIPS, you also have the low half/high half split. MIPS has a "branch delay slot" after every branch, which always gets executed even if the branch is taken:

[   0] 0x   c:  3c 02 de ad           lui	r2,0xdead	
[   0] 0x  10:  03 e0 00 08           jr	r31
[   0] 0x  14:  34 42 be ef           ori	r2,r2,0xbeef

On SPARC, you also have a branch delay, and the constant is split up across instructions. But it's split oddly--first you get the high 22 bits with "sethi", and then the low 10 bits:

   0:	11 37 ab 6f 	sethi  %hi(0xdeadbc00), %o0
   4:	81 c3 e0 08 	retl 
   8:	90 12 22 ef 	or  %o0, 0x2ef, %o0	! deadbeef <foo+0xdeadbeef>

On DEC Alpha, the only big surprise is that the machine code is little-endian. "lda" actually adds, not OR's, the sign-extended "0xffffbeef" constant to "0xdeae0000", so the sign-extension combines with the high bits to give "0xdeadbeef" in register v0 on return.

   0:	ae de 1f 24 	ldah	v0,-8530
   4:	ef be 00 20 	lda	v0,-16657(v0)
   8:	01 80 fa 6b 	ret

Overall, you can see that all these RISC machines use four bytes per instruction. That extra space actually adds up, as we'll see next.

Here's Itanium, Intel's strange 1990's 64-bit architecture. Instructions come in quasi-VLIW 3-instruction "packets" of 16 bytes, but often the packets are full of no-ops:

   0:	04 00 00 00 01 80 	[MLX]       nop.m 0x0
   6:	de ff ff ff 7f 00 	            movl r8=0xffffffffdeadbeef
   c:	f1 b6 f5 6d 
  10:	1d 00 00 00 01 00 	[MFB]       nop.m 0x0
  16:	00 00 00 02 00 80 	            nop.f 0x0
  1c:	08 00 84 00       	            br.ret.sptk.many b0;;

Comparing Executable Size

One way to compare instruction sets is to compile the same program for several different CPUs. Here are some executable sizes from the various ports of NetBSD, to DEC alpha, AMD64 (x86-64), i386 (normal x86), and Mac PowerPC:

Bytes  Platform     Program
 36231 alphaInstall/bin/ls
 30844 amdInstall/bin/ls
 25781 i386Install/bin/ls
 29843 ppcInstall/bin/ls

314181 alphaInstall/bin/ksh
242843 amdInstall/bin/ksh
198403 i386Install/bin/ksh
240676 ppcInstall/bin/ksh

 10392 alphaInstall/usr/bin/env
  9770 amdInstall/usr/bin/env
  6690 i386Install/usr/bin/env
  7397 ppcInstall/usr/bin/env

112103 alphaInstall/usr/bin/yacc
 86140 amdInstall/usr/bin/yacc
 71033 i386Install/usr/bin/yacc
 82783 ppcInstall/usr/bin/yacc

170865 alphaInstall/usr/bin/g++
140117 amdInstall/usr/bin/g++
121689 i386Install/usr/bin/g++
142463 ppcInstall/usr/bin/g++

440233 alphaInstall/usr/bin/vi
374261 amdInstall/usr/bin/vi
308357 i386Install/usr/bin/vi
354142 ppcInstall/usr/bin/vi

854584 alphaInstall/usr/bin/cvs
652278 amdInstall/usr/bin/cvs
573478 i386Install/usr/bin/cvs
649416 ppcInstall/usr/bin/cvs

Plain x86 executables are reliably over 10% smaller than either 64-bit x86 or PowerPC; and over 30% smaller than DEC Alpha executables. Likely, this depends on the compiler and runtime system somewhat, but there really is an advantage to x86's complex encoding.

Surprisingly, compressing these binaries does not eliminate x86's advantage here, though they do reduce it somewhat:

32500081 alpha/alpha/binary/sets/base.tgz
27790777 amd/amd64/binary/sets/base.tgz
24908886 i386/i386/binary/sets/base.tgz
27816531 mac/macppc/binary/sets/base.tgz

Instruction Frequencies

For maximum data compression, Huffman codes use short codes to represent common symbols, and longer codes for the more rarely encountered symbols. Hence it's informative to look at the statistical breakdown of instructions used in large programs, such as the Linux C library on x86:

Instruction usage breakdown (by popularity):
	42.4% mov instructions
	5.0% lea instructions
	4.9% cmp instructions
	4.7% call instructions
	4.5% je instructions
	4.4% add instructions
	4.3% test instructions
	4.3% nop instructions
	3.7% jmp instructions
	2.9% jne instructions
	2.9% pop instructions
	2.6% sub instructions
	2.2% push instructions
	1.4% movzx instructions
	1.3% ret instructions
	...

This makes a little more sense broken into categories:

Load and store:	about 50% total
	42.4% mov instructions
	2.9% pop instructions
	2.2% push instructions
	1.4% movzx instructions
	0.3% xchg instructions
	0.2% movsx instructions

Branch: about 25% total
	4.9% cmp instructions
	4.7% call instructions
	4.5% je instructions
	4.3% test instructions
	3.7% jmp instructions
	2.9% jne instructions
	1.3% ret instructions
	0.4% jle instructions
	0.4% ja instructions
	0.4% jae instructions
	0.3% jbe instructions
	0.3% js instructions

Arithmetic: about 15% total
	5.0% lea instructions (uses address calculation arithmetic)
	4.4% add instructions
	2.6% sub instructions
	1.0% and instructions
	0.5% or instructions
	0.3% shl instructions
	0.3% shr instructions
	0.2% sar instructions
	0.1% imul instructions

So for this piece of code, the most numerically common instructions on x86 are actually just memory loads and stores (mov, push, or pop), followed by branches, and finally arithmetic--this low arithmetic density was a surprise to me! You can get a little more detail by looking at what stuff occurs in each instruction:

Registers used:
	30.9% "eax" lines (eax is the return result register, and general scratch)
	5.7% "ebx" lines (this register is only used for accessing globals inside DLL code)
	10.3% "ecx" lines
	15.5% "edx" lines
	11.7% "esp" lines (note that "push" and "pop" implicitly change esp, so this should be about 5% higher)
	25.9% "ebp" lines (the bread-and-butter stack access base register)
	12.0% "esi" lines
	8.6% "edi" lines

x86 does a good job of optimizing access to the eax register--many instructions have special shorter eax-only versions. But it should clearly be doing the same thing for ebp, and it doesn't have any special instructions for ebp-relative access.

Features used:
	66.0% "0x" lines (immediate-mode constants)
	69.6% "," lines (two-operand instructions)
	36.7% "+" lines (address calculated as sum)
	1.2% "*" lines (address calculated with scaled displacement)
	48.1% "\[" lines (explicit memory accesses)
	2.8% "BYTE PTR" lines (char-sized memory access)
	0.4% "WORD PTR" lines (short-sized memory access)
	40.7% "DWORD PTR" lines (int or float-sized memory)
	0.1% "QWORD PTR" lines (double-sized memory)

So the "typical" x86 instruction would be an int-sized load or store between a register, often eax, and a memory location, often something on the stack referenced by ebp with an immediate-mode offset. Something like 50% of instructions are indeed of this form!

Granted, this clearly depends a lot on the program you analyze, and the compiler used for that program. It's also highly dependent on the number of registers in your architecture; x86 has very few registers, meaning many functions access stuff on the stack that should be in registers. And of course this static count isn't the same as a dynamic runtime count, which is what really matters. But this sort of counting is still useful in designing better instruction sets and faster hardware--the bottom line is that most of real code is just data shuffling, with a low fraction of actual arithmetic.

Here's the little UNIX shell script I wrote to tally up the instructions used by an x86 program:

#!/bin/sh

file="$1"
d="dis.txt"
objdump -drC -M intel "$file" | \
	awk -F: '{print substr($2,24);}' | \
	grep -v "^$"  > "$d"
tot=`wc -l $d | awk '{print $1}'`
echo "$tot instructions total"

echo "Instruction usage breakdown:"
sort $d | awk '{
	if ($1==last) {count++;} 
	else {print count, last; count=0; last=$1;}
}' | \
sort -n -r | \
awk '{printf("	%.1f%% %s instructions\n",$1*100.0/'$tot',$2);}' \
> dis_instructions.txt 
head -15 dis_instructions.txt 

echo "Register and feature usage:"
for reg in eax ebx ecx edx esp ebp esi edi \
	"0x" "," "+" "*" "\[" \
	"BYTE PTR" "[^D]WORD PTR" "DWORD PTR" "QWORD PTR"
do
	c=`grep "$reg" "$d" | wc -l | awk '{print $1}'`
	echo | awk '{printf("	%.1f%% \"'"$reg"'\" lines\n",'$c'*100.0/'$tot');}'
done

Adiabatic Circuits: The Far Future

Evidently the low-power computing regime is dominated by adiabatic circuit designs.

	Ordinary Computer	Adiabatic Circuits
Goal	Get 'er done!	Substantially lower power use, especially at low clockrate.
Data storage	1's and 0's (bits)	1's and 0's (bits)
Assignments?	Yes	No (uses energy = kT ln 2)
Reversible?	No	Yes
Swap?	Yes	Yes
Logic gates	AND, OR, NOT	NOT, CNOT, CCNOT
Programming Model	Instructions	Reversible Instructions
Clock	Square wave	Two trapezoidal waves
When?	Now	Slowly, over next ten years?
Limits	Heat/power, hard problems	Only helps at low clockrate

Adiabatic circuits are based on a few interesting fundamental observations about modern circuit efficiency:

Slamming a wire from a 1 down to a 0 means dumping the 1's energy directly into the ground. It's more efficient to shuffle the 1 off somewhere else than to erase it entirely.
Writing a zero into a register actually reduces entropy: the universe has lost a disordered register value, and now has an orderly zero-filled register. You can't reduce entropy in one place without increasing it (even more) somewhere else, so writing a zero actually consumes energy, no matter how the circuitry manages the write.

Saed Younis' 1994 MIT PhD Thesis outlines the basic adiabatic circuit model and its inherent power advantage, which is up to several hundredfold at sufficiently low clock rates. These design principles have slowly been trickling into CPU designs piece by piece; still, most circuits are non-reversible and non-adiabatic today. The path to a fully adiabatic computation, assuming we ever get there, will have to change the instruction set at some point, because tons of operations are destructive, like "mov" (which irrevocably destroys the destination), and will need to be replaced with "swap" (which doesn't destroy information). Some future fully-adiabatic CPU will need reversible instructions, and in fact the compiler will probably need to generate entire "antifunctions" to undo the operation of each function. To really be 100% reversible, the display would have to suck the final answer back out of your brain, so some degree of irreversibility is usually considered acceptable in designing real systems.

The energy kT ln 2 needed to erase one digit near room temperature is about 10^-20 joules, which is... small. If your machine is chewing through 10 billion bits per clock cycle, and does 10 billion clocks per second, this is one joule/second, or one watt of irreducible "bit erasure cost". A typical desktop CPU is much smaller than this, and uses 30-100 watts for other things, so this isn't a very big effect yet. But depending on how silicon fabrication scales up, it could become a show-stopper in the next ten years, and drive us toward reversible programs.