Assembly Language

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

Here is a very simple assembly language function, the equivalent of the C++ "return 7;"

mov rax,7
ret

(Try this in NetRun now!)

"mov" is an instruction that moves values around.

Here we're moving the value 7 into the register "rax". Where any sane language would use variables, assembly language uses registers. The names and number of registers are baked into the design of the instruction set, and can't be changed.

The next instruction is "ret". It returns from the current function. By convention, the value returned is the value you stored in rax.

Here's a slightly more complex program:

mov rax,3
add rax,5
ret

This returns 8, because we've moved 3 into rax, then added 5 to it. In NetRun, hit "Options -> TraceASM" to see the changes to the register values after each line.
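
For comparison, here is a rough C++ sketch of the same program, with an ordinary local variable standing in for the register. The variable name "rax" and the function name "foo" are only illustrative; a real compiler decides for itself which registers to use.

```cpp
// Sketch: a local variable named rax stands in for the register.
long foo() {
    long rax = 3;  // mov rax,3
    rax += 5;      // add rax,5
    return rax;    // ret (the return value is whatever is in rax)
}
```

Calling foo() gives 8, just like the assembly version.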

One full assembly language instruction consists of an 'opcode' like 'mov' that says what to do, and a list of 'operands' like registers or immediate constants.
(Figure: the portions of an assembly instruction, as described above.)
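
To make the opcode/operand split concrete, here is a minimal C++ sketch that chops one line of assembly into those two parts. The "Instruction" struct and "splitInstruction" function are hypothetical helpers invented for this illustration, not part of any real assembler, and the sketch assumes the simple "opcode operand,operand" layout with no comments or labels.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type for illustration only.
struct Instruction {
    std::string opcode;                 // what to do, e.g. "add"
    std::vector<std::string> operands;  // what to do it to, e.g. {"rax", "5"}
};

// Split one line like "add rax,5" into its opcode and operand list.
Instruction splitInstruction(const std::string& line) {
    Instruction ins;
    std::size_t i = 0;
    // Skip leading blanks.
    while (i < line.size() && (line[i] == ' ' || line[i] == '\t')) ++i;
    // The opcode runs up to the next blank.
    while (i < line.size() && line[i] != ' ' && line[i] != '\t') {
        ins.opcode += line[i++];
    }
    // The rest is a comma-separated operand list; blanks are ignored.
    std::string current;
    for (; i < line.size(); ++i) {
        char c = line[i];
        if (c == ' ' || c == '\t') continue;
        if (c == ',') {
            ins.operands.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) ins.operands.push_back(current);
    return ins;
}
```

For "add rax,5" this yields the opcode "add" and the operands "rax" and "5"; for "ret" the operand list is empty.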

The most common assembler error is "invalid combination of opcode and operands", which means the hardware doesn't support your operand list.

Assembly Formatting: line oriented

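Unlike C/C++, assembly language opcodes and register names are not case sensitive. This means "mov rax,7" and "mOv RaX, 7" are equivalent. (Names you define yourself, such as labels, are still case sensitive in assemblers such as NASM.)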

A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly! I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:

	mov rcx, 5;  mov rax, 3;   Whoops!

It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
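
One way to see what the assembler sees is to throw away everything from the first semicolon onward. The "stripComment" function below is a hypothetical helper written for this illustration, and it ignores the corner case of a semicolon inside a quoted string.

```cpp
#include <string>

// Return the part of a line the assembler actually looks at:
// everything before the first semicolon (or the whole line if there is none).
std::string stripComment(const std::string& line) {
    return line.substr(0, line.find(';'));
}
```

Given "mov rcx, 5;  mov rax, 3;   Whoops!", this returns just "mov rcx, 5": the second mov never reaches the CPU.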

Unlike C/C++, assembly is line-oriented, so each instruction has to fit entirely on its own line, and the following WILL NOT WORK:

	mov rax,
	         5

Line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text!

Simple Assembly Registers

Assembly doesn't have variables; instead, you operate on data in registers. A register is a tiny piece of memory hardware inside the CPU, with a fixed size. When the CPU executes a line like "mov rax,7" it stores the constant 7 into the register rax, which is 64 bits wide, the width of a "long int" in C or C++ on 64-bit Linux (on 64-bit Windows, "long" is only 32 bits, and you need "long long" to get 64). Just like most C++ programs spend their time shuffling values between variables, most assembly programs spend their time shuffling values between registers.

Here are some of the more friendly, easy to use registers, and who uses them.

Notes	64-bit register
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too.	rax
Scratch register. Some instructions use it as a counter (such as SAL or REP).	rcx
Scratch register. Multiply instructions put the high bits of the result here.	rdx
Scratch register. Function argument #1 in 64-bit Linux.	rdi
Scratch register. Also used to pass function argument #2 in 64-bit Linux.	rsi
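
The rax/rdx split for multiplication can be sketched in portable C++. This models the one-operand form of multiply, where two 64-bit inputs produce a 128-bit result. The "Product" struct and "wideMultiply" function are hypothetical names made up for this example; the hardware does all of this in a single instruction.

```cpp
#include <cstdint>

// Hypothetical struct for illustration: the two halves of a 128-bit product.
struct Product {
    std::uint64_t rax;  // low 64 bits of the result
    std::uint64_t rdx;  // high 64 bits of the result
};

// Multiply two 64-bit values and keep all 128 bits of the result,
// by working on 32-bit halves so nothing overflows along the way.
Product wideMultiply(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xFFFFFFFFu;
    std::uint64_t aLo = a & mask, aHi = a >> 32;
    std::uint64_t bLo = b & mask, bHi = b >> 32;

    std::uint64_t ll = aLo * bLo;  // contributes at bit 0
    std::uint64_t lh = aLo * bHi;  // contributes at bit 32
    std::uint64_t hl = aHi * bLo;  // contributes at bit 32
    std::uint64_t hh = aHi * bHi;  // contributes at bit 64

    // Sum everything that lands in bits 32..63, keeping the carries.
    std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

    Product p;
    p.rax = (mid << 32) | (ll & mask);
    p.rdx = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return p;
}
```

For small inputs such as 6 and 7, rdx comes back as 0 and rax holds the whole answer, which is why you can often ignore rdx. Just remember that this form of multiply overwrites it anyway.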

The full 64-bit rax register also has a 32-bit lower half called eax, which we'll cover later. There are also other less friendly registers, such as "rbx", that you must put back the way you found them before you return, or the code that called you may crash.

The big problem with registers is they're in *hardware*: you're stuck with the existing names and sizes, and every function has to share them, just like global variables. If you made up a new language where there are only five global variables, with weird hardcoded names, you'd be laughed straight to the HR office to be fired!
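
Here is a small C++ sketch of what "every function has to share them" means in practice, with five global variables playing the part of the registers. The functions "helper" and "caller" are hypothetical, invented only to show the problem.

```cpp
#include <cstdint>

// Sketch: the "registers" as global variables shared by every function.
std::uint64_t rax, rcx, rdx, rdi, rsi;

// A helper that uses rcx as scratch space for its own work.
void helper() {
    rcx = 100;      // overwrites whatever the caller had in rcx
    rax = rcx + 1;  // "returns" 101 in rax
}

std::uint64_t caller() {
    rcx = 5;     // the caller keeps a value in rcx...
    helper();    // ...but helper silently replaces it
    return rcx;  // 100, not 5
}
```

Nothing stops helper from trashing the caller's rcx. This is why scratch registers can't be trusted to survive a function call.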

Simple Assembly Instructions

Assembly is a very strange language, designed mostly around the machine it runs on, not around the programmer. For example, "mov" and "ret" are instructions for the CPU to execute. You can't add new instructions without changing the CPU; for example, Intel added the instruction "aesenc" (AES encryption) in 2010. Hundreds of instructions have been added over the years, but some commonly used instructions are:

Opcode	C++	Examples
add	+	add rcx,13  /  add rax,rcx
sub	-	sub rax,1  /  sub rax,rcx
imul	*	imul rax,rcx
and	&	and rax,rcx
or	|	or rax,rcx
xor	^	xor rax,rcx
not	~	not rax

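Note that almost all of them take exactly two operands: first the destination, then the source. The destination is both an input and the place the result is stored, so "add rax,rcx" means rax += rcx. The source can be a register or a constant, as in "add rcx,13". The one exception in this table is "not", which takes a single operand and flips its bits in place.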
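
The table's pattern can be sketched in C++ as a set of tiny functions that each modify their first argument in place. The "asm_" names are made up for this illustration (plain "and", "or", "xor" and "not" are reserved words in C++), and unsigned 64-bit values stand in for the registers.

```cpp
#include <cstdint>

// Each function mirrors one opcode from the table: the first argument is the
// destination (read, then overwritten) and the second is the source.
void asm_add (std::uint64_t& dst, std::uint64_t src) { dst += src; }
void asm_sub (std::uint64_t& dst, std::uint64_t src) { dst -= src; }
void asm_imul(std::uint64_t& dst, std::uint64_t src) { dst *= src; }  // low 64 bits only
void asm_and (std::uint64_t& dst, std::uint64_t src) { dst &= src; }
void asm_or  (std::uint64_t& dst, std::uint64_t src) { dst |= src; }
void asm_xor (std::uint64_t& dst, std::uint64_t src) { dst ^= src; }
void asm_not (std::uint64_t& dst)                    { dst = ~dst; }  // one operand

// A short sequence using the helpers, one "instruction" per line.
std::uint64_t demo() {
    std::uint64_t rax = 6, rcx = 3;
    asm_add(rcx, 13);   // add rcx,13   -> rcx = 16
    asm_add(rax, rcx);  // add rax,rcx  -> rax = 22
    asm_xor(rax, rcx);  // xor rax,rcx  -> rax = 22 ^ 16 = 6
    asm_not(rax);       // not rax      -> every bit of 6 flipped
    return rax;
}
```

Each call overwrites its destination, so the old value is gone unless you copied it somewhere first.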

Be careful doing these! Assembly is *line* oriented, so you can't say anything like this:
add rdx, rax-rcx ; won't work, need to do the subtraction on a second line
add rdx,(sub rax,rcx) ; won't work, parentheses don't let you have more than one instruction per line

but you can say:
sub rax,rcx
add rdx,rax

In assembly, arithmetic has to be broken down into one operation at a time, one instruction per line. This is actually a limitation of the instruction set, a reflection of the hardware available at the time Intel designed these instructions.
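
In C++ terms, the working two-line version corresponds to something like the sketch below, where the function name "addDifference" is hypothetical and the parameters are named after the registers they stand in for.

```cpp
// Sketch of "rdx += rax - rcx" done one operation at a time.
long addDifference(long rax, long rcx, long rdx) {
    rax -= rcx;  // sub rax,rcx  (the old value of rax is now gone)
    rdx += rax;  // add rdx,rax
    return rdx;
}
```

For example, addDifference(10, 4, 100) gives 106. The subtraction destroys the original rax, so copy it to another register first if you still need it.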

One caution: if you see some assembly where the register names have a percent sign in front of them, like "%rax", you're probably looking at the GNU AT&T syntax (common on Linux), which annoyingly lists the operands in the reverse order (source first, destination last) from the Intel syntax we'll be using (common on Windows).

Arithmetic In Assembly

Here's how you add two numbers in assembly:

Put the first number into a register
Put the second number into a register
Add the two registers
Return the result

Here's the C/C++ equivalent:

long foo() {
	long a = 3;
	long c = 7;
	a += c;
	return a;
}

And finally here's the assembly code:

mov rax, 3
mov rcx, 7
add rax, rcx
ret

(executable NetRun link)

Assembly and Machine Code

When we write C++ code, the compiler transforms it into executable machine code that actually runs on the CPU hardware. Machine code is a block of binary data in which each short group of bytes corresponds to exactly one line of assembly language:

0000000000000000 <foo>:
   0:	b8 07 00 00 00       	mov    eax,0x7
   5:	c3                   	ret   

        Machine Code (in hex)   Assembly Language

An assembler converts assembly source code into binary machine code so you can run it on the CPU.
A disassembler converts a runnable program (binary machine code) into human readable assembly.
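
To show how mechanical that conversion is, here is a toy disassembler in C++ that understands only the two instructions in the listing above: the byte b8 followed by a 4-byte constant (stored low byte first) is "mov eax,constant", and the byte c3 is "ret". The function name "disassemble" is hypothetical, and a real disassembler such as objdump handles vastly more opcodes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Toy disassembler for exactly two encodings:
//   b8 imm32 -> mov eax,imm32   (constant stored low byte first)
//   c3       -> ret
// Any other byte is reported as "(unknown)".
std::vector<std::string> disassemble(const std::vector<std::uint8_t>& code) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < code.size()) {
        if (code[i] == 0xb8 && i + 4 < code.size()) {
            // Reassemble the 32-bit constant from its four bytes.
            std::uint32_t imm = 0;
            for (int b = 0; b < 4; ++b) {
                imm |= static_cast<std::uint32_t>(code[i + 1 + b]) << (8 * b);
            }
            char text[32];
            std::snprintf(text, sizeof text, "mov eax,0x%x",
                          static_cast<unsigned>(imm));
            out.push_back(text);
            i += 5;  // one opcode byte plus four constant bytes
        } else if (code[i] == 0xc3) {
            out.push_back("ret");
            i += 1;
        } else {
            out.push_back("(unknown)");
            i += 1;
        }
    }
    return out;
}
```

Feeding it the six bytes b8 07 00 00 00 c3 from the listing produces the two lines "mov eax,0x7" and "ret".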

One way to learn some of the details of assembly language on a new machine is to use a "Disassembler" to see what the compiler generates from your C++ code. (In NetRun, Options -> Actions -> Disassemble, then run.) For example, given this C++ code:

long foo() {
	return 7;
}

(Try this in NetRun now!)

We can compile this using: g++ code.c -c -fomit-frame-pointer

We can then disassemble it with: objdump -drC -M intel code.o

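This produces the disassembly shown above. Notice the compiler wrote the 7 into eax, the 32-bit lower half of rax. On a 64-bit x86 CPU, writing to eax also zeroes the upper half of rax, so the effect is the same as "mov rax,7" but the machine code is shorter.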