Assembly Language

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

Recall that when we write C++ code, the compiler transforms it into executable machine code that actually runs on the CPU hardware. Assembly language is a human-readable way of writing machine code: each line of assembly corresponds to one machine code instruction.

One way to start learning assembly language is to use a "Disassembler" to see what the compiler generates from your code. (In NetRun, Options -> Actions -> Disassemble, then run.) For example, given this C++ code:

long foo() {
	return 7;
}

(Try this in NetRun now!)

We can compile this using: g++ code.c -c -fomit-frame-pointer

We can then disassemble it with: objdump -drC -M intel code.o

code.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <foo>:
   0:	b8 07 00 00 00       	mov    eax,0x7
   5:	c3                   	ret

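In each line of the listing, the number on the left is the byte offset, the hex bytes in the middle are the machine code, and the stuff on the right, starting with "mov" and "ret", is assembly language.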
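To make the bytes-to-assembly correspondence concrete, here is a minimal C++ sketch of a toy disassembler. The function name toy_disassemble is made up for illustration, and it only understands the two opcodes that appear in the listing above (0xb8 and 0xc3); a real disassembler like objdump handles far more.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy decoder, for illustration only.
// 0xB8 = "mov eax, constant" (followed by a 4-byte constant, lowest byte first)
// 0xC3 = "ret"
std::vector<std::string> toy_disassemble(const std::vector<std::uint8_t> &code) {
    std::vector<std::string> lines;
    std::size_t i = 0;
    while (i < code.size()) {
        if (code[i] == 0xB8 && i + 4 < code.size()) {
            // Rebuild the 32-bit constant from its four little-endian bytes.
            std::uint32_t imm = 0;
            for (int b = 0; b < 4; b++)
                imm |= std::uint32_t(code[i + 1 + b]) << (8 * b);
            lines.push_back("mov eax," + std::to_string(imm));
            i += 5; // one opcode byte plus four constant bytes
        } else if (code[i] == 0xC3) {
            lines.push_back("ret");
            i += 1;
        } else {
            lines.push_back("(unknown byte)");
            i += 1;
        }
    }
    return lines;
}
```

Feeding it the six bytes b8 07 00 00 00 c3 from the listing gives back the two lines "mov eax,7" and "ret".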

We can take this assembly language code, convert it to machine code using an "assembler" like nasm, and run it. This works!

mov eax,7
ret

(Try this in NetRun now!)

(NetRun takes care of the function setup, which we'll get to in the next few weeks.)

Formatting

Unlike C/C++, assembly instruction and register names are not case sensitive, so "mov eax,7" and "mOv EaX, 7" are equivalent. (Names you define yourself, such as labels, are still case sensitive in NASM.)

A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly! I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:

	mov ecx, 5;  mov eax, 3;   Whoops!

It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
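Here is a rough C++ sketch of what the assembler effectively does with each line before assembling it. The helper name strip_comment is made up for illustration, and it ignores details a real assembler has to handle, such as semicolons inside quoted strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the part of one source line that would
// actually be assembled, discarding everything from the first ';' onward.
std::string strip_comment(const std::string &line) {
    std::size_t semi = line.find(';');
    if (semi == std::string::npos) return line; // no comment on this line
    return line.substr(0, semi);                // keep only the text before ';'
}
```

Calling strip_comment("mov ecx, 5;  mov eax, 3;   Whoops!") leaves just "mov ecx, 5", so eax never gets set.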

Unlike C/C++, assembly is line-oriented: each instruction must fit entirely on one line, so the following WILL NOT WORK:

	mov eax,
	         5

Line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text!

Instructions

Assembly is a very strange language, designed mostly around the machine it runs on, not around the programmer. For example, "mov" and "ret" are instructions for the CPU to execute. You can't add new instructions without changing the CPU; for example, Intel added the instruction "aesenc" (one round of AES encryption) in 2010. Hundreds of instructions have been added over the years, but some commonly used ones are:

mov, move data.
ret, return from function.
add, addition.
sub, subtraction.
imul, multiply.
jmp, execute code elsewhere.
cmp, compare values.
jl, jump if less than.

We'll be working our way through these instructions this week!
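As a preview of how the compare and jump instructions fit together, here is a C++ sketch that restricts itself to one simple operation per statement and uses goto in place of a loop. The function name and the variable names (borrowed from registers) are just for illustration; this is an analogy, not compiler output.

```cpp
// Adds up 0+1+2+3+4 using only assembly-style steps.
int sum_below_five() {
    int eax = 0; // running total
    int ecx = 0; // loop counter
loop_top:
    eax += ecx;                 // like an add of two registers
    ecx += 1;                   // like an add of a constant
    if (ecx < 5) goto loop_top; // like a compare, then a jump if less than
    return eax;                 // like ret: the answer goes back to the caller
}
```

The "if ... goto" line is the interesting one: in assembly the comparison and the conditional jump are two separate instructions.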

Registers

In assembly you don't have variables; instead, you operate on data in registers. A register is a tiny piece of memory hardware inside the CPU, with a fixed size. When the CPU executes a line like "mov eax,7" it stores the constant 7 into the register eax, which is 32 bits wide, the usual width of an "int" in C or C++. Just like most C++ programs spend their time shuffling values between variables, most assembly programs spend their time shuffling values between registers.

Here are some of the more friendly, easy to use 32-bit registers, and who uses them. (There are also other registers, such as "ebx", with other purposes that we'll be covering eventually.)

Notes	32-bit
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too.	eax
Scratch register. Some instructions use it as a counter (such as SAL or REP).	ecx
Scratch register. Multiply instructions put the high bits of the result here.	edx
Scratch register. Function argument #1 in 64-bit Linux.	edi
Scratch register. Also used to pass function argument #2 in 64-bit Linux.	esi
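The "low bits" and "high bits" in the table can be sketched in C++. This models the unsigned one-operand form of multiply, where the full 64-bit product of two 32-bit values is split across two registers; the two-operand form used later in these notes keeps only the low 32 bits. The struct and function names here are made up for illustration.

```cpp
#include <cstdint>

struct MulResult {
    std::uint32_t low;  // the half that would land in eax
    std::uint32_t high; // the half that would land in edx
};

// Multiply two 32-bit values and split the 64-bit product in half.
MulResult widening_multiply(std::uint32_t a, std::uint32_t b) {
    std::uint64_t product = std::uint64_t(a) * b; // widen first so nothing is lost
    return { std::uint32_t(product & 0xFFFFFFFFu), std::uint32_t(product >> 32) };
}
```

For small inputs like 3 and 7 the high half is just zero; it only matters once the product no longer fits in 32 bits, as with 0x10000 times 0x10000.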

The big problem with registers is they're in *hardware*: you're stuck with the existing names and sizes, and every function has to share them, just like global variables. If you made up a new language where there are only five global variables, with weird hardcoded names, you'd be laughed straight to the HR office to be fired!
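To see why sharing is a problem, here is a C++ sketch that models the five registers as global variables. The functions helper and caller are hypothetical; the point is only that one function's scratch work silently overwrites a value another function was counting on.

```cpp
#include <cstdint>

// Illustration only: five "registers" modeled as global variables.
std::uint32_t eax, ecx, edx, edi, esi;

void helper() {
    ecx = 100;     // helper uses ecx as scratch space
    eax = ecx + 1; // and leaves its result in eax
}

std::uint32_t caller() {
    ecx = 5;    // caller wants to keep 5 in ecx...
    helper();   // ...but helper overwrote it
    return ecx; // now 100, not 5
}
```

Real assembly code avoids this with agreed-upon conventions about who may overwrite which register.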

One caution: if you see assembly where the register names have a percent sign in front of them, like "%eax", you're probably looking at GNU/AT&T syntax, which annoyingly writes the operands in the reverse order (source first, destination last) from the Intel syntax we'll be using.

Arithmetic In Assembly

Here's how you add two numbers in assembly:

Put the first number into a register
Put the second number into a register
Add the two registers
Return the result

Here's the C/C++ equivalent:

int a = 3;
int c = 7;
a += c;
return a;

And finally here's the assembly code:

mov eax, 3
mov ecx, 7
add eax, ecx
ret

Here are some x86 arithmetic instructions. Note that each takes at most two operands: the destination, which also serves as the first input, and the source. ("not" takes only one.)

Opcode	C++	Example
add	+	add eax,ecx
sub	-	sub eax,ecx
imul	*	imul eax,ecx
and	&	and eax,ecx
or	|	or eax,ecx
xor	^	xor eax,ecx
not	~	not eax
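Here is a small C++ sketch that mirrors the table: each opcode name maps onto the matching C++ compound assignment, with the destination doubling as the first input. The function name apply_opcode is made up for illustration, and unsigned 32-bit values stand in for the registers.

```cpp
#include <cstdint>
#include <string>

// Returns the new value of eax after the named two-operand instruction.
std::uint32_t apply_opcode(const std::string &opcode,
                           std::uint32_t eax, std::uint32_t ecx) {
    if      (opcode == "add")  eax += ecx;
    else if (opcode == "sub")  eax -= ecx;
    else if (opcode == "imul") eax *= ecx; // keeps only the low 32 bits
    else if (opcode == "and")  eax &= ecx;
    else if (opcode == "or")   eax |= ecx;
    else if (opcode == "xor")  eax ^= ecx;
    else if (opcode == "not")  eax = ~eax; // one operand: ecx is ignored
    return eax;
}
```

With eax = 12 (binary 1100) and ecx = 10 (binary 1010), "and" gives 8, "or" gives 14, and "xor" gives 6.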

Be careful doing these! Assembly is *line* oriented, so you can't say anything like this:
add edx, eax-ecx ; won't work
add edx,(sub eax,ecx) ; won't work

but you can say:
sub eax,ecx
add edx,eax

In assembly, arithmetic has to be broken down into one operation at a time, one instruction per line. This is actually a limitation of the instruction set, a reflection of the hardware available at the time Intel designed these instructions.
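The same discipline can be practiced in C++ by allowing yourself only one operation per statement. Here is a sketch that evaluates (a+b)*(c-d) that way; the function name is made up, the variables are named after registers purely for illustration, and the comments describe the kind of instruction each line resembles rather than literal assembly.

```cpp
// Computes (a + b) * (c - d), one operation per line.
int sum_times_difference(int a, int b, int c, int d) {
    int eax = a; // like mov: load the first value
    eax += b;    // like add: eax now holds a+b
    int ecx = c; // like mov: a second register for the other half
    ecx -= d;    // like sub: ecx now holds c-d
    eax *= ecx;  // like imul: combine the two halves
    return eax;  // like ret: the result is in eax
}
```

Notice that the second half of the expression needs its own register, because eax is busy holding the first half.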