mov rax,7
ret
"mov" is an instruction that moves values around.
Here we're moving the value 7 into the register "rax". Where any sane language would use variables, assembly language uses registers. The names and number of registers are baked into the design of the instruction set, and can't be changed.
The next instruction is "ret". It returns from the current function. By convention, the value returned is the value you stored in rax.
Here's a slightly more complex program:
mov rax,3
add rax,5
ret
This returns 8, because we've moved 3 into rax, then added 5 to it. In NetRun, hit "Options -> TraceASM" to see the changes to the register values after each line.
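To connect this with C++, here is a rough sketch of the same three instructions, using an ordinary variable named rax to stand in for the register. The function name run_program is made up for this example, and the real register lives inside the CPU rather than in a variable.

```cpp
#include <cstdint>

// Hypothetical stand-in for the three-line assembly program above.
// The variable "rax" only imitates the real 64-bit register.
constexpr std::int64_t run_program() {
    std::int64_t rax = 3;  // mov rax,3
    rax += 5;              // add rax,5
    return rax;            // ret (the caller finds the result in rax)
}

// Checked at compile time: 3 + 5 leaves 8 in "rax".
static_assert(run_program() == 8, "3 + 5 should leave 8 in rax");
```

Each C++ statement lines up with exactly one line of assembly, which is roughly what TraceASM shows you one step at a time.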
One full assembly language instruction consists of an 'opcode' like 'mov' that says what to do, and a list of 'operands' like registers or immediate constants.
The most common assembler error is "invalid combination of opcode and operands", which means the hardware doesn't support your operand list.
Unlike C/C++, assembly language opcodes and register names are not case sensitive. This means "mov rax,7" and "mOv RaX, 7" are equivalent.
A semicolon indicates the start of a comment. Unlike in C/C++/Java/C#/..., semicolons are OPTIONAL in assembly! I usually leave off the semicolons for lines without comments, because otherwise I find myself tempted to do this:
mov rcx, 5; mov rax, 3; Whoops!
It doesn't look like it, but the semicolon makes that second instruction A COMMENT!
Unlike C/C++, assembly is line-oriented: each instruction has to fit on one line, with a newline after it, so the following WILL NOT WORK:
mov rax,
5
Line-oriented stuff is indeed annoying. Be careful that your editor doesn't mistakenly add newlines to long lines of text!
In assembly you don't have variables, but operate on data in registers. A register is actually a tiny piece of memory hardware inside the CPU, with a fixed size. When the CPU executes a line like "mov rax,7" it stores the constant 7 into the register rax, which is 64 bits wide, the width of a "long int" in C or C++ on 64-bit Linux (on 64-bit Windows, "long int" is only 32 bits, and you need "long long" to get 64). Just like most C++ programs spend their time shuffling values between variables, most assembly programs spend their time shuffling values between registers.
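Here's a small sketch of what "fixed size" means in practice, using std::uint64_t to play the part of a 64-bit register. The helper name add_one_to_biggest is made up for this example.

```cpp
#include <climits>
#include <cstdint>
#include <limits>

// std::uint64_t stands in for a 64-bit register like rax.
static_assert(sizeof(std::uint64_t) * CHAR_BIT == 64,
              "a 64-bit register holds exactly 64 bits");

// Hypothetical helper: what is left in "rax" after adding 1 to the
// biggest value that fits in 64 bits?
constexpr std::uint64_t add_one_to_biggest() {
    std::uint64_t rax = std::numeric_limits<std::uint64_t>::max();  // all 64 bits set
    rax += 1;  // like "add rax,1": the result needs a 65th bit, which is dropped
    return rax;
}

static_assert(add_one_to_biggest() == 0, "the register wraps around to zero");
```

The register can't grow to hold the bigger answer, so the value silently wraps around to zero.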
Here are some of the more friendly, easy to use registers, and who uses them.
Notes | 64-bit register |
Values are returned from functions in this register. Multiply instructions put the low bits of the result here too. | rax |
Scratch register. Some instructions use it as a counter (such as SAL or REP). | rcx |
Scratch register. Multiply instructions put the high bits of the result here. | rdx |
Scratch register. Function argument #1 in 64-bit Linux. | rdi |
Scratch register. Also used to pass function argument #2 in 64-bit Linux. | rsi |
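To see why a multiply needs two registers, here is a sketch of a 64-bit by 64-bit multiply whose 128-bit product is split into a high half and a low half, the way the table describes for rdx and rax. This is an assumption-heavy model: it imitates the one-operand unsigned multiply ("mul"), the names RdxRax and widening_multiply are made up, and unsigned __int128 is a GCC/Clang extension rather than standard C++.

```cpp
#include <cstdint>

// GCC/Clang extension, not standard C++ (an assumption of this sketch).
__extension__ typedef unsigned __int128 u128;

// Hypothetical pair of "registers" holding one 128-bit product.
struct RdxRax {
    std::uint64_t rdx;  // high 64 bits of the product
    std::uint64_t rax;  // low 64 bits of the product
};

// Multiply two 64-bit values and split the full 128-bit result in half.
constexpr RdxRax widening_multiply(std::uint64_t a, std::uint64_t b) {
    u128 product = static_cast<u128>(a) * b;
    return RdxRax{static_cast<std::uint64_t>(product >> 64),
                  static_cast<std::uint64_t>(product)};
}

// Small products fit entirely in the low half.
static_assert(widening_multiply(3, 5).rax == 15, "low bits hold 15");
static_assert(widening_multiply(3, 5).rdx == 0, "high bits are zero");
```

For small numbers rdx just ends up zero; it only matters once the product no longer fits in 64 bits, for example 2^63 times 4 leaves 2 in the high half and 0 in the low half.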
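The full 64-bit rax register also has a 32-bit lower half called eax, which we'll cover later. There are also other less friendly registers, such as "rbx", that the function calling you expects to get back unchanged: if you overwrite one and don't restore it before you return, your program may crash.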
The big problem with registers is they're in *hardware*: you're stuck with the existing names and sizes, and every function has to share them, just like global variables. If you made up a new language where there are only five global variables, with weird hardcoded names, you'd be laughed straight to the HR office to be fired!
Assembly is a very strange language, designed mostly around the machine it runs on, not around the programmer. For example, "mov" and "ret" are instructions for the CPU to execute. You can't add new instructions without changing the CPU; for example, Intel added the instruction "aesenc" (AES encryption) in 2010. There are hundreds of instructions added over the years, but some commonly used instructions are:
Opcode | C++ | Examples |
add | + | add rcx,13 add rax,rcx |
sub | - | sub rax,1 sub rax,rcx |
imul | * | imul rax,rcx |
and | & | and rax,rcx |
or | | | or rax,rcx |
xor | ^ | xor rax,rcx |
not | ~ | not rax |
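Note that almost all of these take just two operands: the destination register, which comes first and gets overwritten with the result, and the source, which can be a register or a constant. The exception is "not", which takes a single register and flips its bits in place.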
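One way to read the table is that every instruction is a C++ compound assignment on the destination. Here's a sketch that runs each opcode once, with variables named rax and rcx standing in for the registers. The function name opcode_demo and the starting values 3 and 13 are made up for illustration.

```cpp
#include <cstdint>

// Hypothetical walk through the opcode table, one instruction per line.
// The comments show the value in "rax" after each step.
constexpr std::uint64_t opcode_demo() {
    std::uint64_t rax = 3;    // mov rax,3
    std::uint64_t rcx = 13;   // mov rcx,13
    rax += rcx;               // add rax,rcx   -> 16
    rax -= 1;                 // sub rax,1     -> 15
    rax *= rcx;               // imul rax,rcx  -> 195
    rax ^= rcx;               // xor rax,rcx   -> 206
    rax &= rcx;               // and rax,rcx   -> 12
    rax |= rcx;               // or rax,rcx    -> 13
    rax = ~rax;               // not rax       -> 0xFFFFFFFFFFFFFFF2
    return rax;               // ret
}

static_assert(opcode_demo() == 0xFFFFFFFFFFFFFFF2ull, "final value of rax");
```

Notice that rcx never changes: only the first operand, the destination, gets written.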
Be careful doing these! Assembly is *line* oriented, so you can't say anything like this:
add rdx, rax-rcx ; won't work, need to do the subtraction on a second line
add rdx,(sub rax,rcx) ; won't work, parentheses don't let you have more than one instruction per line
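The fix is to do one operation per line. Here's a sketch in C++ of what the working two-instruction version computes, with variables standing in for the registers. The function name add_difference is made up for this example.

```cpp
#include <cstdint>

// Hypothetical model of computing rdx = rdx + (rax - rcx)
// using one operation per line, as assembly requires.
constexpr std::int64_t add_difference(std::int64_t rdx, std::int64_t rax,
                                      std::int64_t rcx) {
    rax -= rcx;   // sub rax,rcx  (rax now holds the difference)
    rdx += rax;   // add rdx,rax
    return rdx;
}

static_assert(add_difference(10, 7, 3) == 14, "10 + (7 - 3) is 14");
```

One side effect to watch for: the subtraction overwrites rax, so the original value of rax is gone afterward.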
One caution: if you see some assembly where the register names have a percent sign in front of them, like "%rax", you're probably looking at the GNU AT&T syntax (common on Linux), which annoyingly puts the operands in the reverse order from the Intel syntax we'll be using (common on Windows).
Here's how you add two numbers in assembly. First, here's the C/C++ equivalent:
int a = 3;
int c = 7;
a += c;
return a;
And finally here's the assembly code:
mov rax, 3
mov rcx, 7
add rax, rcx
ret
When we write C++ code, the compiler transforms it into executable machine code that actually runs on the CPU hardware. Machine code is a block of binary data that is line-for-line equivalent to assembly language:
0000000000000000 <foo>:
   0: b8 07 00 00 00     mov eax,0x7
   5: c3                 ret
   Machine Code (in hex) Assembly Language
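To make the connection between the hex bytes and the assembly concrete, here is a toy decoder sketch that understands only the two instructions in that listing: byte 0xb8 means "mov eax," followed by a 4-byte constant stored lowest byte first, and byte 0xc3 means "ret". The names foo_code and run_toy are made up, and a real CPU or disassembler handles hundreds more cases.

```cpp
#include <cstddef>
#include <cstdint>

// The six machine code bytes from the listing above.
constexpr std::uint8_t foo_code[] = {0xb8, 0x07, 0x00, 0x00, 0x00, 0xc3};

// Toy decoder: knows only "mov eax,constant" (0xb8) and "ret" (0xc3).
// Returns the value in eax at the ret, or -1 if it hits anything else.
constexpr std::int64_t run_toy(const std::uint8_t *code, std::size_t size) {
    std::uint32_t eax = 0;
    std::size_t i = 0;
    while (i < size) {
        if (code[i] == 0xb8 && i + 4 < size) {
            // mov eax,imm32: the next 4 bytes are the constant, low byte first
            eax = static_cast<std::uint32_t>(code[i + 1])
                | static_cast<std::uint32_t>(code[i + 2]) << 8
                | static_cast<std::uint32_t>(code[i + 3]) << 16
                | static_cast<std::uint32_t>(code[i + 4]) << 24;
            i += 5;  // one opcode byte plus four constant bytes
        } else if (code[i] == 0xc3) {
            return eax;  // ret: hand back whatever is in eax
        } else {
            return -1;   // a byte this toy does not understand
        }
    }
    return -1;  // ran off the end without seeing a ret
}

static_assert(run_toy(foo_code, sizeof(foo_code)) == 7, "foo returns 7");
```

The offsets in the listing fall out of this too: the mov takes bytes 0 through 4, so the ret sits at offset 5.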
An assembler converts assembly source code into binary machine code so you can run it on the CPU.
A disassembler converts a runnable program (binary machine code) into human readable assembly.
One way to learn some of the details of assembly language on a new machine is to use a "Disassembler" to see what the compiler generates from your C++ code. (In NetRun, Options -> Actions -> Disassemble, then run.) For example, given this C++ code:
long foo() { return 7; }
We can compile this using: g++ code.c -c -fomit-frame-pointer
We can then disassemble it with: objdump -drC -M intel code.o
This produces the disassembly shown above.