The ARM Instruction Set

CS 301 Lecture, Dr. Lawlor

Basically every cellphone on the planet currently uses an ARM processor, an inexpensive and energy-efficient microprocessor.  The design dates back to the 1980s, when it was the "Acorn RISC Machine" and everybody was building a RISC processor.

There's a pretty good summary of all ARM instructions, including VFP ones, over at HeyRick.  Regarding registers, briefly:
Register  AKA  Use
r0             Return value, first function argument
r1-r3          Function arguments and general scratch
r4-r11         Saved registers (a function must hand these back unchanged)
r12       ip   Intra-procedure scratch register; the linker may use it in glue code it inserts between functions, but that's rare
r13       sp   Stack pointer, a pointer to the most recently pushed item on the stack (the stack grows toward lower addresses).  Moved by push and pop.
r14       lr   Link register, storing the address to return to when the function is done.  Written by "bl" (branch and link, like function call), often saved with a push/pop sequence, read by "bx lr" (branch to link register) or the pop.
r15       pc   Program counter, the current memory address being executed.  It's very unusual, but handy, to have the program counter just be another register--for example, you can do program counter relative addressing very easily, by just loading from [pc,#offset].

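Like 64-bit x86, you need to keep the stack aligned when you call a function, but only to 8 bytes (not 16 bytes).  Strictly speaking the standard asks for this at every call to a public function; in practice it mostly bites when the function you call uses floating point or other 64-bit data.  All the gory details are in the Procedure Call Standard for the ARM Architecture (AAPCS), if you care.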

ARM Examples

Return a small constant.  Like PowerPC, a 32-bit instruction doesn't have enough bits left over to hold an arbitrary 32-bit constant.  Unlike PowerPC, instead of using 16-bit constants, ARM chose to combine an 8-bit constant with a 4-bit rotate(!): the 8-bit value gets rotated right by twice the 4-bit amount.  For bigger values, just load them from memory as shown below.
mov r0,#17   @ r0 is return value register
bx lr @ return from function

(Try this in NetRun now!)
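To get a feel for which constants fit the 8-bit-plus-rotate scheme, here's a small Python sketch (not ARM code, and not anything from the assembler itself) that searches for a valid encoding.  The function names are my own; the only thing it assumes is the rule that the constant equals an 8-bit value rotated right by twice a 4-bit amount.

```python
def rotate_left32(value, bits):
    """Rotate a 32-bit value left by the given number of bits."""
    bits %= 32
    return ((value << bits) | (value >> (32 - bits))) & 0xFFFFFFFF

def arm_immediate(value):
    """Return (imm8, rot4) such that value == imm8 rotated right by 2*rot4,
    or None if no such pair exists (the constant won't fit in a mov)."""
    value &= 0xFFFFFFFF
    for rot4 in range(16):
        # Undo the rotate-right by rotating left the same amount.
        imm8 = rotate_left32(value, 2 * rot4)
        if imm8 < 256:
            return imm8, rot4
    return None

print(arm_immediate(17))          # (17, 0): fits directly
print(arm_immediate(0xFF000000))  # (255, 4): 0xFF rotated right by 8 bits
print(arm_immediate(0x12345678))  # None: too many scattered bits
```

Anything that comes back None has to be built in several instructions or loaded from memory.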

Save some registers, and do some three-operand arithmetic.
push {r4-r7,lr}

mov r4,#10
mov r5,#100
add r0,r4,r5

pop {r4-r7,pc} @ interesting hack: pop into the program counter to return from function

(Try this in NetRun now!)
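If the pop-into-pc trick seems like magic, this little Python model of push and pop may help.  It's a sketch under simple assumptions: registers live in a dictionary, memory is a dictionary of 4-byte words, and the starting addresses are made up.  The one real ARM rule it relies on is that push and pop always put the lowest-numbered register at the lowest address.

```python
REG_ORDER = ["r%d" % i for i in range(13)] + ["sp", "lr", "pc"]

def push(regs, mem, names):
    """Model of 'push {names}': move sp down, then store the registers."""
    order = sorted(names, key=REG_ORDER.index)
    regs["sp"] -= 4 * len(order)
    for i, name in enumerate(order):
        mem[regs["sp"] + 4 * i] = regs[name]

def pop(regs, mem, names):
    """Model of 'pop {names}': load the registers, then move sp back up."""
    order = sorted(names, key=REG_ORDER.index)
    for i, name in enumerate(order):
        regs[name] = mem[regs["sp"] + 4 * i]
    regs["sp"] += 4 * len(order)

# Hypothetical starting state: lr holds our caller's return address.
regs = {"r4": 1, "r5": 2, "r6": 3, "r7": 4,
        "sp": 0x1000, "lr": 0x8000, "pc": 0}
mem = {}
push(regs, mem, ["r4", "r5", "r6", "r7", "lr"])
regs["r4"] = 10                                  # trash a saved register
pop(regs, mem, ["r4", "r5", "r6", "r7", "pc"])   # the saved lr lands in pc
print(hex(regs["pc"]), regs["r4"], hex(regs["sp"]))
```

The slot that push filled from lr is the same slot pop reads into pc, so the pop restores the saved registers and returns in one instruction.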

Call a function.
push {lr}   @ must save link register if we call our own function

mov r0,#123 @ r0 is first function parameter
bl print_int @ branch-and-link (exactly like PowerPC)

pop {pc} @ interesting hack: pop into the program counter to return from function

(Try this in NetRun now!)

Memory addressing is a little weird.  ARM is a load/store machine, so arithmetic instructions never touch memory directly: you first get the memory address into a register, then do the actual memory access with ldr or str.
adr r2,mydata        @ r2 is our memory address (program counter relative)
ldr r0,[r2] @ actually load data
bx lr

mydata:
.word 123

(Try this in NetRun now!)
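Under the hood, adr is just an add or subtract on the program counter.  Here's the arithmetic as a tiny Python sketch.  It assumes classic 32-bit ARM mode, where reading pc gives the address of the current instruction plus 8 (Thumb mode differs, and I'm ignoring it here).  The base address and the name adr_offset are made up for illustration.

```python
PC_READ_AHEAD = 8  # assumption: ARM mode, pc reads as instruction address + 8

def adr_offset(instr_addr, label_addr):
    """Offset the assembler would add to pc so that 'adr' produces label_addr."""
    return label_addr - (instr_addr + PC_READ_AHEAD)

base = 0x8000           # hypothetical address of the adr instruction
mydata = base + 3 * 4   # adr, ldr and bx take 4 bytes each, then comes mydata
print(adr_offset(base, mydata))  # prints 4
```

So in the example above, the adr would come out as something like "add r2,pc,#4".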

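There's also a similar syntax using an equals sign, although to me it's more confusing.  "ldr r2,=mydata" is a pseudo-instruction: the assembler stashes the address of mydata in a nearby literal pool and loads it from there, so unlike adr it works for any 32-bit address or constant.  It's not just a GNU thing; ARM's own assembler accepts it too.
ldr r2,=mydata        @ r2 is our memory address (fetched from a literal pool)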
ldr r0,[r2] @ actually load data
bx lr

mydata:
.word 123

(Try this in NetRun now!)

Here we're loading the address of an array to use as a function argument.
push {lr}     @ must save lr since we call a function

adr r0,mydata @ first parameter: array memory address (program counter relative)
mov r1,#2 @ second parameter: array length
bl iarray_print

pop {pc} @ function return

mydata:
.word 123
.word 456

(Try this in NetRun now!)

Generally, ARM integer instructions are similar to PowerPC.

Floating Point via VFP

For floating point registers, ARM uses a fairly standard even-odd division to store single and double precision floats in the same storage.  This means "D0" stores one double, or you can store two single precision floats in "S0" and "S1" using the same bits.  Similarly, D1 overlaps S2 and S3.  See the ARM floating point register diagram.  Here's an ARM assembly example where we load up a constant, add it to itself, and store it back to memory for printing:
push {r4,lr}    @ (note: we push r4 too, just for 8-byte stack alignment)
sub sp,sp, 32 @ make plenty of space on the stack

adr r0,.myfloats @ makes r0 point to myfloats
flds s0,[r0] @ load single-precision float (from constant below)
fadds s0,s0,s0 @ add to itself
fsts s0,[sp] @ store out to the stack

mov r0,sp @ location of floats to print
mov r1,1 @ number of floats to print
bl farray_print @ print some floats (FAILS if stack is not 8-byte aligned!)

add sp,sp,32 @ hand back stack space
pop {r4,pc} @ restore link register, and return

.myfloats: @ Note that this is read-only constant space (segfault on store!)
.word 0x3F9E0419 @ floating point 1.2345
@ Generate constants above via C++: "float x=10.0; return *(int *)&x;"

(Try this in NetRun now!)
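The even-odd overlap between the S and D registers is easy to play with off the chip.  Here's a Python sketch that treats one D register as 8 raw bytes and reinterprets them as two singles, assuming the usual little-endian layout where the even-numbered S register is the low half.  The function names are my own.

```python
import struct

def s_pair_from_d(d_value):
    """Reinterpret the 8 bytes of a double as (even S register, odd S register)."""
    raw = struct.pack("<d", d_value)
    return struct.unpack("<ff", raw)

def d_from_s_pair(s_even, s_odd):
    """Reinterpret two singles (low half first) as one double."""
    raw = struct.pack("<ff", s_even, s_odd)
    return struct.unpack("<d", raw)[0]

print(s_pair_from_d(1.0))        # (0.0, 1.875): what S0 and S1 see when D0 = 1.0
print(d_from_s_pair(0.0, 1.875)) # 1.0: the same bits read back as a double
```

The takeaway is that writing S0 or S1 silently changes D0, so a function needs to pick one view of those bits at a time.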

(Note: I just added ".syntax unified" to NetRun's boilerplate code, so you no longer need # in front of constants.)

ARM offers a very interesting "rotating register banks" vector setup.  Bank 0 (registers D0-D3, or S0-S7) always holds plain scalar values, but if you set the funky FPSCR LEN field to a nonzero value, then Banks 1 through 3 can operate in vector mode.  The field stores the vector length minus one, so a LEN field of 3 means vectors of 4 elements.  With the vector length set to 4, for example, an operation like
  
  fadds s8,s8,s16

actually adds four floats: S8+=S16; S9+=S17; S10+=S18; and S11+=S19; 
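Here's a rough Python model of that behavior, in case you want to experiment before touching real hardware.  It's a simplified sketch: stride is fixed at 1, the register file is a list of 32 Python floats (doubles, not true singles), and the function names are my own.  The bank rules it models are that a destination in Bank 0 means a scalar operation, a second operand in Bank 0 is shared by every element, and register numbers wrap around inside their bank.

```python
BANK_SIZE = 8  # single-precision registers per bank: s0-s7, s8-s15, ...

def step(reg, i):
    """Register i steps past reg, wrapping around inside reg's bank."""
    bank_start = reg - reg % BANK_SIZE
    return bank_start + (reg - bank_start + i) % BANK_SIZE

def fadds(s, d, n, m, length=1):
    """Model of 'fadds s<d>,s<n>,s<m>' with the given vector length."""
    if d < BANK_SIZE:
        length = 1  # destination in Bank 0: always a scalar operation
    for i in range(length):
        # A second operand in Bank 0 is a scalar shared by all elements.
        mi = m if m < BANK_SIZE else step(m, i)
        s[step(d, i)] = s[step(n, i)] + s[mi]

regs = [float(i) for i in range(32)]  # s0=0.0, s1=1.0, ... just as test data
fadds(regs, 8, 8, 16, length=4)       # fadds s8,s8,s16 with vector length 4
print(regs[8:12])                     # [24.0, 26.0, 28.0, 30.0]
```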

This ability to mix and match vector operations (on Banks 1-3) and scalar operations (in Bank 0) is quite handy, although I don't like having to store the vector length in LEN.   Loads and stores never go vector according to LEN, but FLDM/FSTM can load multiple registers already.
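The LEN and STRIDE fields are just a few bits inside FPSCR, so the bit twiddling can be checked in Python first.  This sketch assumes the VFP layout where LEN sits in bits 16 through 18 and holds the length minus one, and STRIDE sits in bits 20 and 21 with 0 meaning stride 1 and 3 meaning stride 2.  The helper names are my own.

```python
LEN_SHIFT, LEN_MASK = 16, 0x7        # bits 18..16, holds (vector length - 1)
STRIDE_SHIFT, STRIDE_MASK = 20, 0x3  # bits 21..20, 0 = stride 1, 3 = stride 2
CLEAR_MASK = (LEN_MASK << LEN_SHIFT) | (STRIDE_MASK << STRIDE_SHIFT)

def set_vector_mode(fpscr, length, stride=1):
    """Return fpscr with LEN and STRIDE replaced; other bits are untouched."""
    if not 1 <= length <= 8:
        raise ValueError("vector length must be 1..8")
    if stride not in (1, 2):
        raise ValueError("stride must be 1 or 2")
    fpscr &= ~CLEAR_MASK & 0xFFFFFFFF
    fpscr |= (length - 1) << LEN_SHIFT
    if stride == 2:
        fpscr |= 0x3 << STRIDE_SHIFT
    return fpscr

def vector_length(fpscr):
    """Decode the vector length stored in fpscr."""
    return ((fpscr >> LEN_SHIFT) & LEN_MASK) + 1

print(hex(CLEAR_MASK))             # 0x370000: the mask to clear both fields
print(hex(set_vector_mode(0, 4)))  # 0x30000: length 4, stride 1
```

Setting the length back to 1 with stride 1 zeroes both fields, which is why a single clear is enough to get back to scalar mode.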

Here's an example of using LEN=4 vectors:
push {r4,lr}    @ (note: we push r4 too, just for 8-byte stack alignment)
sub sp,sp, 32 @ make plenty of space on the stack

@ Enter vector compute mode
FMRX r12,FPSCR @ copy FPSCR into r12
BIC r12,r12,#0x00370000 @ clears STRIDE and LEN
ORR r12,r12,#0x00030000 @ leaves STRIDE = 1, sets LEN field to 3 (vector length 4)
FMXR FPSCR,r12 @ copy r12 back into FPSCR

adr r0,.myfloats @ makes r0 point to myfloats
fldmias r0,{s8-s11} @ load four single-precision floats (from constants below)
fadds s8,s8,s8 @ add *four* floats (from LEN above)
fstmias sp,{s8-s11} @ store four single-precision floats (to the stack)

@ Leave vector compute mode
BIC r12,r12,#0x00370000 @ clears both fields: back to STRIDE = 1 and vector length 1 (scalar)
FMXR FPSCR,r12 @ copy r12 back into FPSCR

mov r0,sp @ location of floats to print
mov r1,4 @ number of floats to print
bl farray_print @ print some floats (FAILS if stack is not 8-byte aligned!)

add sp,sp,32 @ hand back stack space
pop {r4,pc} @ restore link register, and return

.myfloats: @ Note that this is read-only constant space (segfault on store!)
.word 0x3F9E0419 @ floating point 1.2345
.word 0x42C80000 @ floating point 100.0
.word 0x41200000 @ floating point 10.0
.word 0x4048F5C3 @ floating point 3.14
@ Generate constants above via C++: "float x=10.0; return *(int *)&x;"

(Try this in NetRun now!)
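If you'd rather not compile C++ just to get those hex constants, Python's struct module can produce the same single-precision bit patterns.  This is only a convenience sketch, and float_bits is a name I made up.

```python
import struct

def float_bits(x):
    """Return the 32-bit IEEE single-precision bit pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

for value in (1.2345, 100.0, 10.0, 3.14):
    # Prints lines ready to paste after a label, like ".word 0x41200000".
    print(".word 0x%08X @ floating point %s" % (float_bits(value), value))
```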

Generally, the vector operations seem to be quite fast, taking only a little longer than the scalar versions.  In addition, unlike many chip designers, ARM publishes detailed execution information, including cycle counts, pipeline hazards and scoreboarding, so you have something to start with during optimization!