Course Review for Final Exam

CS 301 Lecture, Dr. Lawlor

Here's a compressed summary of the topics we've covered in the entire class so far. See the lecture notes for all the gory details.

Data storage: bits, hex digit (4 bits), byte (8 bits), int (32 bits; 4 bytes), long long (64 bits; 8 bytes), float (32 bits; 4 bytes), double (64 bits; 8 bytes), SSE __m128 (128 bits; 16 bytes).
Bitwise operations: & | ^ ~ << >>, and how to use them to simulate branch operations.
Machine code: representing assembly language instructions as bytes. Running machine code on an emulated CPU, or the real CPU.
Registers: trashable (eax, ecx, edx, xmm0...xmm7) versus preserved (esp, ebp, esi, edi, ebx) registers. You can save preserved registers by pushing them at the start of your function and popping them before you return.
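
As a rough illustration of using bitwise operations to stand in for a branch, here is a minimal C++ sketch. The function name select_if_less is made up for this example; the mask trick is the point, not the name.

```cpp
#include <cstdint>

// Branch-free version of: (a < b) ? x : y
// select_if_less is a hypothetical name used only for this sketch.
std::uint32_t select_if_less(std::int32_t a, std::int32_t b,
                             std::uint32_t x, std::uint32_t y) {
    // (a < b) is 0 or 1; subtracting it from 0 gives 0x00000000 or 0xFFFFFFFF.
    std::uint32_t mask = 0u - static_cast<std::uint32_t>(a < b);
    // Keep the bits of x where mask is all ones, and the bits of y where it is all zeros.
    return (x & mask) | (y & ~mask);
}
```

Calling select_if_less(1, 2, 10, 20) picks 10, and select_if_less(2, 1, 10, 20) picks 20, with no conditional jump in the selection itself.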
Branching / jumps / flow control:

All assembly flow control boils down to "if (something) goto somewhere;" which is written as "cmp eax,7; jne somewhere;"
A single "if" isn't too hard to write. But you really have to pay attention for complicated multi-part conditionals like "if (A || (B && C))"!
Loops are simple in principle: if we're not done with the loop, jump back to the start of the loop.
Be careful of iteration 0. Sometimes the loop body shouldn't execute at all, like "for (int i=0;i<n;i++)" when n==0.
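
One way to practice the multi-part conditional and the zero-iteration loop described above is to write them in C++ using only "if (something) goto somewhere;", which is the same shape the assembly has to take. This is a sketch; the function and label names are invented for illustration.

```cpp
// if (A || (B && C)) result = 1; else result = 0;
// written with nothing but "if (cond) goto label;".
int multipart(bool A, bool B, bool C) {
    int result = 0;
    if (A) goto do_body;     // A true: short-circuit straight to the body
    if (!B) goto skip_body;  // A false, B false: the whole condition is false
    if (!C) goto skip_body;  // A false, B true, C false: still false
do_body:
    result = 1;
skip_body:
    return result;
}

// for (int i = 0; i < n; i++) sum += i;
int sum_below(int n) {
    int sum = 0;
    int i = 0;
    goto loop_compare;       // test first, so n <= 0 runs zero iterations
loop_start:
    sum += i;
    i++;
loop_compare:
    if (i < n) goto loop_start;
    return sum;
}
```

Each "if (...) goto" line corresponds to one cmp plus one conditional jump, and the initial "goto loop_compare" plays the same role as a jmp to the loop's comparison.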

Memory access:

You can allocate memory from the stack ("sub esp,100"), via malloc ("push 100; call malloc;"), or statically in your program ("section .data; dd 3").
Given a pointer to the first element of an int array, element i of the array is 4*i bytes higher in memory ("mov eax, DWORD[ecx + 4*edx]").
Given a pointer to a class, the pointer actually points to the class's first member. Subsequent members are higher in memory, but be careful, because the compiler may add padding for alignment!
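
The address arithmetic above can be checked from C++. This sketch assumes a 4-byte int (it uses sizeof(int) rather than a literal 4), and the struct Padded is a made-up example of where a compiler typically inserts padding.

```cpp
#include <cstddef>

// Hypothetical struct: a 1-byte member followed by an int.
// Compilers normally pad after 'tag' so that 'value' is aligned.
struct Padded {
    char tag;    // offset 0
    int value;   // usually offset 4, not offset 1
};

// Compute the address of arr[i] by hand, the way
// "mov eax, DWORD[ecx + 4*edx]" does: base pointer plus sizeof(int)*i bytes.
int load_element(const int *arr, std::size_t i) {
    const unsigned char *base = reinterpret_cast<const unsigned char *>(arr);
    const int *elem = reinterpret_cast<const int *>(base + sizeof(int) * i);
    return *elem;
}

// Byte offset of 'value' from the start of the struct, padding included.
std::size_t value_offset() {
    return offsetof(Padded, value);
}
```

If you read a struct member from assembly, use the offset the compiler actually chose (as offsetof reports) rather than adding up the member sizes by hand.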

The stack:

Memory at or above the stack pointer is claimed; memory below the stack pointer is unclaimed.
Move the stack pointer down to claim space (sub); move it up to release space (add).
"push" moves the stack pointer down and copies a new value into the claimed space.
"pop" copies the stack value out, and moves the stack pointer back.
You MUST restore the stack pointer to its original value before you return!
Sometimes it's handy to save a backup copy of the original stack pointer in register ebp.

Function calls:

Declare an outside function to be callable with "extern someFunc".
Call a function with "call someFunc".
Declare one of your own functions to be visible outside with "global myFunc". From C++, declare your function extern "C" void myFunc(int someParameter);
Push a function's arguments onto the stack in right-to-left order (so the first argument is sitting on top of the stack).
A function's return address gets pushed on the stack during a "call" instruction, and popped off again during a "ret" instruction.
So you grab your own arguments from the stack by skipping over the return address (at DWORD[esp]) to get the first argument at DWORD[esp+4], the second argument at DWORD[esp+8], and so on.

Floating point in general:

Floating point numbers actually have three separate parts: the sign bit (positive or negative), the exponent (the power of two / location of the decimal point), and the mantissa (fraction part).
Each of these parts is a fixed size: for a single-precision float, 1 sign bit, 8 exponent bits, and 23 mantissa bits.
The exponent field is stored "biased" by 127, so you represent 2⁰ with an exponent field of 127, 2¹ with an exponent field of 128, 2² with an exponent field of 129, etc.
The mantissa field is stored with an "implicit leading 1" in front of the decimal point. This doesn't actually appear in the bits of the float, but does contribute to the float's value!
The mantissa field is a fixed size. This fixed size isn't enough to correctly store some values, including infinitely repeating decimals like 1/3, or infinitely repeating binary numbers like 1/10, or a variety of large values added to or subtracted from smaller values. In these cases, you get "roundoff" error--the wrong answer!
Roundoff can often be reduced by changing your data type (using a different number of mantissa bits), or changing the order you perform the operations.
Due to roundoff, floating point values don't give the same answer when added in different orders: (a+b) + c might be slightly different from a + (b+c).
Unlike integers, floats never wrap around--if you keep adding to a float, it eventually stops changing, either at a finite value due to roundoff, or at the special "infinity" value.
Floats have several special values: "infinity", "nan" (an error value), and "denormal" (very tiny) numbers. There is normally a performance impact from processing these special values.
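
The field layout and the roundoff behavior described above can be poked at from C++. This is a sketch that assumes a 32-bit IEEE single-precision float; the struct and function names are invented for the example, and the volatile temporaries are there only to force each intermediate result to be rounded to float.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical holder for the three parts of a float.
struct FloatFields {
    std::uint32_t sign, exponent, mantissa;
};

// Pull apart the 1 sign bit, 8 exponent bits, and 23 mantissa bits.
FloatFields split_float(float f) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the raw bits of the float
    FloatFields out;
    out.sign = bits >> 31;
    out.exponent = (bits >> 23) & 0xFF;   // still biased by 127
    out.mantissa = bits & 0x7FFFFF;       // implicit leading 1 is not stored
    return out;
}

// (a+b) + c
float add_left_first(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

// a + (b+c)
float add_right_first(float a, float b, float c) {
    volatile float t = b + c;
    volatile float r = a + t;
    return r;
}

// Adding 1 to 2^24 no longer changes a float: 2^24 + 1 needs 25 mantissa bits.
bool stops_changing_at_2_to_24() {
    volatile float big = 16777216.0f;     // 2^24
    volatile float next = big + 1.0f;
    return next == big;
}
```

For example, split_float(1.0f) has an exponent field of 127 and a mantissa field of 0. With a = 1e20f, b = -1e20f, c = 1.0f, add_left_first gives 1.0f, while add_right_first gives 0.0f because the 1.0f is lost to roundoff when added to -1e20f.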

1970's stack-based floating point assembly:

"fld" or "fild" instructions load a float or int from memory into the "floating point register stack".
"fst" or "fist" instructions store the top of the floating-point register stack to memory. "fstp" stores and then pops the value.
"faddp" adds the top two things on the floating point register stack, and replaces them with their sum. Ditto for "fmulp", etc.

New SSE floating point assembly:

"movss xmm1, DWORD[ecx]" loads a single float from memory.
"movss DWORD[ecx], xmm1" stores a single float to memory.
"cvtsi2ss xmm1,eax" converts one integer to one single-precision float. "cvtss2si" goes the other way.
"addss xmm1,xmm4" adds single floats, just like "subss", "mulss", "divss", "sqrtss", etc.
"addps xmm1,xmm4" adds packed floats--four floats at once. "subps", "mulps", "divps", "sqrtps", etc do the same thing. Doing four things at once can be substantially faster than doing one thing at a time!
"movaps xmm1,[ecx]" loads four floats from aligned memory. The memory must be 16-byte aligned, or the instruction will crash!
"movups xmm1,[ecx]" loads four floats from unaligned memory. The memory need not be 16-byte aligned, but this instruction can be substantially slower than movaps.
SSE can also be accessed from C++ via the "xmmintrin.h" functions: __m128 datatype, _mm_load_ps, _mm_add_ps, etc.
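
A minimal sketch of the C++ intrinsics route, assuming an x86 compiler that provides xmmintrin.h. The function name add4 is made up for this example. It uses the unaligned _mm_loadu_ps and _mm_storeu_ps (the movups forms) so it works on any float pointer; _mm_load_ps and _mm_store_ps are the aligned forms and need 16-byte-aligned addresses.

```cpp
#include <xmmintrin.h>

// out[0..3] = a[0..3] + b[0..3], four floats in one addps.
// add4 is a hypothetical helper name.
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);      // unaligned load of four floats (movups)
    __m128 vb = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  // packed add (addps)
    _mm_storeu_ps(out, sum);          // unaligned store of four floats
}
```

Given a = {1,2,3,4} and b = {10,20,30,40}, add4 fills out with {11,22,33,44}.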

There will be questions on the final covering everything, from bits to function calls. But the exam's coverage will mostly be on the performance and floating-point stuff we've done since the midterm.

Examples

global foo
foo:
; Save registers
	push ebx
	push esi
	push edi

; Figure out how many values are coming
	extern read_input
	call read_input
	; eax == number of elements to read
	mov esi,0 ; i
	mov edi,eax ; n

; Allocate space for read-in values
	imul eax,4 ; number of bytes to allocate
	push eax
	extern malloc
	call malloc
	add esp,4 ; clean off stack
	mov ebx,eax; ebx == pointer to malloc'd region

; Loop over input values
	; for (i=0;i<n;i++) arr[i]=read_input();
	jmp loopcompare ; subtle: need this for n<=0 case...
loopstart:
	call read_input
	mov DWORD[ebx + 4 * esi],eax
	inc esi
loopcompare:
	cmp esi,edi
	jl loopstart

; Print out our array of input values
	push edi; number of ints to print
	push ebx ; pointer to ints to print
	extern iarray_print
	call iarray_print
	add esp,8 ; clean stack

; Clean up allocated memory
	push ebx
	extern free
	call free
	add esp,4

; Restore registers and return
	pop edi
	pop esi
	pop ebx	
	ret

(Try this in NetRun now!)

global foo
foo:
	extern read_input
	call read_input

; Don't print if the number is <= 8
	cmp eax,8
	jle skip_it

; Convert eax to float, multiply by pi, store to myRet
	cvtsi2ss xmm1,eax
	movss xmm2,DWORD[myPI]
	mulss xmm1,xmm2
	movss DWORD[myRet],xmm1

; Print out myRet
	push 1; number of floats to print
	push myRet
	extern farray_print
	call farray_print
	add esp,8 ; clean stack

skip_it:
	jmp foo ; keep re-running read_input and our little function (loops forever)

	ret ; never reached, because of the jmp above

section .data
myPI:
	dd 3.14159265358979
myRet:
	dd 0.0

(Try this in NetRun now!)