Pointers in C and Assembly

CS 301 Lecture, Dr. Lawlor

Here's some code to access an array in C++:

int *arr=new int[10];
for (int i=0;i<10;i++)
	arr[i]=(i*10);
return arr[7];

(executable NetRun link)

There are a bunch of weird symbols in C and C++ used to work with pointers:

ptr[i]: Array index operator. Not really that weird.
*ptr: Pointer dereference operator. Lets you read or write the value the pointer points to, not the pointer itself. For example, "*ptr=3;" stores 3 in the int that ptr points to.
&val: Address-of operator--returns a pointer to val. This turns a value into a pointer, which is the opposite of the dereference operator.
ptr+1: Pointer arithmetic. Curious because if "ptr" points to 4-byte ints, ptr+1 points 4 bytes past ptr, not 1 byte like you might expect!

Note that the "arr[7]" in the last line of the code above can be written in any of these ways, which are all equivalent:

arr[7]. Uses arr as an array, and grabs index 7.
*(arr+7). Uses arr as a pointer, adds 7 (elements!), and dereferences.
*(7+arr). Same pointer arithmetic, but different order. Big deal.
7[arr]. Looks like a compile error, but C and C++ are actually perfectly happy with it.
(&*arr)[7]. "*arr" is the first element of arr. "&*arr" is the address of the first element of arr, which is... back to arr again!
(&*&*&*&*&*&*&*arr)[7]. Switches back and forth between arr and the first element a half dozen times, then finally settles on element 7.

It's a curious fact, but arrays are usually indistinguishable from pointers in C or C++.

Pointers in plain old C

C can be thought of as just a portable assembler--most simple constructs in C correspond to one line, or a few lines, of assembly.

So you can actually learn a lot about how to access memory using pointers by writing some low-level, old school C code.

Here's what you'd write in plain C to do the array arithmetic above. Be careful! The "malloc" routine takes as input the number of *bytes* to allocate, not the number of *ints*. We're using the "sizeof" operator to return the number of bytes in an int.

int i;
int *arr=malloc(10*sizeof(int));
for (i=0;i<10;i++)
	arr[i]=(i*10);
return arr[7];

(executable NetRun link)

In assembly, you can store a pointer in any register, such as eax, and do pointer arithmetic using the normal arithmetic instructions. So
add eax,28
might be a regular arithmetic operation (if you're thinking of eax as a normal value), or it might be pointer arithmetic (if you're thinking of eax as a pointer). You can't tell without seeing where eax was loaded from, and how it's used--there is no type information in assembly!

To dereference a pointer in assembly, you write it in brackets, like "[eax]". This treats eax as a pointer, and accesses the memory it points to.

To summarize,

C/C++	Assembly
int *p;	; Not needed--no types! (Woo hoo!) Er, now be careful, kids...
p=malloc(40);	push 40 ; Function arguments go on the stack in 32-bit x86
	extern malloc
	call malloc
	add esp,4 ; Undo "push"
	; Malloc's return value, a pointer, comes back in eax
p++;	add eax,4; Subtle: advance 1 int, by advancing 4 bytes
int i=*p;	mov ecx, [eax]; Treat eax as a pointer, and copy out the value it points to

Because ecx is 32 bits, the "mov" above is a 32-bit move--it reads 4 bytes from memory.

But because assembly doesn't have type information, sometimes the assembler can't figure out what you mean by a line, like "mov [eax],3". Clearly, this sets something to 3. But do you want to have eax pointing to a single byte (or char), a short, a long, or what? This will manifest itself as the fairly self-explanatory "error: operation size not specified" in NASM, but sadly you get the confusing "invalid combination of opcode and operands" in YASM.

The solution to this is just to tell the assembler what size data you're trying to point to. The sizes available in NASM/YASM are:

"byte", 8 bits. So "mov byte [eax],3" treats eax as a char *, and sets the byte it points to to 3.
"word", 16 bits. "mov word [eax],3" treats eax as a short *. On every other machine in the universe, "word" means the size of the registers. On x86, "word" means "the size the registers used to be back in DOS", which is 16 bits.
"dword", 32-bits. "mov dword [eax],3" treats eax as an int *. "dword" is still common in Windows. You can also use "long", but it's always 4 bytes unlike the C long (which depends on the machine).
"qword", 64-bits. This only works in 64-bit mode.