Pointers, Pointer Arithmetic, and Messy Uncertain Death

CS 301 Lecture, Dr. Lawlor

Pointers are one of C++'s most powerful features, but also one of its most dangerous. (Cue dramatic music.) In assembly, we need pointers to access memory, which is where we go once we run out of registers.

In C++, we use the "address-of" & operator to get the address of a variable, returning a pointer. It's spelled the same as bitwise AND, but the two are unrelated. The corresponding dereference operator * looks at what the pointer points to, returning the underlying variable:

int x=3; // happy bunny
int *p=&x; // point to the happy bunny
std::cout<<"The pointer is "<<p<<endl;
int i=*p; // summon the happy bunny
return i;

(Try this in NetRun now!)

You can also do pointer arithmetic, where you change what the pointer points to. Pointer arithmetic is dangerous, because you can easily meddle with forces you can barely understand, much less control:

int x=3; // happy bunny
int *p=&x; // point to the happy bunny
for (int r=0;r<1000000;r++) p++; // move way past the bunnies
std::cout<<"The new pointer is "<<p<<endl;
int i=*p; // summon the EVIL DEMON MONSTER
return i;

(Try this in NetRun now!)

As you might expect, even reading from a way-bad pointer can cause your program to die horribly. So you have to make sure that what you're pointing to is really there. This is doubly true for writing--you have to make sure it's there, and that you're allowed to write to it. The standard way to do this is just to be very careful about how you write your pointer manipulation code.
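One way to "be very careful" is to check an index before you ever turn it into a pointer. Here's a minimal sketch of that idea; the helper name checked_read and its fallback parameter are made up for illustration, not part of the lecture code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: reads arr[index] only if index is inside the array,
// otherwise returns the caller's fallback value instead of touching memory.
int checked_read(const int *arr, std::size_t count, std::size_t index, int fallback) {
    assert(arr != nullptr);
    if (index >= count) return fallback; // would point way past the bunnies
    const int *p = arr + index;          // safe: p is still inside the array
    return *p;
}
```

With a four-element array holding 100 through 103, checked_read(arr, 4, 2, -1) returns 102, while checked_read(arr, 4, 1000000, -1) returns -1 without ever dereferencing the bad address.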

Here's some valid pointer manipulation code, where we use ++ to move the pointer to the next integer, -- to move it back to the previous integer, and then we add two to jump over two integers at once:

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
int *p=arr; /* points to arr[0] */
std::cout<<"At p: "<< *p <<endl;
p++; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< *p <<endl;
p++; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< *p <<endl;
p--; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< *p <<endl;
p=p+2; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2:   "<< *p <<endl;
return 0;

(Try this in NetRun now!)

Note that this means that arrays are just a series of items at increasing addresses in memory. That's all an array is.
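To make that concrete, here's a small sketch showing that indexing and pointer arithmetic name the same element, and that you can walk a whole array with nothing but a pointer. The function names are assumptions chosen for this example:

```cpp
#include <cassert>
#include <cstddef>

// True if arr[i] and *(arr+i) live at the same address (they always should).
bool same_element(const int *arr, std::size_t i) {
    assert(arr != nullptr);
    return &arr[i] == arr + i;
}

// Adds up count ints by stepping a pointer from the first element
// up to (but not including) one past the last element.
int sum_by_pointer(const int *arr, std::size_t count) {
    assert(arr != nullptr);
    int total = 0;
    for (const int *p = arr; p != arr + count; ++p) total += *p;
    return total;
}
```

For the array 100, 101, 102, 103 used above, sum_by_pointer(arr, 4) returns 406.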

In C/C++, the compiler knows you're pointing to an integer. So when you say "p=p+2", the compiler moves the pointer by two integers, which is a total of eight bytes (assuming the usual four-byte int). You can see that byte count by printing out the pointers as they move, like the following. (Note now we're printing "p" the pointer; not "*p" the integer.)

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
int *p=arr; /* points to arr[0] */
std::cout<<"At p: "<< p <<endl;
p++; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< p <<endl;
p++; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< p <<endl;
p--; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< p <<endl;
p=p+2; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2:   "<< p <<endl;
return 0;

(Try this in NetRun now!)
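If you'd rather compute the byte count than eyeball the printed addresses, a sketch like this works. The helper name byte_distance is an assumption for this example, and both pointers need to point into the same array:

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes from one int pointer to another within the same array.
// Casting to char pointers makes the subtraction count bytes instead of ints.
std::ptrdiff_t byte_distance(const int *from, const int *to) {
    assert(from != nullptr && to != nullptr);
    return reinterpret_cast<const char *>(to) - reinterpret_cast<const char *>(from);
}
```

Calling byte_distance(arr, arr + 2) gives 2*sizeof(int), which is 8 on a machine with four-byte ints.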

A pointer literally *is* just this byte address. You can do pointer arithmetic on byte counts in C++ by typecasting your pointers to "char *", but the syntax looks a little weird, because to access an int, you have to cast back to "int *":

int arr[4];
arr[0]=100;
arr[1]=101;
arr[2]=102;
arr[3]=103;
char *p=(char *)arr; /* points to the bytes in arr[0] */
std::cout<<"At p: "<< *(int *)p <<endl;
p+=4; /* move pointer down, to arr[1] */
std::cout<<"After p++: "<< *(int *)p <<endl;
p+=4; /* move pointer down some more, to arr[2] */
std::cout<<"After another p++: "<< *(int *)p <<endl;
p-=4; /* move pointer back, to arr[1] again */
std::cout<<"And then a p--: "<< *(int *)p <<endl;
p+=8; /* moves by 2 *ints*, to arr[3] */
std::cout<<"p=p+2:   "<< *(int *)p <<endl;
return 0;

(Try this in NetRun now!)

Byte pointers are useful to learn in C++, because they're all you get in assembly language!

Pointers in Assembly Language

In assembly language, the syntax for dereferencing an int pointer (to get at what it points to) looks like this:
DWORD [ pointer ]

The array-looking square brackets say to access memory. The "DWORD" says to access four bytes of that memory like an "int". (You also occasionally see "BYTE" accesses to memory.)
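For comparison, here's roughly what the two access sizes look like from the C++ side. This is only a sketch: the names read_dword and read_byte are made up, and it assumes int is four bytes, matching a DWORD:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Like "DWORD [p]": copy sizeof(int) bytes starting at p into an int.
// (Assumes int is four bytes, the same size as a DWORD.)
int read_dword(const unsigned char *p) {
    assert(p != nullptr);
    int value = 0;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Like "BYTE [p]": read just the one byte at p.
unsigned char read_byte(const unsigned char *p) {
    assert(p != nullptr);
    return *p;
}
```

If you store the int 123 and look at its bytes one at a time, exactly one of them is nonzero, while read_dword on the same address hands back the whole 123.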

For example, here we're dereferencing a pointer to a little statically allocated integer:

mov eax, DWORD [myIntPtr] ; read memory here (like C++: return *myIntPtr;)
ret 

myIntPtr:  ; A place in memory, where we're storing an integer.
	dd 123 ; "data DWORD", our integer

(Try this in NetRun now!)

You can copy a pointer value into a register, too. Here we're dereferencing a pointer stored in a register:

mov edx, someIntPtr ; copy the address someIntPtr into edx (like C++: p=someIntPtr;)
mov eax, DWORD [edx] ; read memory edx points to (like C++: return *p;)
ret 

someIntPtr:  ; A place in memory, where we're storing an integer.
	dd 123 ; "data DWORD", our integer

(Try this in NetRun now!)

A pointer to an array initially looks just like a pointer to anything else:

mov ecx, myArray ; ecx points to myArray  (like C++: p=arr;)
mov eax, DWORD [ecx] ; read memory pointed to by ecx (like C++: return *p;)
ret 

myArray:  ; A place in memory, where we're storing some integers.
	dd 100 ; "data DWORD", here our array element [0]
	dd 101 ; [1]
	dd 102 ; [2]
	dd 103 ; [3]

(Try this in NetRun now!)

But since we're pointing to an array, we can move the pointer to the next element of the array, and then dereference that. Careful! In assembly, you have to adjust pointers by a number of *bytes*, not a number of integers!

mov ecx,myArray ; ecx points to myArray
add ecx,4 ; point to the next integer (four bytes down)
mov eax, DWORD [ecx] ; read memory pointed to by ecx
ret 

myArray:  ; A place in memory, where we're storing some integers.
	dd 100 ; "data DWORD", here our array element [0]
	dd 101 ; [1]
	dd 102 ; [2]
	dd 103 ; [3]

(Try this in NetRun now!)

There's even a special optional assembly syntax for accessing arrays, where you compute the memory address by starting at one register (the start of the array), and adding in a scaled version of another register (the array index):

mov ecx,myArray ; ecx points to myArray
mov edx,2 ; we want integer index [2]
mov eax, DWORD [ecx + 4*edx] ; read memory from array ecx, at int index edx
ret 

myArray:  ; A place in memory, where we're storing some integers.
	dd 100 ; "data DWORD", here our array element [0]
	dd 101 ; [1]
	dd 102 ; [2]
	dd 103 ; [3]

(Try this in NetRun now!)
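The address math in that scaled-index form can be spelled out in C++ with byte pointers. This is a sketch only; load_scaled is a hypothetical name, and sizeof(int) plays the role of the 4 in the assembly:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Models "DWORD [base + 4*index]": compute the address in bytes,
// then read one int from that address.
int load_scaled(const void *base, std::size_t index) {
    assert(base != nullptr);
    const char *address = static_cast<const char *>(base) + sizeof(int) * index;
    int value = 0;
    std::memcpy(&value, address, sizeof value);
    return value;
}
```

With the same four values as myArray, load_scaled(myArray, 2) returns 102, the same element the assembly version reads.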

One limitation of the little static arrays written above is that we can't write to these arrays. For that, we need read-write temporary storage space, and we get that from the stack.

The Stack

There's one super-important pointer that you have to use all the time in assembly language: the "stack pointer", stored in register "esp" (Extended Stack Pointer). "The stack" is a frequently-used area of memory that functions as temporary storage--say, as space for local variables when a function runs out of room, or to pass parameters to the next function.

Conceptually, the stack is divided into two areas: on top is the space that's in use (that you can't change!), and then below it the space that isn't in use (free space). The stack pointer points to the lowest in-use byte of the stack, right at the boundary between the two areas. The standard convention is that when your function starts up, you can claim some of the stack by moving the stack pointer down--this indicates to any functions you might call that you're using those bytes of the stack. You can then use that memory for anything you want, as long as you move the stack pointer back before your function returns.

Sadly, if you screw up the stack, such as by forgetting to move the stack pointer back, or overwriting part of the stack that isn't yours, then the function that called you (such as main) will crash horribly. So be careful with the stack!

Here's how we allocate one integer on the stack, then read and write it:

sub esp,4 ; I claim the next four bytes in the name of... me!

mov DWORD [esp],1492 ; store an integer into our stack space
mov eax,DWORD [esp] ; read our integer from where we stored it

add esp,4 ; Hand back the stack space
ret

(Try this in NetRun now!)

Here's how we'd allocate one hundred integers on the stack, then use just one of them:

sub esp,400 ; I claim the next four hundred bytes

mov edi,esp ; points to the start of our 100-integer array
add edi,160 ; jump down to integer 40 in the array
mov DWORD [edi],1492 ; store an integer into our stack space
mov eax,DWORD [edi] ; read our integer from where we stored it

add esp,400 ; Hand back the stack space
ret

(Try this in NetRun now!)
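Here's the same sequence replayed in C++ on a toy model, just to make the bookkeeping visible. Everything in it is an assumption for illustration: a byte vector stands in for stack memory, esp and edi are ordinary index variables, and the real hardware stack is not involved:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Replays the 100-integer example on a toy stack and returns what "eax" holds.
int use_integer_40() {
    std::vector<unsigned char> memory(1024);       // toy stack memory
    std::size_t esp = memory.size();               // nothing claimed yet
    const std::size_t start = esp;
    const std::size_t claimed = 100 * sizeof(int); // 400 bytes with 4-byte ints

    assert(esp >= claimed);
    esp -= claimed;                                // sub esp,400
    std::size_t edi = esp;                         // mov edi,esp
    edi += 40 * sizeof(int);                       // add edi,160
    const int stored = 1492;
    std::memcpy(&memory[edi], &stored, sizeof stored); // mov DWORD [edi],1492
    int eax = 0;
    std::memcpy(&eax, &memory[edi], sizeof eax);       // mov eax,DWORD [edi]
    esp += claimed;                                // add esp,400

    assert(esp == start); // the stack pointer is back where it started
    return eax;
}
```

The final assert is the important habit: whatever you subtract from the stack pointer on the way in, you add back on the way out.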

There are even special instructions for putting stuff onto the stack and taking it back off, called "push" and "pop":

"push thing" makes space for thing on the stack, and copies the value of thing into memory there. It's the same as "sub esp,4" and then "mov DWORD [esp],thing".
"pop thing" copies whatever is on top of the stack into thing, then removes that space from the stack. It's the same as "mov thing,DWORD [esp]" followed by "add esp,4".

These are handy if you've only got one integer to stick on or pull off the stack. In turn, *this* is really useful for function arguments, which by convention are stored on top of the stack when you call the function:

push 19
extern print_int
call print_int
pop eax ; MUST clean up the stack
ret

(Try this in NetRun now!)

This prints the "19" that's stored on top of the stack. In general, all your function arguments are stored on the stack. This means the stack is a rather funny mix of function arguments, local and temporary variables, and "other stuff" (to be discussed on Friday!).
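
The "push" and "pop" equivalences given earlier can also be sketched in C++. This is a toy model under stated assumptions: ToyStack, toy_push, and toy_pop are invented names, a byte vector stands in for memory, and sizeof(int) plays the role of the 4:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stack: esp is an index into memory and starts at the very top.
struct ToyStack {
    std::vector<unsigned char> memory;
    std::size_t esp;
    explicit ToyStack(std::size_t bytes) : memory(bytes), esp(bytes) {}
};

// "push thing": make space, then copy thing into that space.
void toy_push(ToyStack &s, int thing) {
    assert(s.esp >= sizeof(int));                        // room left to claim?
    s.esp -= sizeof(int);                                // sub esp,4
    std::memcpy(&s.memory[s.esp], &thing, sizeof thing); // mov DWORD [esp],thing
}

// "pop thing": copy the top of the stack out, then give the space back.
int toy_pop(ToyStack &s) {
    assert(s.esp + sizeof(int) <= s.memory.size());      // anything to pop?
    int thing = 0;
    std::memcpy(&thing, &s.memory[s.esp], sizeof thing); // mov thing,DWORD [esp]
    s.esp += sizeof(int);                                // add esp,4
    return thing;
}
```

Pushing 19 and then 7 and popping twice gives back 7 and then 19, and leaves esp exactly where it started.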