Pointers and Pointer Arithmetic

CS 301 Lecture, Dr. Lawlor

Pointers in C++ are considered fairly tricky. Part of the problem is data types: everybody understands an "int", but what is an "int *" (pointer to int) really?

Pointers in assembly language have much simpler syntax: [rax] means go out to memory at the address stored in register rax. That address is always measured in bytes, and is called a "pointer", but it's just a number in rax.

THE key to understanding pointers is to realize that everything in memory is stored as a flat one-dimensional sequence of bytes, and a pointer is just the number of the byte where something starts. The first byte is at address 0x00000000, and the last byte is at address 0xFFFF...FFF. To get to the next byte, add one to the pointer. To get to the next 32-bit int, add four (bytes) to the pointer. To get to the next 64-bit long, add eight (bytes) to the pointer.
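
To see the byte counting from the C++ side, here is a minimal sketch (not part of the original lecture code) that measures how far apart neighboring array elements are. The helper name `bytes_between` and the three `next_*_step` functions are made up for illustration, and the fixed-width types `std::int32_t` and `std::int64_t` stand in for "32-bit int" and "64-bit long", since the size of a plain `long` depends on the platform.

```cpp
#include <cassert>   // only needed for the assert in the usage note
#include <cstddef>
#include <cstdint>

// Hypothetical helper: distance in bytes between two addresses.
// Viewing both pointers as char pointers makes the subtraction count bytes.
template <typename T>
std::ptrdiff_t bytes_between(const T *from, const T *to) {
    return reinterpret_cast<const char *>(to) -
           reinterpret_cast<const char *>(from);
}

// Step from one byte to the next byte.
std::ptrdiff_t next_byte_step() {
    unsigned char a[2] = {};
    return bytes_between(a, a + 1);
}

// Step from one 32-bit int to the next.
std::ptrdiff_t next_int_step() {
    std::int32_t a[2] = {};
    return bytes_between(a, a + 1);
}

// Step from one 64-bit long to the next.
std::ptrdiff_t next_long_step() {
    std::int64_t a[2] = {};
    return bytes_between(a, a + 1);
}
```

For example, `assert(next_int_step() == 4);` should hold. Note that in C++ the "+ 1" on a typed pointer is scaled by the element size for you; in assembly you do that multiplication yourself.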

Human	C++	Assembly
Declare a long integer.	long y;	rdx (nothing to declare, just use a register)
Copy one long integer to another.	y=x;	mov rdx,rax
Declare a pointer to a long.	long *p;	rax (nothing to declare, use any 64-bit register)
Dereference (look up) the long.	y=*p;	mov rdx,[rax]
Find the address of a long.	p=&y;	mov rax,place_you_stored_Y
Access an array (easy way)	y=p[2];	(sorry, no easy way exists!)
Access an array (hard way)	p=p+2; y=*p;	add rax,2*8 (move forward by two 8-byte longs), then mov rdx,[rax] (grab that long)
Access an array (too clever)	y=*(p+2);	mov rdx,[rax+2*8] (yes, that actually works!)
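
The three array rows of the table can be compared side by side in C++. This is a sketch under my own naming: `demo_arr`, `read_easy`, `read_clever`, and `read_by_bytes` are hypothetical names, and `std::int64_t` is used in place of `long` so the element is 8 bytes on any platform. The last function does the byte arithmetic by hand, the way the assembly column does.

```cpp
#include <cassert>   // only needed for the assert in the usage note
#include <cstdint>
#include <cstring>

// Hypothetical sample data: three 8-byte values, back to back in memory.
const std::int64_t demo_arr[3] = {4, 7, 9};

// Easy way: the compiler scales the index 2 by the element size.
std::int64_t read_easy(const std::int64_t *p) { return p[2]; }

// Too clever: explicit pointer arithmetic, still scaled by the compiler.
std::int64_t read_clever(const std::int64_t *p) { return *(p + 2); }

// Assembly-style: treat the pointer as a byte address and add 2*8 by hand,
// mirroring mov rdx,[rax+2*8].
std::int64_t read_by_bytes(const std::int64_t *p) {
    const unsigned char *bytes = reinterpret_cast<const unsigned char *>(p);
    std::int64_t y;
    std::memcpy(&y, bytes + 2 * sizeof(std::int64_t), sizeof y);
    return y;
}
```

All three return the same element, so `assert(read_by_bytes(demo_arr) == read_easy(demo_arr));` should hold.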

So far, the only thing we have declared in memory from assembly is code. We can look at that machine code by dereferencing a pointer to it:

mov rcx,my_code ; rcx == address of the code
mov rax,[rcx] ; load the bytes of machine code
ret

my_code:
  add eax,0  ; this *happens* to be opcode 5, in machine code

(Try this in NetRun now!)

This returns 5, because the "add eax" opcode is 0x05. Clearly, this is not an easy way to make arbitrary constants! Luckily, there's a "pseudo instruction" named "dq" (data QWORD) that inserts an 8-byte constant into the code at that point:

mov rcx,my_code ; rcx == address of the code
mov rax,[rcx] ; load the bytes of machine code
ret

my_code:
  dq 4 ; this inserts a literal value, stored as an 8-byte 64-bit "long".

(Try this in NetRun now!)

Note that the disassembly shows this as "add al", because the disassembler tries to decode every byte as an instruction, but we're just using it as a value!
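
To see why a data value can look like an instruction, it helps to view the value one byte at a time. The sketch below is my own illustration (the function name `byte_of_qword` is hypothetical) and assumes a little-endian machine such as x86-64, where the lowest byte of a value is stored first.

```cpp
#include <cassert>   // only needed for the assert in the usage note
#include <cstddef>
#include <cstdint>
#include <cstring>

// Return byte number i (0..7) of the in-memory form of a 64-bit value.
// This is roughly the byte sequence that "dq value" places in the program.
unsigned char byte_of_qword(std::uint64_t value, std::size_t i) {
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);  // copy the raw bytes out
    return bytes[i];
}
```

On a little-endian machine, `assert(byte_of_qword(4, 0) == 0x04);` should hold and the other seven bytes are zero. The byte 0x04 is also the opcode for "add al" with an 8-bit constant, which is why the disassembler prints that.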

An "array" is just a sequence of values stored at ascending addresses in memory. Since we list the "dq" values in order, they show up in memory in that order, so we can do pointer arithmetic to pick the value we want. This returns 7:

mov rcx,my_arr ; rcx == address of the array
mov rax,[rcx+1*8] ; load element 1 of array
ret

my_arr:
  dq 4 ; array element 0, stored at [my_arr]
  dq 7 ; array element 1, stored at [my_arr+8]
  dq 9 ; array element 2, stored at [my_arr+16]

(Try this in NetRun now!)

Keep in mind that each element here is a "dq" or an 8-byte long, so I move down by 8 bytes during indexing, and I load into the 64-bit "rax".

If the array holds 4-byte integers, we'd declare them with "dd" (data DWORD), move down by 4 bytes per int array element, and load the answer into a 32-bit register like "eax". But the pointer register is always 64 bits!

mov rcx,my_arr ; rcx == address of the array
mov eax,[rcx+1*4] ; load element 1 of array
ret

my_arr:
  dd 0xaaabbbcc ; array element 0, stored at [my_arr]
  dd 0xc001007  ; array element 1, stored at [my_arr+4]

(Try this in NetRun now!)

Here's how you declare, store, and address values of all the sizes. The register names are just examples so you get the sizes right; you can load or store from any register.

Bits	C++	Assembly Create	Assembly Read	Example
8	char	db (data byte)	mov al, BYTE [rcx+i*1]	(Try this in NetRun now!)
16	short	dw (data WORD)	mov ax, WORD [rcx+i*2]	(Try this in NetRun now!)
32	int	dd (data DWORD)	mov eax, DWORD [rcx+i*4]	(Try this in NetRun now!)
64	long	dq (data QWORD)	mov rax, QWORD [rcx+i*8]	(Try this in NetRun now!)

It's extremely easy to mismatch these sizes. For example, if I declare values with dw (2-byte shorts), but load them into eax (4 bytes), I'll have loaded two values into one register. So this code returns 0xbeefaabb, which is two 16-bit values combined into one 32-bit register (the first value lands in the low 16 bits, because x86 is little-endian):

mov rcx,my_arr ; rcx == address of the array
mov eax,[rcx] ; load element 0 of array (OOPS!  32-bit load!)
ret

my_arr:
  dw 0xaabb ; array element 0, stored at [my_arr]
  dw 0xbeef ; array element 1, stored at [my_arr+2]

(Try this in NetRun now!)
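
The same size mismatch can be reproduced in C++. This is a sketch of my own, not lecture code: the names `demo_words`, `load_16`, and `load_32` are hypothetical, and the expected result assumes a little-endian machine such as x86-64.

```cpp
#include <cassert>   // only needed for the assert in the usage note
#include <cstdint>
#include <cstring>

// Same data as my_arr: two 16-bit values, back to back in memory.
const std::uint16_t demo_words[2] = {0xaabb, 0xbeef};

// Matching sizes: a 16-bit load of element 0.
std::uint16_t load_16() {
    std::uint16_t v;
    std::memcpy(&v, demo_words, sizeof v);  // copies 2 bytes
    return v;
}

// Mismatched sizes: a 32-bit load from the same address
// swallows element 0 and element 1 together.
std::uint32_t load_32() {
    std::uint32_t v;
    std::memcpy(&v, demo_words, sizeof v);  // copies 4 bytes
    return v;
}
```

On a little-endian machine, `assert(load_32() == 0xbeefaabbu);` should hold, matching what the assembly version returns.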

You can reduce the likelihood of this type of error by adding an explicit memory size specifier, like "WORD" below. That makes this an error when the code is assembled ("error: mismatch in operand sizes") instead of returning the wrong value at runtime.

mov rcx,my_arr ; rcx == address of the array
mov eax, WORD [rcx] ; load element 0 of array (OOPS!  WORD is 16 bits, eax is 32: rejected)
ret

my_arr:
  dw 0xaabb ; array element 0, stored at [my_arr]
  dw 0xbeef ; array element 1, stored at [my_arr+2]

(Try this in NetRun now!)

If we really wanted to load a 16-bit value into a 32-bit register, we could use "movzx" (unsigned) or "movsx" (signed) instead of a plain "mov".
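
The difference between the two widening loads shows up in C++ as the difference between unsigned and signed conversions. This is a minimal sketch with hypothetical function names; a compiler will typically use movzx and movsx (or equivalent instructions) for these conversions, but that is up to the compiler.

```cpp
#include <cassert>   // only needed for the assert in the usage note
#include <cstdint>

// Like movzx: widen a 16-bit value by filling the upper 16 bits with zeros.
std::uint32_t widen_unsigned(std::uint16_t v) { return v; }

// Like movsx: widen a 16-bit value by copying its sign bit
// into the upper 16 bits, so negative values stay negative.
std::int32_t widen_signed(std::int16_t v) { return v; }
```

The 16-bit pattern 0xbeef is 48879 as an unsigned value and -16657 as a signed one. So `widen_unsigned(0xbeef)` gives 0x0000beef, while `widen_signed(-16657)` gives -16657, whose 32-bit pattern is 0xffffbeef.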

Loading from the wrong place, or loading the wrong amount of data, is an INCREDIBLY COMMON problem when using pointers, in any language. You WILL make this mistake at some point over the course of the semester, so be careful!