Locality in Memory Access & Branching

CS 301: Assembly Language Programming Lecture, Dr. Lawlor

Locality, also known as "coherence", is the property that:
The future will be like the past.
Hardware designers use locality to predict the future, which allows the hardware to start work early, so it finishes sooner.

Software can be written with good locality, which helps the hardware run it faster, or with poor locality, which makes it hard for the hardware to run it fast.

We've already seen that memory access locality has a serious effect on cache performance: the hardware's cache prefetcher, for example, only helps when accesses follow a predictable pattern.

Many of these same effects apply to branches.  For example, this simple loop runs very fast, about 0.6 nanoseconds per iteration: since rcx never exceeds 1500, the "jle skip" branch is always taken, which makes the CPU branch predictor's job very easy.
mov rax,0
mov rcx,0 ; loop counter
start:
	cmp rcx,1500
	jle skip
	add rax,rcx
	skip:
	add rcx,1
	cmp rcx,1000 ; loop runs 1000 times
	jle start
ret

(Try this in NetRun now!)

This loop is a little slower, 0.63 ns/iteration, because the branch predictor hits a hiccup partway through, when the branch switches from always taken (while rcx is at most 500) to never taken:
mov rax,0
mov rcx,0 ; loop counter
start:
	cmp rcx,500
	jle skip
	add rax,rcx
	skip:
	add rcx,1
	cmp rcx,1000 ; loop runs 1000 times
	jle start
ret

(Try this in NetRun now!)

This loop runs at *half* the speed of the other two, 1.2 ns/iter, because the branch unpredictably switches between taken and not taken, so the branch predictor guesses wrong repeatedly:
mov rax,0
mov rcx,0 ; loop counter
start:
	mov rdx,rcx
	and rdx,0x35
	cmp rdx,0x13
	jle skip
	add rax,rcx
	skip:
	add rcx,1
	cmp rcx,1000 ; loop runs 1000 times
	jle start
ret

(Try this in NetRun now!)

(Try switching the compare value around; the worst case is about half taken, half not taken.)