Optimization: Loop Interchange & Floating-Point
CS 301 Lecture, Dr. Lawlor, 2005/11/14
Memory accesses have performance that's:
- Good *if* you're accessing a small buffer, because the buffer will be copied to fast cache memory.
- Good *if* you're accessing sequentially, because then the memory chips don't have to change rows.
- Terrible *if* you're accessing a big buffer (too big to fit in cache) in a nonsequential way.
So if you're getting bad performance, you can either:
- Make things smaller (to try to fit in cache)
- Make your accesses more regular (to get sequential access)
For example, this program has terrible performance-- 75ns/pixel:
(Executable NetRun Link)
enum {sx=512};
enum {sy=512};
double img[sx*sy];
int inc_img(void) {
  for (int x=0;x<sx;x++)
    for (int y=0;y<sy;y++)
      img[x+y*sy]++; /* inner loop jumps by sy doubles--nonsequential access */
  return 0;
}
int foo(void) {
  double t=time_function(inc_img);
  printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
  return 0;
}
We can speed the program up by either:
- Making "img" smaller, by changing it to a "char" (1 byte) instead of "double" (8 bytes). This speeds the program up to 6ns/pixel (try it!).
- Making the access more regular, by either interchanging the x and y loops, or by changing how the array is indexed (so it's "img[x*sy+y]"). This speeds the program up to 4ns/pixel (see the sketch after this list).
- Doing both--small data with regular access. This speeds the program up to about 1ns/pixel--a 70x improvement!
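Here's a minimal sketch of the "regular access" fix: it's the same program as above, with only the loop order interchanged so the inner loop walks through adjacent doubles. (time_function and foo are the same NetRun harness used in the example above; the exact speedup you see will depend on your machine.)
enum {sx=512};
enum {sy=512};
double img[sx*sy];
int inc_img(void) {
  for (int y=0;y<sy;y++)     /* outer loop picks a row */
    for (int x=0;x<sx;x++)   /* inner loop walks adjacent doubles--sequential access */
      img[x+y*sy]++;
  return 0;
}
int foo(void) {
  double t=time_function(inc_img);
  printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
  return 0;
}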
Floating-Point Numbers
Floating-point numbers nowadays have roughly the same performance as
integers:
addition takes a little over a nanosecond (slightly slower than integer
addition); multiplication takes a few nanoseconds; and division takes a
dozen or more nanoseconds. That is, floats are now cheap, and you
can consider using floats for all sorts of stuff--even when you don't
care about fractions.
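As a rough way to check these costs yourself, here's a small timing sketch in the same NetRun style as the examples above (time_function and foo are the NetRun harness; the loop count of 1000 is arbitrary). Each loop is a chain of dependent operations, so it measures latency rather than throughput.
double a=1.0, b=1.000001;
int float_add(void) {
  for (int i=0;i<1000;i++) a+=b; /* dependent additions */
  return 0;
}
int float_div(void) {
  for (int i=0;i<1000;i++) a/=b; /* dependent divisions */
  return 0;
}
int foo(void) {
  printf("add: %.3f ns/op\n",time_function(float_add)/1000*1.0e9);
  printf("div: %.3f ns/op\n",time_function(float_div)/1000*1.0e9);
  return 0;
}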
The sizes of the floating-point types are:
C Datatype  | Size                       | Approx. Precision | Approx. Range
float       | 4 bytes (everywhere)       | 1.0x10^-7         | 10^38
double      | 8 bytes (everywhere)       | 2.0x10^-15        | 10^308
long double | 12-16 bytes (if it exists) | 2.0x10^-20        | 10^4932
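Here's a small sketch (my addition, not from the original notes) that prints sizeof() for each type along with the <cfloat> epsilon constants, which correspond roughly to the precision column above. foo() is the NetRun entry-point convention used in the other examples; outside NetRun you'd call it from main(). The long double size in particular varies by compiler and platform.
#include <cstdio>
#include <cfloat>
int foo(void) {
  printf("float: %d bytes, epsilon=%g\n",(int)sizeof(float),FLT_EPSILON);
  printf("double: %d bytes, epsilon=%g\n",(int)sizeof(double),DBL_EPSILON);
  printf("long double: %d bytes, epsilon=%Lg\n",(int)sizeof(long double),LDBL_EPSILON);
  return 0;
}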
A "float" as as defined by IEEE Standard 754 consists of three bitfields:
Field                  | Size    | Meaning
Sign                   | 1 bit   | 0 for positive, 1 for negative
Exponent               | 8 bits  | 127 means 2^0, 137 means 2^10
Mantissa (or Fraction) | 23 bits | a binary fraction
The hardware interprets a float as having the value:
    value = (-1)^sign * 2^(exponent-127) * 1.fraction
Note that the mantissa has an implicit leading 1 applied (in
normal cases). There are other special interpretations if
the exponent is 0 (for 0 or denormalized numbers) or 255 (for
infinities and NaNs).
For example, the value "8" would be stored with sign bit 0, exponent
130 (==3+127), and mantissa 000... (without the leading 1), since:
    8 = (-1)^0 * 2^(130-127) * 1.0000...
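As a quick check of that claim, here's a tiny sketch (not from the original notes) that prints the raw bits of the float 8.0 and pulls out the three fields; it should show the pattern 0x41000000, with sign 0, exponent 130, and mantissa 0.
#include <cstdio>
#include <cstring>
int foo(void) {
  float f=8.0f;
  unsigned int v;
  memcpy(&v,&f,sizeof(v)); /* copy the float's bit pattern into an integer */
  printf("bits of 8.0f = 0x%08X\n",v);
  printf("sign=%u exponent=%u mantissa=0x%06X\n",
    v>>31,(v>>23)&0xff,v&0x7fffff);
  return 0;
}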
If we want to take the log-base-2 of a number, we can actually do it by
just looking at the exponent bits inside the "float" representation of
the number. Here's the code:
(Executable NetRun Link)
void print_log2(int i) {
  float f=i; /* turn number into a float */
  int v=*(int *)&f; /* look at bits of float representation */
  int exponent=(v>>23)&0xff; /* extract IEEE exponent field */
  exponent-=127; /* remove exponent bias */
  printf(" i=%d float bits=0x%08X exp=%d\n",
    i,v,exponent);
}
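A quick usage sketch (my addition): calling print_log2 on a few small positive values shows the extracted exponent is the integer part of log-base-2.
int foo(void) {
  print_log2(1);    /* exp=0 */
  print_log2(8);    /* exp=3 */
  print_log2(1000); /* exp=9, since 512 <= 1000 < 1024 */
  return 0;
}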
This is several times faster than the equivalent "(int)(log(i)/log(2.0))", and it's totally portable.
See page 80 of the textbook for piles of examples and a more detailed definition.