Optimization: Loop Interchange & Floating-Point
CS 301 Lecture, Dr. Lawlor, 2005/11/14
Memory accesses have performance that's:
- Good *if* you're accessing a small buffer, because the buffer will be copied to fast cache memory.
- Good *if* you're accessing sequentially, because then the memory chips don't have to change rows.
- Terrible *if* you're accessing a big buffer (too big to fit in cache) in a nonsequential way.
So if you're getting bad performance, you can either:
- Make things smaller (to try to fit in cache)
- Make your accesses more regular (to get sequential access)
For example, this program has terrible performance-- 75ns/pixel:
(Executable NetRun Link)
enum {sx=512};
enum {sy=512};
double img[sx*sy];
int inc_img(void) {
  for (int x=0;x<sx;x++)
    for (int y=0;y<sy;y++)
      img[x+y*sy]++; /* inner loop jumps by sy doubles--nonsequential access */
  return 0;
}
int foo(void) {
  double t=time_function(inc_img);
  printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
  return 0;
}
We can speed the program up by either:
- Making "img" smaller, by changing it to a "char" (1 byte) instead of "double" (8 bytes). This speeds the program up to 6ns/pixel (try it!).
- Making the access more regular, by either interchanging the x and y loops, or by changing how the array is indexed (so it's "img[x*sy+y]"). This speeds the program up to 4ns/pixel (see the sketch after this list).
- Doing both--small data with regular access. This speeds the program up to about 1ns/pixel--a 70x improvement!
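Here's a minimal sketch of the "regular access" fix: it's the same program as above, with only the loop order interchanged so the inner loop walks through adjacent doubles. (time_function and foo are the same NetRun harness used in the example above; the exact speedup you see will depend on your machine.)
enum {sx=512};
enum {sy=512};
double img[sx*sy];
int inc_img(void) {
  for (int y=0;y<sy;y++)     /* outer loop picks a row */
    for (int x=0;x<sx;x++)   /* inner loop walks adjacent doubles--sequential access */
      img[x+y*sy]++;
  return 0;
}
int foo(void) {
  double t=time_function(inc_img);
  printf("%.3f ns/pixel\n",t/(sx*sy)*1.0e9);
  return 0;
}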
Floating-Point Numbers
Floating-point numbers nowadays have roughly the same performance as
integers:
addition takes a little over a nanosecond (slightly slower than integer
addition); multiplication takes a few nanoseconds; and division takes a
dozen or more nanoseconds. That is, floats are now cheap, and you
can consider using floats for all sorts of stuff--even when you don't
care about fractions.
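As a rough way to check these costs yourself, here's a small timing sketch in the same NetRun style as the examples above (time_function and foo are the NetRun harness; the loop count of 1000 is arbitrary). Each loop is a chain of dependent operations, so it measures latency rather than throughput.
double a=1.0, b=1.000001;
int float_add(void) {
  for (int i=0;i<1000;i++) a+=b; /* dependent additions */
  return 0;
}
int float_div(void) {
  for (int i=0;i<1000;i++) a/=b; /* dependent divisions */
  return 0;
}
int foo(void) {
  printf("add: %.3f ns/op\n",time_function(float_add)/1000*1.0e9);
  printf("div: %.3f ns/op\n",time_function(float_div)/1000*1.0e9);
  return 0;
}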
The sizes of the floating-point types are:
C Datatype  | Size                       | Approx. Precision | Approx. Range
float       | 4 bytes (everywhere)       | 1.0x10^-7         | 10^38
double      | 8 bytes (everywhere)       | 2.0x10^-15        | 10^308
long double | 12-16 bytes (if it exists) | 2.0x10^-20        | 10^4932
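Here's a small sketch (my addition, not from the original notes) that prints sizeof() for each type along with the <cfloat> epsilon constants, which correspond roughly to the precision column above. foo() is the NetRun entry-point convention used in the other examples; outside NetRun you'd call it from main(). The long double size in particular varies by compiler and platform.
#include <cstdio>
#include <cfloat>
int foo(void) {
  printf("float: %d bytes, epsilon=%g\n",(int)sizeof(float),FLT_EPSILON);
  printf("double: %d bytes, epsilon=%g\n",(int)sizeof(double),DBL_EPSILON);
  printf("long double: %d bytes, epsilon=%Lg\n",(int)sizeof(long double),LDBL_EPSILON);
  return 0;
}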
A "float" as as defined by IEEE Standard 754 consists of three bitfields:
Field                  | Size    | Meaning
Sign                   | 1 bit   | 0 for positive, 1 for negative
Exponent               | 8 bits  | 127 means 2^0, 137 means 2^10
Mantissa (or Fraction) | 23 bits | a binary fraction
The hardware interprets a float as having the value:
    value = (-1)^sign * 2^(exponent-127) * 1.fraction
Note that the mantissa has an implicit leading 1 applied (in
normal cases). There are other special interpretations if
the exponent is 0 (for 0 or denormalized numbers) or 255 (for
infinities and NaNs).
For example, the value "8" would be stored with sign bit 0, exponent
130 (==3+127), and mantissa 000... (without the leading 1), since:
    8 = (-1)^0 * 2^(130-127) * 1.0000...
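As a quick check of that claim, here's a tiny sketch (not from the original notes) that prints the raw bits of the float 8.0 and pulls out the three fields; it should show the pattern 0x41000000, with sign 0, exponent 130, and mantissa 0.
#include <cstdio>
#include <cstring>
int foo(void) {
  float f=8.0f;
  unsigned int v;
  memcpy(&v,&f,sizeof(v)); /* copy the float's bit pattern into an integer */
  printf("bits of 8.0f = 0x%08X\n",v);
  printf("sign=%u exponent=%u mantissa=0x%06X\n",
    v>>31,(v>>23)&0xff,v&0x7fffff);
  return 0;
}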
If we want to take the log-base-2 of a number, we can actually do it by
just looking at the exponent bits inside the "float" representation of
the number. Here's the code:
(Executable NetRun Link)
void print_log2(int i) {
  float f=i; /* turn number into a float */
  int v=*(int *)&f; /* look at bits of float representation */
  int exponent=(v>>23)&0xff; /* extract IEEE exponent field */
  exponent-=127; /* remove exponent bias */
  printf(" i=%d float bits=0x%08X exp=%d\n",
    i,v,exponent);
}
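A quick usage sketch (my addition): calling print_log2 on a few small positive values shows the extracted exponent is the integer part of log-base-2.
int foo(void) {
  print_log2(1);    /* exp=0 */
  print_log2(8);    /* exp=3 */
  print_log2(1000); /* exp=9, since 512 <= 1000 < 1024 */
  return 0;
}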
This is several times faster than the equivalent "(int)(log(i)/log(2.0))", and it's totally portable.
See page 80 of the textbook for piles of examples and a more detailed definition.