Optimization: Mathematical Optimizations & Floating-Point
CS 301 Lecture, Dr. Lawlor, 2005/11/16
Replace Expensive Calls with Cheap Calls
Say you've got a bit of code like this:
#include <cmath>  /* for pow() */

/* A fourth-degree polynomial:
     a + b x + c x^2 + d x^3 + e x^4
*/
class polynomial_4 {
public:
  double a,b,c,d,e;
  polynomial_4() :a(0.2),b(0.3),c(0.4),d(0.5),e(0.1) {}
  double evaluate(double x) {
    return a+
      b*x+
      c*pow(x,2)+
      d*pow(x,3)+
      e*pow(x,4);
  }
};
polynomial_4 poly;
double x=0.4;
int foo(void) {
  return (int)poly.evaluate(x);
}
It evaluates a polynomial. As written, it takes about 230ns per call,
which is pretty darn slow.
The first thing to notice is that "pow" is slow--close to a hundred
nanoseconds per call. So we can replace the call to pow with
multiplications so evaluate reads:
    return a+
      b*x+
      c*x*x+
      d*x*x*x+
      e*x*x*x*x;
This brings the time down to 21ns per call--a speedup of over 10x!
Nesting & Multiply-by-Inverse
This looks like a lot of multiplies, so we might be tempted to do "nested polynomial evaluation" (also known as Horner's rule), like this:
return a+x*(b+x*(c+x*(d+x*e)));
This is indeed easier to write, but it's actually slightly slower--the
compiler's already smart enough to optimize away the repeated
multiplications into this form.
The compiler won't do the obvious thing and optimize a divide into a multiply-by-inverse, however:
enum {n=1000};
double vals[n];
double c=0.3;
int foo(void) {
  for (int i=0;i<n;i++)
    vals[i]=i/c;
  return (int)vals[100];
}
This runs in 16ns per element, which is pretty slow for such a simple routine. If we replace the guts with
  double ic=1.0/c;
  for (int i=0;i<n;i++)
    vals[i]=i*ic;
That is, instead of dividing by c, just multiply by 1/c! This now runs in 3ns per element, a 5x speedup.
Why doesn't the compiler do this? Because of
floating-point roundoff, for some numbers dividing by c and multiplying by
1/c are not the same thing!
float f=1000000000;
volatile float f_i=1.0/f; /* the "volatile" keeps the compiler honest */
printf("  f  =%.20f\n",f);
printf("  f/f=%.10f\n",f/f);
printf("  f_i=%.20f\n",f_i);
printf("f*f_i=%.10f\n",f*f_i);
On my machine, this prints out
  f  =1000000000.00000000000000000000
  f/f=1.0000000000
  f_i=0.00000000099999997172
f*f_i=0.9999999717
Because dividing and multiplying by the inverse aren't exactly the
same, the compiler won't do it (unless, that is, you explicitly pass a
"quick and dirty math" flag, like gcc's "-ffast-math"). But can
you get away with it? Often, you certainly can--your input
floating point numbers are usually inherently inexact anyway, so adding
a small additional error doesn't affect the final results much.
Mathematical Transformations for Fun and Profit
Occasionally the compiler and runtime system will perform a rather
bizarre transformation to avoid numerical problems. For example,
the STL "complex" class represents complex numbers. A complex
number's magnitude, or "absolute value" is just the square root of the
sum of the squares of the real and imaginary components (from the
Pythagorean theorem), so the obvious implementation of complex "abs" is:
double abs(const complex<double> &z) {
  double x = z.real();
  double y = z.imag();
  return sqrt(x*x+y*y);
}
But the implementation in /usr/include/g++/complex is this:
double abs(const complex<double>& z)
{
  double x = z.real();
  double y = z.imag();
  const double s = max(abs(x), abs(y));
  if (s == 0) // well ...
    return 0;
  x /= s;
  y /= s;
  return s * sqrt(x * x + y * y);
}
Why the madness? Well, say x=z.real() was really really
big--like 10 to the 200th power, which (barely!) fits inside a
"double". Then x*x would be even bigger--like 10 to the 400th
power, which no longer fits inside a "double"! The multiply
overflows to infinity, so the obvious version returns infinity even
though the true magnitude fits in a "double" just fine. Instead the
standard library does the weird scaling by "s", which costs two
divides and a multiply (ouch!). The advantage is that the standard
"abs" works for numbers of *any* size, even really close to the
representable limit.
Of course, in reality, most programs are guaranteed to work with input
much much smaller than the representable limit, so you can just use the
obvious "abs" above. If you're comparing the result of the "abs"
with something (e.g., to check the difference between two vectors),
there's another trick--
if (sqrt(x*x+y*y)>bar) ...
is exactly equivalent (by squaring both sides, which is OK as long as both sides are non-negative--the square root always is, and a distance threshold normally is too) to
if (x*x+y*y>bar*bar) ...
but this version avoids the expensive square root.