Optimization: Mathematical Optimizations & Floating-Point

CS 301 Lecture, Dr. Lawlor, 2005/11/16

Replace Expensive Calls with Cheap Calls

Say you've got a bit of code like this:
(Executable NetRun Link)
/* A fourth-degree polynomial:
   a + b x + c x^2 + d x^3 + e x^4
*/
class polynomial_4 {
public:
	double a,b,c,d,e;
	polynomial_4() :a(0.2),b(0.3),c(0.4),d(0.5),e(0.1) {}
	double evaluate(double x) {
		return a+
		       b*x+
		       c*pow(x,2)+
		       d*pow(x,3)+
		       e*pow(x,4);
	}
};

polynomial_4 poly;
double x=0.4;

int foo(void) {
	return (int)poly.evaluate(x);
}


It evaluates a polynomial. As written, it takes about 230ns per call, which is pretty darn slow.

The first thing to notice is that "pow" is slow--close to a hundred nanoseconds per call.  So we can replace the calls to pow with plain multiplications, so that evaluate reads:
		return a+
		       b*x+
		       c*x*x+
		       d*x*x*x+
		       e*x*x*x*x;
This brings the time down to 21ns per call--a speedup of over 10x! 

Nesting & Multiply-by-Inverse

This looks like a lot of multiplies, so we might be tempted to do "nested polynomial evaluation", like this:
		return a+x*(b+x*(c+x*(d+x*e)));
This is indeed easier to write, but it's actually slightly slower--the compiler's already smart enough to optimize away the repeated multiplications into this form.
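This nested form is the classic "Horner's rule", and it generalizes to polynomials of any degree.  As a rough sketch (the function and coefficient array below are made up for illustration, not part of the lecture code), with coefficients stored lowest-degree first it becomes a simple loop:

/* Evaluate coef[0] + coef[1]*x + ... + coef[n-1]*x^(n-1)
   in nested (Horner) form.  Assumes n>=1.
   Example: double c[3]={1,2,3};  evaluate_poly(c,3,x) gives 1+2x+3x^2. */
double evaluate_poly(const double *coef,int n,double x) {
	double sum=coef[n-1];
	for (int i=n-2;i>=0;i--)
		sum=coef[i]+x*sum; /* one add and one multiply per coefficient */
	return sum;
}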

The compiler won't do the obvious thing and optimize a divide into a multiply-by-inverse, however:
(Executable NetRun Link)
enum {n=1000};
double vals[n];
double c=0.3;

int foo(void) {
	for (int i=0;i<n;i++)
		vals[i]=i/c;
	return (int)vals[100];
}
This runs in 16ns per element, which is pretty slow for such a simple routine.  If we replace the guts with
	double ic=1.0/c;
	for (int i=0;i<n;i++)
		vals[i]=i*ic;
That is, instead of dividing by c, just multiply by 1/c!  This now runs in 3ns per element, a 5x speedup.

Why doesn't the compiler do this?  Because of floating-point roundoff, dividing and multiplying by the inverse are not exactly the same thing for some numbers!
(Executable NetRun Link)
float f=1000000000;
volatile float f_i=1.0/f; /* the "volatile" keeps the compiler honest */
printf("f    =%.20f\n",f);
printf("f/f  =%.10f\n",f/f);
printf("f_i  =%.20f\n",f_i);
printf("f*f_i=%.10f\n",f*f_i);
On my machine, this prints out
f    =1000000000.00000000000000000000
f/f  =1.0000000000
f_i  =0.00000000099999997172
f*f_i=0.9999999717
Because dividing and multiplying by the inverse aren't exactly the same, the compiler won't do it (unless, that is, you explicitly pass a "quick and dirty math" flag, like gcc's "-ffast-math").  But can you get away with it?  Often, you certainly can--your input floating point numbers are usually inherently inexact anyway, so adding a small additional error doesn't affect the final results much.
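If you're curious how often the two actually disagree, a quick standalone test like the one below (not from the lecture; the exact count it prints depends on your compiler and machine) compares divide and multiply-by-inverse over the same values as the loop above:

#include <cstdio>
int main(void) {
	double c=0.3, ic=1.0/c;
	int differ=0;
	for (int i=0;i<1000;i++)
		if (i/c != i*ic) differ++; /* count values where the two results aren't bit-identical */
	printf("%d of 1000 values differ\n",differ);
	return 0;
}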

Mathematical Transformations for Fun and Profit

Occasionally the compiler and runtime system will perform a rather bizarre transformation to avoid numerical problems.  For example, the STL "complex" class represents complex numbers.  A complex number's magnitude, or "absolute value", is just the square root of the sum of the squares of the real and imaginary components (from the Pythagorean theorem), so the obvious implementation of complex "abs" is:
double abs(const complex<double> &z) {
	double x = z.real();
	double y = z.imag();
	return sqrt(x*x+y*y);
}
But the implementation in /usr/include/g++/complex is this:
double abs(const complex<double>& z)
{
	double x = z.real();
	double y = z.imag();
	const double s = max(abs(x), abs(y));
	if (s == 0) // well ...
		return 0;
	x /= s;
	y /= s;
	return s * sqrt(x * x + y * y);
}
Why the madness?  Well, say x=z.real() was really really big--like 10 to the 200th power, which fits inside a "double" just fine.  Then x*x would be even bigger--like 10 to the 400th power, which no longer fits inside a "double"!  The overflowing multiply produces the special value "inf" (floating-point infinity), so the obvious version returns "inf" instead of the right answer.  To avoid this, the standard library does the weird scaling by "s", which costs two divides and a multiply (ouch!).  The advantage is that the standard "abs" works for numbers of *any* size, even really close to the representable limit.
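You can watch the overflow happen with a tiny standalone test like this one (the value 1e200 is just picked for illustration; it's not from the lecture):

#include <cstdio>
#include <cmath>
int main(void) {
	double x=1e200, y=1e200;
	printf("naive:  %g\n", sqrt(x*x+y*y)); /* x*x overflows to inf, so this prints inf */
	double s=x; /* scale by the larger magnitude, like the library version (here x and y are equal) */
	printf("scaled: %g\n", s*sqrt((x/s)*(x/s)+(y/s)*(y/s))); /* prints about 1.41421e+200 */
	return 0;
}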

Of course, in reality, most programs are guaranteed to work with input much much smaller than the representable limit, so you can just use the obvious "abs" above.  If you're comparing the result of the "abs" with something (e.g., to check the difference between two vectors), there's another trick--
	if (sqrt(x*x+y*y)>bar) ...
is exactly equivalent (by squaring both sides, which is OK as long as bar is non-negative) to
	if (x*x+y*y>bar*bar) ...
but this version avoids the expensive square root.
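As a sketch, a "longer than bar?" test written without the square root might look like this (the function name is made up for illustration):

/* True if the vector (x,y) is longer than bar (bar assumed non-negative).
   Compares squared lengths, so no sqrt call is needed. */
bool longer_than(double x,double y,double bar) {
	return x*x+y*y > bar*bar;
}

You'd use it just like the sqrt version, e.g. if (longer_than(dx,dy,tolerance)) ... to check whether two vectors differ by more than a tolerance.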