Measuring Performance via the Scientific Method

CS 301 Lecture, Dr. Lawlor

When You Care about Performance

There are times when you don't care how fast your code runs.  Generally, a human being can't tell the difference between "this code responds in under 1 millisecond" and "this code responds in under 1 microsecond"; but a human *can* tell the difference between "here's an Excel spreadsheet and some perl scripts that will do the job" and "I've got a very fast C++ version that's almost working, just give me another week or two".  In other words, it probably doesn't matter if the once-a-week payroll processing takes two seconds or two hours to complete, as long as it's automated and it works.
But there are also times when you do care how fast your code runs.  For example, if you're writing "real-time" control code, such as for an autonomous robotic car's braking system, then your code *can't* take two minutes to figure out that it's time to hit the brakes--the car will have already crashed by that point!  If you're writing a web server, your code *can't* take two minutes to respond to a web request--the person sitting in front of the web browser will get bored and leave.  If you're writing the physics simulation or rendering code for a game, and your code takes even 1/10 of a second to respond to events, the game will feel noticeably "laggy"; if it takes a second or more, it's not a game anymore, it's PowerPoint!

Performance-Increasing Algorithm

Here's my general-purpose algorithm for increasing code performance:
  1. Figure out what's slow.
  2. Speed up the slow parts.
Part 1 is tricky, so people usually skip directly to part 2 with some random guess as to what's slow.  Occasionally, through sheer random chance, they'll get lucky and speed up the slow part of the code.  Usually, folks waste huge amounts of effort trying to speed up the wrong thing, get near-zero performance improvement, and give up.

For example, say a program consists of three parts: A, B, and C.  Part B takes an hour.  A and C take ten seconds each.  The program overall thus takes one hour twenty seconds.  If you spend a year working on A, and make it 10x faster, you've saved nine seconds and the program overall will take one hour eleven seconds--who cares!   If you spend an hour working on B, and make it 10% faster, you've saved six minutes, and the program now takes about 54 minutes.  So speed up the slow parts first!


Your basic tool for figuring out what's slow is a timer.  This is just a function that returns the current real time ("wall-clock" time).   NetRun has a builtin function called "time_in_seconds", which returns a double giving the number of seconds.  Here's the implementation (from project/lib/inc.c):
/** Return the current time in seconds (since, er, something or other). */
#if defined(WIN32)
#  include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
	struct _timeb t;
	_ftime(&t);
	return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
#  include <sys/time.h> /* For gettimeofday time implementation */
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
	struct timeval tv;
	gettimeofday(&tv,0);
	return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif
(As usual, there's one version for Windows, and a different version for everything else.)

The usual way to use a timer like this is to call the timer before your code, call it after your code, and subtract the two times to get the "elapsed" time:
#include <iostream>
#include <fstream>
int foo(void) {
	double t_before=time_in_seconds();
	std::ofstream f("foo.dat"); /* code to be timed goes here */
	double t_after=time_in_seconds();
	double elapsed=t_after - t_before;
	std::cout<<"That took "<<elapsed<<" seconds\n";
	return 0;
}
(executable NetRun link)

Performance Experiments & the Scientific Method

It really pays to be completely scientific when you're doing performance experiments, because performance is one of those messy and imperfect phenomena that are tough to measure.  One key trick is the "controlled experiment", where we compare some new code against a known "control" value.  The first experiment to run is a "double control", where both test and control values are identical!
#include <iostream>
int test_beaker(void) {
	return 0;
}
int control_beaker(void) {
	return 0;
}
typedef int (*beaker_fn)(void);
double time_beaker(beaker_fn b) {
	double start=time_in_seconds();
	b(); /* run the beaker */
	double end=time_in_seconds();
	return (end-start);
}
int foo(void) {
	std::cout<<"Test beaker: "<<time_beaker(test_beaker)*1.0e9<<" ns\n";
	std::cout<<"Control beaker: "<<time_beaker(control_beaker)*1.0e9<<" ns\n";
	return 0;
}

(Try this in NetRun now!)

You can immediately see why this is useful when you run it.  I get:
Test beaker: 3073.83 ns
Control beaker: 1022.71 ns
Program complete. Return 0 (0x0)
Well, that doesn't make sense!  The first function is substantially slower than the second function, even though both are identical!  If you flip the order the functions are *defined*, you still find the test beaker is slower. 

But if you flip the order the functions are *called*, you find the first function to be called is slower.  This is a real effect--the first time a function is called, that memory might have to get loaded into your process by the operating system, the cache might not contain all the values, etc.  So to really measure properly, you should repeat your experiments several times:
int foo(void) {
	for (int repeat=0;repeat<3;repeat++) {
		std::cout<<"Test beaker: "<<time_beaker(test_beaker)*1.0e9<<" ns\n";
		std::cout<<"Control beaker: "<<time_beaker(control_beaker)*1.0e9<<" ns\n";
	}
	return 0;
}

(Try this in NetRun now!)

Repeating the experiment eliminates the "first time penalty", but our times still jump around quite substantially.   We can get substantially smoother results by repeating our measurement inside the timer loop, then dividing out that number of repetitions:
double time_beaker(beaker_fn b) {
	int n=10000;
	double start=time_in_seconds();
	for (int i=0;i<n;i++) b(); /* run the beaker many times */
	double end=time_in_seconds();
	return (end-start)/n;
}

(Try this in NetRun now!)

Timer Problems & Fixes

The problems with timers in general are that:

  1. Quantization: time_in_seconds returns seconds, but it's only accurate to microseconds (not nanoseconds!)
  2. Overhead: time_in_seconds takes about 800ns to return (which is considered pretty fast!)
  3. Random noise: your code might get interrupted by the OS, other processes, hardware events, etc.
Problems 1 and 2 can be cured by running many iterations--"scaling up" many copies of the problem until it registers on your (expensive, quantized) timer.  So instead of doing this, which incorrectly claims the very fast operation "x+=y" takes 700ns:
int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

You'd do this, which shows the real time, 0.5ns!
int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

Problem 3 (random noise) can be cured by running the above several times, and throwing out the anomalously high values (which were probably interrupted).  NetRun includes a builtin function called "time_function" that attempts to cure all three problems: it runs the function many times (to amortize away quantization and timer overhead), and returns the median time (to ignore unusually fast or slow runs).

/**
  Return the number of seconds this function takes to run.
  May run the function several times (to average out
  timer granularity).
*/
double time_function_onepass(timeable_fn fn)
{
	unsigned int i,count=1;
	double timePer=0;
	for (count=1;count!=0;count*=2) { /* exponential search to figure out how many times to run fn */
		double start, end, elapsed;
		start=time_in_seconds();
		for (i=0;i<count;i++) fn();
		end=time_in_seconds();
		elapsed=end-start;
		timePer=elapsed/count;
		if (elapsed>time_in_seconds_granularity) /* Took long enough */
			return timePer;
	}
	/* whoa-- if we got here, "count" reached integer wraparound before
	   the timer ran long enough.  Return the last known time */
	return timePer;
}

/**
  Return the number of seconds this function takes to run.
  May run the function many times (to average out
  timer granularity, and random OS noise).
*/
double time_function(timeable_fn fn)
{
	enum {ntimes=3}; /* run the code three times */
	double times[ntimes];
	for (int t=0;t<ntimes;t++)
		times[t]=time_function_onepass(fn);
	std::sort(&times[0],&times[ntimes]);
	return times[ntimes/2]; /* return the middle time (the median value) */
}
I highly recommend using time_function and print_time in your code!  You're also welcome to copy these functions off somewhere for your own use.

Speeding Up Primitives

The biggest gains in performance come from taking CS 311, the algorithms class.  I've personally worked hard on speeding up an O(n^2) algorithm, and been happy with a 5x performance increase, until I realized there's an O(n lg(n)) algorithm which instantly gave a 1000x performance increase!   O(n^2) is roughly 50,000 times slower than O(n lg(n)) for n=1 million (assuming the constants are similar).  There's just no way to beat that kind of performance boost!

However, knowing a bit of assembly language can help you speed up your code as well.  This program measures the time for various primitive operations.
#include <cmath>
#include <cstdlib>
#include <fstream>
#include <vector>
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}

int foo(void) {
	print_time("constant",time_function(fn_constant));
	print_time("index",time_function(fn_index));
	/* ...and likewise one print_time call per fn_* function above... */
	print_time("ofstream",time_function(fn_ofstream));
	return 0;
}

(executable NetRun link)

Here's the output of this program on various machines.  Times are reported in nanoseconds, billionths of a second (1 ns = 10^-9 seconds).
x86-32: Intel Pentium 4, 2.8GHz
ia64: Itanium, 800MHz
PowerPC: 604e, 200MHz

Operation (ns/call)    x86-32      ia64   PowerPC
constant                 5.33     45.14    100.11
index                    5.17     48.92    115.05
mov                      5.09     46.42    100.11
add                      5.16     45.16    105.12
sub                      5.19     45.15    105.12
not                      5.12     55.17    105.13
and                      5.16     45.15    105.13
or                       5.11     45.17    105.10
xor                      5.16     45.14    105.12
shl                      5.34     45.14    105.12
shr                      5.12     48.90    105.12
addshr                   5.05     55.18    105.10
mul                      5.13     65.22    105.15
mad                      5.56     66.47    110.05
div                     16.78    133.09    195.22
fmov                     5.14     61.46    155.09
fadd                     8.99     76.61    175.23
fmul                     9.33     68.96    175.11
fmad                    10.92     68.99    175.20
fdiv                    16.52    132.03    320.56
call                     9.42     54.99    165.10
sqrt                    17.28    141.82    781.37
cos                     70.73    156.80    910.89
malloc                  63.86    421.63   1120.97
new                     83.28    441.96   1245.60
vector                 161.34    371.20    617.67
timer                  814.82    990.96   2903.34
ofstream             13558.66  27243.05  27621.16
Here's what we can see generally: simple integer operations are cheap; division, math functions, and memory allocation get progressively more expensive; and file I/O is in a different league entirely.  So you can often replace expensive operations with cheaper ones:
  1. Instead of doing a divide (over 10ns), do an integer bit shift (under 1ns) or a floating-point multiply by the reciprocal (x/y == x*(1.0/y)).
  2. Instead of making a function call, use "inline" to have the compiler insert the machine code into the caller.
  3. Instead of calling "cos", you might eliminate the operation with an algebraic substitution (there are hundreds of useful trig identities), or replace it with a table lookup (since array indexing is very fast).
  4. Instead of calling new/delete every iteration, reuse the allocated buffer and postpone the delete to the end of the program.
  5. And anything you can do to avoid opening and closing files repeatedly will dramatically improve performance!