Measuring Performance

CS 641 Lecture, Dr. Lawlor

When You Care about Performance

There are times when you don't care how fast your code runs.  Generally, a human being can't tell the difference between "this code responds in under 1 millisecond" and "this code responds in under 1 microsecond"; but a human *can* tell the difference between "here's an Excel spreadsheet and some Perl scripts that will do the job" and "I've got a very fast C++ version that's almost working, just give me another week or two".  In other words, it probably doesn't matter if the once-a-week payroll processing takes two seconds or two hours to complete, as long as it's automated and it works.
But there are also times when you do care how fast your code runs.  For example, if you're writing "real-time" control code, such as the braking system for an autonomous robotic car, your code *can't* take two minutes to figure out that it's time to hit the brakes--the car will have already crashed by then!  If you're writing a web server, your code *can't* take two minutes to respond to a request--the person sitting in front of the web browser will get bored and leave.  If you're writing the physics simulation or rendering code for a game, and your code takes even 1/10 of a second to respond to events, the game will feel noticeably "laggy"; if it takes a second or more, it's not a game anymore, it's PowerPoint!

Performance-Increasing Algorithm

Here's my general-purpose algorithm for increasing code performance:
  1. Figure out what's slow.
  2. Speed up the slow parts.
Part 1 is tricky, so people usually skip directly to part 2 with some random guess as to what's slow.  Occasionally, through sheer random chance, they'll get lucky and speed up the slow part of the code.  Usually, folks waste huge amounts of effort trying to speed up the wrong thing, get near-zero performance improvement, and give up.

For example, say a program consists of three parts: A, B, and C.  Part B takes an hour.  A and C take ten seconds each.  The program overall thus takes one hour twenty seconds.  If you spend a year working on A, and make it 10x faster, you've saved nine seconds and the program overall will take one hour eleven seconds--who cares!   If you spend an hour working on B, and make it 10% faster, you've saved six minutes, and the program now takes about 54 minutes.  So speed up the slow parts first!
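If you want to check that arithmetic, here's a tiny standalone C++ sketch (not one of the NetRun examples) that just plugs the numbers in:
#include <iostream>

int main(void) {
	double A=10, B=3600, C=10; /* seconds for each part */
	std::cout<<"Original:          "<<(A+B+C)<<" seconds\n";    /* 3620 s = 1 hour 20 s */
	std::cout<<"A made 10x faster: "<<(A/10+B+C)<<" seconds\n"; /* 3611 s = 1 hour 11 s */
	std::cout<<"B made 10% faster: "<<(A+0.9*B+C)<<" seconds\n";/* 3260 s = about 54 minutes */
	return 0;
}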

Figuring Out What's Slow (That is, Performance Analysis)

Your basic tool for figuring out what's slow is a timer.  This is just a function that returns the current real time ("wall-clock" time).   NetRun has a built-in function called "time_in_seconds", which returns the current time as a double, measured in seconds.  Here's the implementation (from project/lib/inc.c):
/** Return the current time in seconds (since something or other). */
#if defined(WIN32)
# include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
	struct _timeb t;
	_ftime(&t);
	return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
# include <sys/time.h> //For gettimeofday time implementation
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
	struct timeval tv;
	gettimeofday(&tv,NULL);
	return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif
(As usual, there's one version for Windows, and a different version for everything else.)
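Outside NetRun, if your compiler supports C++11, a portable alternative with the same interface is std::chrono.  This is just a sketch, not NetRun's actual implementation:
#include <chrono>

/* C++11 version: seconds since an arbitrary epoch, typically with
   sub-microsecond resolution.  (A sketch, not NetRun's code.) */
double time_in_seconds(void) {
	using namespace std::chrono;
	return duration<double>(steady_clock::now().time_since_epoch()).count();
}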

The usual way to use a timer like this is to call the timer before your code, call it after your code, and subtract the two times to get the "elapsed" time:
#include <iostream>
#include <fstream>
int foo(void) {
	double t_before=time_in_seconds();
	std::ofstream f("foo.dat"); /* code to be timed goes here */
	double t_after=time_in_seconds();
	double elapsed=t_after - t_before;
	std::cout<<"That took "<<elapsed<<" seconds\n";
	return 0;
}

(executable NetRun link)

The problems are that:
  1. time_in_seconds returns seconds, but it's only accurate to microseconds (not nanoseconds!)
  2. time_in_seconds takes about 800ns to return (which is considered pretty fast!)
  3. Your code might get interrupted by the OS, other processes, hardware events, etc.
Problems 1 and 2 can be cured by running many iterations--"scaling up" many copies of the problem until it registers on your (noisy, quantized) timer.  So instead of doing this, which incorrectly claims "x+=y" takes 700ns:
int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

You'd do this, which shows the real time of about 0.5ns:
int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

Problem 3 can be cured by running the above several times, and throwing out the anomalously high values (which were probably interrupted).
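For example, here's a sketch of the loop above extended to run several trials and keep the smallest per-iteration time (interruptions only ever make a trial slower, so the minimum is the cleanest measurement):
int x=3, y=2, n=1000000, trials=5;
double best=1.0e30; /* smallest per-iteration time seen so far */
for (int t=0;t<trials;t++) {
	double t1=time_in_seconds();
	for (int i=0;i<n;i++) x+=y;
	double t2=time_in_seconds();
	double per_iter=(t2-t1)/n;
	if (per_iter<best) best=per_iter; /* keep the uninterrupted run */
}
std::cout<<best*1.0e9<<" ns\n";
return x;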

Your other main tool for performance analysis is a "profiler".   This is a library that keeps track of which function is running, and totals up the number of function calls and time spent in each one.  The "Profile" checkbox in NetRun will run a profiler, and show you the top functions it found in your code.

Both timing and profiling have some problems, though; to make simple timing easier, NetRun has a nice little built-in function called print_time that takes a string and a function.  It times the execution of (many iterations of) the function, then divides by the number of iterations to print out exactly how long each call to the function takes.  We'll be using print_time below to measure the speed of various operations.
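Internally, a function like print_time presumably does something like the following sketch (a guess at the general idea, not NetRun's actual source): keep doubling the iteration count until the total run time is long enough to register on the timer, then divide.
#include <iostream>

/* Rough sketch of the idea behind print_time--NOT NetRun's actual source. */
void print_time(const char *name,int (*fn)(void)) {
	int n=1;
	double elapsed=0;
	do { /* double the iteration count until the timer can see it */
		n*=2;
		double t1=time_in_seconds();
		for (int i=0;i<n;i++) fn();
		elapsed=time_in_seconds()-t1;
	} while (elapsed<0.01); /* run for at least 10 milliseconds */
	std::cout<<name<<": "<<elapsed/n*1.0e9<<" ns/call\n";
}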

Speeding It Up

The biggest gains in performance come from taking CS 311, the algorithms class.  I've personally worked hard on speeding up an O(n^2) algorithm, and been happy with a 5x performance increase, until I realized there was an O(n lg n) algorithm that instantly gave a 1000x performance increase!   For n=1 million (and similar constant factors), O(n^2) is roughly 50,000 times slower than O(n lg n).  There's just no way to beat that kind of performance boost!
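As a concrete example (a sketch, not the code from that project), here's the classic trade: finding a duplicate in an array by comparing every pair is O(n^2), while sorting first and checking neighbors is O(n lg n).
#include <algorithm>
#include <vector>

/* O(n^2): compare every pair of elements. */
bool has_duplicate_slow(const std::vector<int> &v) {
	for (unsigned int i=0;i<v.size();i++)
		for (unsigned int j=i+1;j<v.size();j++)
			if (v[i]==v[j]) return true;
	return false;
}

/* O(n lg n): sort a copy, then any duplicates are adjacent. */
bool has_duplicate_fast(std::vector<int> v) {
	std::sort(v.begin(),v.end());
	for (unsigned int i=1;i<v.size();i++)
		if (v[i-1]==v[i]) return true;
	return false;
}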

However, knowing a bit of assembly language can help you speed up your code as well.  Here are a few functions that time various assembly language instructions.
#include <fstream>
#include <vector>
#include <math.h>   /* for sqrt, cos */
#include <stdlib.h> /* for malloc, free */
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}

int foo(void) {
print_time("constant",fn_constant);
print_time("index",fn_index);
print_time("mov",fn_mov);
print_time("add",fn_add);
print_time("sub",fn_sub);
print_time("not",fn_not);
print_time("and",fn_and);
print_time("or",fn_or);
print_time("xor",fn_xor);
print_time("shl",fn_shl);
print_time("shr",fn_shr);
print_time("addshr",fn_addshr);
print_time("mul",fn_mul);
print_time("mad",fn_mad);
print_time("div",fn_div);
print_time("fmov",fn_fmov);
print_time("fadd",fn_fadd);
print_time("fmul",fn_fmul);
print_time("fmad",fn_fmad);
print_time("fdiv",fn_fdiv);
print_time("call",fn_call);
print_time("sqrt",fn_sqrt);
print_time("cos",fn_cos);
print_time("malloc",fn_malloc);
print_time("new",fn_new);
print_time("vector",fn_vector);
print_time("timer",fn_timer);
print_time("ofstream",fn_ofstream);
return 0;
}

(executable NetRun link)

Here's the output of this program on various machines.  Times are reported in nanoseconds, billionths of a second (1 ns = 10^-9 seconds).
All times are in ns/call.

operation    x86-32: Pentium 4, 2.8GHz   ia64: Itanium, 800MHz   PowerPC: 604e, 200MHz
constant            5.33                        45.14                  100.11
index               5.17                        48.92                  115.05
mov                 5.09                        46.42                  100.11
add                 5.16                        45.16                  105.12
sub                 5.19                        45.15                  105.12
not                 5.12                        55.17                  105.13
and                 5.16                        45.15                  105.13
or                  5.11                        45.17                  105.10
xor                 5.16                        45.14                  105.12
shl                 5.34                        45.14                  105.12
shr                 5.12                        48.90                  105.12
addshr              5.05                        55.18                  105.10
mul                 5.13                        65.22                  105.15
mad                 5.56                        66.47                  110.05
div                16.78                       133.09                  195.22
fmov                5.14                        61.46                  155.09
fadd                8.99                        76.61                  175.23
fmul                9.33                        68.96                  175.11
fmad               10.92                        68.99                  175.20
fdiv               16.52                       132.03                  320.56
call                9.42                        54.99                  165.10
sqrt               17.28                       141.82                  781.37
cos                70.73                       156.80                  910.89
malloc             63.86                       421.63                 1120.97
new                83.28                       441.96                 1245.60
vector            161.34                       371.20                  617.67
timer             814.82                       990.96                 2903.34
ofstream        13558.66                     27243.05                27621.16
Here's what we can see generally: you can often replace an expensive operation with a cheaper one.
  1. Instead of doing a divide (10ns), you can do an integer bit shift (<1ns) or a floating-point multiply by the reciprocal (x/y = x*(1.0/y)).
  2. Instead of making a function call, use "inline" to have the compiler insert the machine code into the caller.
  3. Instead of calling "cos", you might eliminate the operation with an algebraic substitution (there are hundreds of useful trig identities), or replace it with a table lookup (since array indexing is very fast).
  4. Instead of calling new/delete every iteration, you might reuse the allocated buffer and postpone the delete to the end of the program.
  5. And anything you can do to avoid opening and closing files repeatedly will dramatically improve performance!
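For instance, here's a sketch of what a couple of those substitutions might look like in C++ (the constants, table size, and helper names here are made up for illustration):
#include <math.h>

/* Divide by a power of two with a shift (same result as x/16 for non-negative x). */
int div_by_16(int x) { return x>>4; }

/* Divide by a float constant by multiplying with a precomputed reciprocal. */
const float inv_y=1.0f/3.4f;
float div_by_y(float x) { return x*inv_y; /* instead of x/3.4f */ }

/* Replace cos() with a small lookup table (hypothetical 256-entry table;
   trades accuracy for speed, shown here for non-negative angles only). */
const double TWO_PI=6.283185307179586;
float cos_table[256];
void build_cos_table(void) {
	for (int i=0;i<256;i++) cos_table[i]=(float)cos(i*(TWO_PI/256));
}
float fast_cos(double radians) {
	int i=((int)(radians*(256/TWO_PI)))&0xff; /* wrap into the table */
	return cos_table[i];
}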