Measuring Performance
CS 641 Lecture, Dr. Lawlor
When You Care about Performance
There are times when you don't
care how fast your code runs.  Generally, a human being can't tell
the difference between "this code responds in under 1 millisecond" and
"this code responds in under 1 microsecond"; but a human *can* tell the
difference between "here's an Excel spreadsheet and some perl scripts
that will do the job" and "I've got a very fast C++ version that's almost
working, just give me another week or two".  In other words, it
probably doesn't matter if the once-a-week payroll processing takes two
seconds or two hours to complete, as long as it's automated and it
works.  Generally speaking, 
- Even if you're sitting there waiting for it, a few seconds or
less is usually "fast enough".  The more rarely you have to do it,
the slower it can be (think of rebooting, burning DVDs, etc.)
 
- If it runs at midnight, as long as it finishes by the next morning, that's usually fast enough.
 
But, there are also times when you do care about how fast your code
runs.  For example, if you're writing "real-time" control code
such as for an autonomous robotic car's
braking system, then your code *can't* take two minutes to figure out
that it's time to hit the brakes--the car will have already crashed by
that point!  If you're writing a web server, your code *can't*
take two minutes to respond to a web server request--the person sitting
in front of the web browser will get bored and leave.  If you're
writing the physics simulation or rendering code for a game, if your
code takes even 1/10 of a second to respond to events, the game will
feel noticably "laggy"; if it takes a second or more, it's not a game
anymore, it's powerpoint!
Performance-Increasing Algorithm
Here's my general-purpose algorithm for increasing code performance:
- Figure out what's slow.
- Speed up the slow parts.
Part 1 is tricky, so people usually skip directly to part 2 with some
random guess as to what's slow.  Occasionally, through sheer
random chance, they'll get lucky and speed up the slow part of the
code.  Usually, folks waste huge amounts of effort trying to speed
up the wrong thing, get near-zero performance improvement, and give up.
For example, say a program consists of three parts: A, B, and C. 
Part B takes an hour.  A and C take ten seconds each.  The
program overall thus takes one hour twenty seconds.  If you spend
a year working on A,
and make it 10x faster, you've saved nine seconds and the program
overall will take one hour eleven seconds--who cares!   If you
spend a hour working on B, and make it 10% faster, you've saved six
minutes, and the program now takes 54 minutes.  So speed up the
slow parts first!
Figuring Out What's Slow (That is, Performance Analysis)
Your basic tool for figuring out what's slow is a timer.  This is
just a function that returns the current real time ("wall-clock"
time).   NetRun has a builtin function called
"time_in_seconds", which returns a double giving the number of
seconds.  Here's the implementation (from project/lib/inc.c):
/** Return the current time in seconds (since something or other). */
#if defined(WIN32)
#  include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
        struct _timeb t;
        _ftime(&t);
        return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
#  include <sys/time.h> //For gettimeofday time implementation
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
        struct timeval tv;
        gettimeofday(&tv,NULL);
        return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif
(As usual, there's one version for Windows, and a different version for everything else.)
The usual way to use a timer like this is to call the timer before your
code, call it after your code, and subtract the two times to get the
"elapsed" time:
#include <fstream>
int foo(void) {
	double t_before=time_in_seconds();
		std::ofstream f("foo.dat");  /* code to be timed goes here */
	double t_after=time_in_seconds();
	double elapsed=t_after - t_before;
	std::cout<<"That took "<<elapsed<<" seconds\n";
	return 0;
}
(executable NetRun link)
The problems are that:
  - time_in_seconds returns seconds, but it's only accurate to microseconds (not nanoseconds!)
- time_in_seconds takes about 800ns to return (which is considered pretty fast!)
- Your code might get interrupted by the OS, other processes, hardware events, etc
Problems 1 and 2 can be cured by running many iterations--"scaling up"
many copies of the problem until it registers on your (noisy,
quantized) timer.  So instead of doing this, which incorrectly
claims "x+=y" takes 700ns:
int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;
(Try this in NetRun now!)
You'd do this, which shows the real time, 0.5ns!
int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;
(Try this in NetRun now!)
Problem 3 can be cured by running the above several times, and
throwing out the anomalously high values (which were probably
interrupted).
Your other main tool for performance analysis is a
"profiler".   This is a library that keeps track of which
function is running, and totals up the number of function calls and
time spent in each one.  The "Profile" checkbox in NetRun will run
a profiler, and show you the top functions it found in your code.
Both timing and profiling have some problems, though:
- Timing and profiling both take time themselves, which skews your
measurement of the code.  This is often amusingly refered to as a
Heisenberg effect--stuff behaves differently when you're watching it!
- Timing and profiling both fail on very short pieces of
code.  Sometimes you'll have to repeat the code many times for it
to take a measureable length of time. 
 
NetRun has a nice little built-in function called print_time that takes
a string and a function.  It times the execution of (many
iterations of) the function, then divides by the number of iterations
to print out exactly how long each call to the function takes. 
We'll be using print_time below to measure the speed of various
operations.
Speeding It Up
The biggest gains in performance come from taking CS 311, the
algorithms class.  I've actually personally worked hard on
speeding up O(n^2) algorithms, and been happy with a 5x performance
increase, until I realized there's an O(n lg(n)) algorithm which just
instantly gave a 1000x performance increase!   O(n^2) is literally
30,000 times slower than O(n lg(n)) for n=1 million (and similar
constants).  There's just no way to beat that kind of performance
boost!
However, knowing a bit of assembly language can help you speed up your
code as well.  Here are a few functions that time various assembly
language instructions.
#include <fstream>
int a=10,b=11;
double x=2.3, y=3.4;
int array[100];
int fn_constant(void) {return 0;}
int fn_mov(void) {return a;}
int fn_index(void) {return array[a];}
int fn_add(void) {return a+b;}
int fn_sub(void) {return a-b;}
int fn_not(void) {return ~a;}
int fn_and(void) {return a&b;}
int fn_or (void) {return a|b;}
int fn_xor(void) {return a^b;}
int fn_shl(void) {return a<<b;}
int fn_shr(void) {return a>>b;}
int fn_addshr(void) {return (a+b)>>b;}
int fn_mul(void) {return a*b;}
int fn_mad(void) {return a*b+a;}
int fn_div(void) {return a/b;}
int fn_call(void) {return fn_constant();}
int fn_fmov(void) {return (int)(x);}
int fn_fadd(void) {return (int)(x+y);}
int fn_fmul(void) {return (int)(x*y);}
int fn_fmad(void) {return (int)(x*y+x);}
int fn_fdiv(void) {return (int)(x/y);}
int fn_sqrt(void) {return (int)sqrt(x);}
int fn_cos(void) {return (int)cos(x);}
int fn_malloc(void) {free(malloc(4)); return 0;}
int fn_new(void) {delete new int; return 0;}
int fn_vector(void) {std::vector<int> v(a,3); return 0;}
int fn_timer(void) {return (int)time_in_seconds();}
int fn_ofstream(void) {std::ofstream f("foo.dat"); return 0;}
int foo(void) {
	print_time("constant",fn_constant);
	print_time("index",fn_index);
	print_time("mov",fn_mov);
	print_time("add",fn_add);
	print_time("sub",fn_sub);
	print_time("not",fn_not);
	print_time("and",fn_and);
	print_time("or",fn_or);
	print_time("xor",fn_xor);
	print_time("shl",fn_shl);
	print_time("shr",fn_shr);
	print_time("addshr",fn_addshr);
	print_time("mul",fn_mul);
	print_time("mad",fn_mad);
	print_time("div",fn_div);
	print_time("fmov",fn_fmov);
	print_time("fadd",fn_fadd);
	print_time("fmul",fn_fmul);
	print_time("fmad",fn_fmad);
	print_time("fdiv",fn_fdiv);
	print_time("call",fn_call);
	print_time("sqrt",fn_sqrt);
	print_time("cos",fn_cos);
	print_time("malloc",fn_malloc);
	print_time("new",fn_new);
	print_time("vector",fn_vector);
	print_time("timer",fn_timer);
	print_time("ofstream",fn_ofstream);
	return 0;
}
(executable NetRun link)
Here's the output of this program on various machines.  Times are
reported in nanoseconds, billionths of a second (1 ns = 10-9 seconds).
  
    
      | x86-32: Intel Pentium 4, 2.8GHz 
 | ia64: Itanium, 800MHz 
 | PowerPC: 604e, 200MHz 
 | 
    
      | constant: 5.33 ns/callindex: 5.17 ns/call
 mov: 5.09 ns/call
 add: 5.16 ns/call
 sub: 5.19 ns/call
 not: 5.12 ns/call
 and: 5.16 ns/call
 or: 5.11 ns/call
 xor: 5.16 ns/call
 shl: 5.34 ns/call
 shr: 5.12 ns/call
 addshr: 5.05 ns/call
 mul: 5.13 ns/call
 mad: 5.56 ns/call
 div: 16.78 ns/call
 fmov: 5.14 ns/call
 fadd: 8.99 ns/call
 fmul: 9.33 ns/call
 fmad: 10.92 ns/call
 fdiv: 16.52 ns/call
 call: 9.42 ns/call
 sqrt: 17.28 ns/call
 cos: 70.73 ns/call
 malloc: 63.86 ns/call
 new: 83.28 ns/call
 vector: 161.34 ns/call
 timer: 814.82 ns/call
 ofstream: 13558.66 ns/call
 | constant: 45.14 ns/callindex: 48.92 ns/call
 mov: 46.42 ns/call
 add: 45.16 ns/call
 sub: 45.15 ns/call
 not: 55.17 ns/call
 and: 45.15 ns/call
 or: 45.17 ns/call
 xor: 45.14 ns/call
 shl: 45.14 ns/call
 shr: 48.90 ns/call
 addshr: 55.18 ns/call
 mul: 65.22 ns/call
 mad: 66.47 ns/call
 div: 133.09 ns/call
 fmov: 61.46 ns/call
 fadd: 76.61 ns/call
 fmul: 68.96 ns/call
 fmad: 68.99 ns/call
 fdiv: 132.03 ns/call
 call: 54.99 ns/call
 sqrt: 141.82 ns/call
 cos: 156.80 ns/call
 malloc: 421.63 ns/call
 new: 441.96 ns/call
 vector: 371.20 ns/call
 timer: 990.96 ns/call
 ofstream: 27243.05 ns/call
 | constant: 100.11 ns/callindex: 115.05 ns/call
 mov: 100.11 ns/call
 add: 105.12 ns/call
 sub: 105.12 ns/call
 not: 105.13 ns/call
 and: 105.13 ns/call
 or: 105.10 ns/call
 xor: 105.12 ns/call
 shl: 105.12 ns/call
 shr: 105.12 ns/call
 addshr: 105.10 ns/call
 mul: 105.15 ns/call
 mad: 110.05 ns/call
 div: 195.22 ns/call
 fmov: 155.09 ns/call
 fadd: 175.23 ns/call
 fmul: 175.11 ns/call
 fmad: 175.20 ns/call
 fdiv: 320.56 ns/call
 call: 165.10 ns/call
 sqrt: 781.37 ns/call
 cos: 910.89 ns/call
 malloc: 1120.97 ns/call
 new: 1245.60 ns/call
 vector: 617.67 ns/call
 timer: 2903.34 ns/call
 ofstream: 27621.16 ns/call
 
 | 
  
Here's what we can see generally:
- Almost every integer arithmetic operation
(+-*&|^~<<>>) takes almost exactly the same amount of
time--under half a nanosecond on typical x86 machines!  In fact,
it's difficult to measure the cost of a single arithmetic operation.
- Array indexing is often very fast, under half a nanosecond.
 
- The one exception is that divides are pretty slow--about
10ns!  This is way more than any other integer arithmetic
operation.
 
- Calling a function is pretty slow; about 5ns. 
 
- Floating point is a little more expensive than integer, a few
nanoseconds per operation.  Floating point divides and square
roots are expensive, just like integer.  Floating-point cosine is
even more expensive--70ns!
 
- Memory allocation, whether via malloc/free, new/delete, or std::vector, is very expensive--around 100ns!
- Timer (and other operating system) calls are yet more expensive--800ns!
- File operations are insanely expensive--tens of thousands of nanoseconds!
- With slower-clock-rate CPUs, times of most operations scale up
close to linearly (10x slower clock rate -> about 10x longer time).
 
In general, you can often replace expensive operations with cheaper
operations.  For example, instead of doing a divide (10ns), you
can do an integer bit shift (<1ns) or a floating-point multiply by
reciprocal (x/y = x*(1.0/y)).  Instead of making a function call,
use "inline" to have the compiler insert the machine code into the
caller. Instead of calling "cos", you might eliminate the operation
with an algebraic substitution (there are hundreds of useful trig identities),
or replace it with a table lookup (since array indexing is very
fast).  Instead of calling new/delete every iteration, you might
reuse the allocated buffer and postpone the delete to the end of the
program.  And anything you can do to avoid opening and closing
files repeatedly will dramatically improve performance!