Performance-Improvement Examples

CS 301 Lecture, Dr. Lawlor

(Also, Chapter 5 of the Bryant and O'Hallaron textbook has a lot more info on performance and optimization)

So you need to make your code go faster.  Here are the steps you go through to do this:
  1. Understand what the code does.  This is really important.  There's a great story about a bunch of maintenance programmers who got handed a huge database system that was running too slow.  They found one piece of code that was running all the time, doing a bunch of silly slow stuff, and they worked for a few months to speed it up.  In the end, that part of the code got 10 times faster, but the overall database performance was exactly the same.  Why?  Because they'd sped up the "idle loop", the part of the program that only runs when nothing is happening!
  2. Understand where the code spends its time.  Usually, 99% of the code can be totally ignored--it runs once, or it's error-handling code, or it's not being used anymore.  It's the remaining 1% of the code that takes 90% of the runtime.  Find that 1% by understanding the code (step 1), using timers (see below), using debug statements (like std::cout<<"foobar.c is finally running!\n"), and using good profiling tools like gprof or VTune.
  3. Use a better algorithm (like you cover in CS 311 and 411).  You can't ever tweak bubblesort (an n-squared algorithm) to run faster than quicksort (an n-log(n) algorithm) for big arrays, so start your optimization session by switching to quicksort.  Use search trees instead of brute-force searching a whole array; loop over the data once instead of a zillion times; and so on.  The STL makes this sort of thing trivial (see the little sorting example after this list), so do it.  Stupid algorithms are a really common source of inefficiency; a good algorithm written sloppily will usually beat even the tightest implementation of a dumb algorithm.  Joel on Software has an entertaining illustration of this with his Shlemiel the painter story.  Shlemiel doesn't need to run faster in order to get more work done!
  4. Let the compiler help you.  Make constants really constant, so the compiler can lift them out of loops.  Use "inline" to expose more subroutines to the optimizer.
  5. Replace slow calls with fast calls.  See the speed of stuff below.
  6. Read the assembly code the compiler generates.  If you're sure the compiler's optimized away all the "+=117" calls, but you see the constant 117 all over the assembly, you know it hasn't.  If you wanted everything inlined, but the compiler's calling functions left and right, you know you need to make your code inlineable.  If, right in the middle of your fast inner loop, the compiler's putting in a call to "__moddi3", you need to figure out what the heck that is (a gcc compiler builtin routine), and how to get rid of it (use "int" instead of "long long", or don't use % on long longs).
  7. If you can't get the compiler to spit out the code you want (which is rare!), write small bits of assembly code.  It does take a fair amount of effort to beat the compiler, so this should not be your first choice!
  8. If you can't write assembly code for all the cases you need to handle, consider some sort of dynamic code generation (coming before the end of the semester).
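
For step 3, the easiest concrete win is usually sorting and searching: a hand-rolled bubble sort is O(n^2), while the library sort is O(n log n) and takes one line to call.  Here's a minimal sketch, assuming a C++11 (or newer) compiler; the array contents are just made-up example data:
#include <algorithm>
#include <iostream>
#include <vector>

int main(void) {
    std::vector<int> v = {42, 7, 19, 3, 88, 1}; /* made-up data */
    /* One call replaces a hand-written n-squared sorting loop: */
    std::sort(v.begin(), v.end());
    for (int x : v) std::cout << x << " ";
    std::cout << "\n";
    return 0;
}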

Understanding Time

So optimization is about reducing the amount of time your code uses.  Step 1 of this process is understanding how to measure time.

Name          Abbrev.        Duration    Same as
Second        1s             1.0e0s      1 heartbeat
Millisecond   1ms            1.0e-3s     1 bullet-foot
Microsecond   1us (or µs)    1.0e-6s     1 light-block
Nanosecond    1ns            1.0e-9s     1 light-foot, 1 cpu clock cycle, 1 cpu instruction

Human scales are seconds or so.  Your eyes can detect flashing from a CRT going at 50Hz refresh (50 cycles per second, or 0.02 seconds per cycle, or 20ms/cycle).  You can't see flashing in a CRT at 100Hz refresh, so your eyes can't follow motions that are faster than 10ms or so.  You can hear sounds up to about 20kHz, so your ears can detect air movements that take only 50us or so.

A mere 1GHz computer has a 1ns clock period.  This means the central circuit that drives everything else in the CPU rises and falls once every nanosecond.  Most machines can get a few instructions done per clock cycle, which is unbelievably fast!
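
To actually measure times like these yourself, wrap the code you care about in a timer and repeat it enough times that the total is well above the clock's resolution.  Here's a minimal sketch using std::chrono (assuming a C++11 or newer compiler; the work function and repeat count are just made-up examples):
#include <chrono>
#include <iostream>

int work(void) { /* stand-in for the code you actually want to time */
    int sum=0;
    for (int i=0;i<1000;i++) sum+=i*i;
    return sum;
}

int main(void) {
    int repeats=100000;   /* repeat so the total time dwarfs the timer's resolution */
    long long keep=0;     /* use the results so the optimizer has less reason to throw the work away */
    auto start=std::chrono::high_resolution_clock::now();
    for (int r=0;r<repeats;r++) keep+=work();
    auto stop=std::chrono::high_resolution_clock::now();
    double ns=std::chrono::duration_cast<std::chrono::nanoseconds>(stop-start).count()/(double)repeats;
    std::cout<<"work() takes about "<<ns<<" ns per call (result "<<keep<<")\n";
    return 0;
}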

What's Fast, and What's Slow?

Time     Things that take roughly this long
1s       Waiting for the human being
100ms    Reading or writing from CD, DVD, floppy
10ms     Typical remote network latency; reading or writing from Hard Drive
1ms
100us    Typical local network latency
10us     fopen
1us      cout, cin, OS call
100ns    sqrt, sin, cos, pow, malloc, new
10ns     /, %, CALL
1ns      +, -, *, &, |, ^, <<, >>, [i]

Example: function call overhead.  Fix: "inline".

Calling a function takes some work--you've got to push the parameters, CALL and RET, clean up the stack, restore your registers, etc.  Further, function calls usually trip up the compiler's optimizer.  For a short function, it might be faster and less work to just fold the function body into the caller's code.  You can ask the compiler to do this "function inlining" by adding the "inline" keyword before a short function.  For example, this non-inlined code takes 22ns:
int do_silly_work(int a,int b) {
    return a+b/a;
}
int foo(void) {
    return do_silly_work(7,13);
}

(Try this in NetRun now!)

Adding one word, "inline", tells the compiler to insert the code for do_silly_work inside foo.  Once it's there, the compiler's optimizer then has a chance to notice that a and b are both *constants*, so it can do the silly work at compile time.   Overall, foo now takes just 4ns, a 5x speedup!
inline int do_silly_work(int a,int b) {
    return a+b/a;
}
int foo(void) {
    return do_silly_work(7,13);
}

(Try this in NetRun now!)

"inline" is really handy.  I put "inline" in front of most of my short functions.

Example: parameter passing overhead.  Fix: pass by pointer or reference.

C++ passes most parameters "by value", making a separate copy on the stack to hold the value.  For "int", this is fast.  For a class named "el_sluggo_maximus", this copy can slow your code down *dramatically*.  If you see calls to the class constructor or "memcpy" (to copy the class's bytes onto the stack) in the caller's assembly, you know there's an extra copy of your class floating around!

The solution is to pass big parameters by pointer (like "el_sluggo_maximus *var") or reference (like "el_sluggo_maximus &var"), because then a tiny pointer is copied on the stack, not the huge class.

Note that arrays are always passed by pointer, not by value, so passing array parameters is always cheap.
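
For example, here's a hypothetical big class passed both ways (the class, field, and function names are made up for illustration):
#include <iostream>

struct el_sluggo_maximus { /* a big class: 1000 doubles is about 8KB per copy */
    double data[1000];
};

/* By value: all 8KB gets copied onto the stack at every call. */
double sum_by_value(el_sluggo_maximus s) {
    return s.data[0]+s.data[999];
}

/* By const reference: only a pointer-sized handle is passed, so the big object is never copied. */
double sum_by_reference(const el_sluggo_maximus &s) {
    return s.data[0]+s.data[999];
}

int main(void) {
    el_sluggo_maximus big = {}; /* zero-initialize the made-up data */
    std::cout << sum_by_value(big) << " " << sum_by_reference(big) << "\n";
    return 0;
}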

Example: expensive arithmetic, like sqrt or divide.  Fix: use another, cheaper operation.

Instead of
    int c = a%b;
just use the properties of binary arithmetic (assuming b is a power of two) and write
    int c = a & (b-1);

Instead of
    int c = a / b;
just figure out the appropriate bit shift n (assuming b is a power of two again) and use
    int c = a >> n;

Note that if "b" is a constant known at compile time (e.g., an "enum" or a "const int"), the compiler will use both of the above tricks automatically.

In floating point, instead of
    float c = a / b;
if you can compute ib=(1.0/b) ahead of time, then it's faster to use a multiply-by-inverse:
    float c = a * ib;
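
This pays off most when the same divisor is reused many times, like inside a loop over an array.  A minimal sketch (the function names and array data are made up; note the two versions can differ in the last bit or so, since the rounding happens differently):
#include <iostream>

/* One divide per element: slow. */
float sum_scaled_slow(const float *arr,int n,float b) {
    float sum=0.0f;
    for (int i=0;i<n;i++) sum+=arr[i]/b;
    return sum;
}

/* One divide up front, then cheap multiplies inside the loop. */
float sum_scaled_fast(const float *arr,int n,float b) {
    float ib=1.0f/b;
    float sum=0.0f;
    for (int i=0;i<n;i++) sum+=arr[i]*ib;
    return sum;
}

int main(void) {
    float arr[4]={1.0f,2.0f,3.0f,4.0f}; /* made-up data */
    std::cout<<sum_scaled_slow(arr,4,7.0f)<<" vs "<<sum_scaled_fast(arr,4,7.0f)<<"\n";
    return 0;
}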

For comparisons, instead of
    if (a/b > c) ...
then (assuming b is positive) you can multiply both sides by b, and compare
    if (a > c*b) ...

Similarly, instead of
    if (sqrt(a) < b) ...
you can just square both sides, and (assuming both sides are positive) use
    if (a < b*b) ...
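
A classic place the sqrt trick shows up is comparing distances: to decide which of two points is closer, you never need the actual distances, only the comparison, so both square roots can be dropped.  A minimal sketch (the point values are made up):
#include <cmath>
#include <iostream>

struct point { double x,y; };

/* Squared distance: no sqrt needed. */
double dist2(point a,point b) {
    double dx=a.x-b.x, dy=a.y-b.y;
    return dx*dx+dy*dy;
}

int main(void) {
    point origin={0.0,0.0}, p={3.0,4.0}, q={1.0,7.0}; /* made-up points */

    /* Slow: take two square roots just to throw them away after the comparison. */
    bool p_closer_slow=std::sqrt(dist2(origin,p))<std::sqrt(dist2(origin,q));

    /* Fast: both sides are non-negative, so compare the squares directly. */
    bool p_closer_fast=dist2(origin,p)<dist2(origin,q);

    std::cout<<p_closer_slow<<" == "<<p_closer_fast<<"\n";
    return 0;
}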