Using Timing Data to Understand CPU Architecture

CS 441 Lecture, Dr. Lawlor

We saw in the last lecture's notes how you can measure the performance impact of superscalar execution on a Pentium 4.  The same kind of measurement works on other CPUs.

MIPS

Our MIPS R5000 CPU runs at 180MHz.  That's about 5.6ns per clock cycle.  So this piece of code, which just loads two numbers into registers and adds them, takes about two dozen clock cycles, 130ns in total:
li $5,7         # load the constant 7 into register $5
li $6,2         # load the constant 2 into register $6
add $2,$5,$6    # $2 = $5 + $6  ($2 holds the return value)
jr $31          # return to the caller ($31 holds the return address)
nop             # branch delay slot, does nothing

(Try this in NetRun now!)
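
If you want to double-check that arithmetic, here's a tiny standalone C++ snippet (not one of the lecture's NetRun examples); the 180MHz clock rate and the 130ns figure are just the numbers quoted above:

#include <iostream>
int main(void) {
    double clock_rate=180.0e6;            /* 180MHz R5000 */
    double ns_per_clock=1.0e9/clock_rate; /* about 5.6ns per clock */
    double measured_ns=130.0;             /* measured time for the code above */
    std::cout<<ns_per_clock<<" ns/clock, so roughly "
             <<measured_ns/ns_per_clock<<" clock cycles\n";
    return 0;
}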

The "nop" (a placeholder no-operation instruction) sitting after the "jr" (jump to register) instruction is there to fill the "branch delay slot".  In MIPS, because the pipeline takes an extra clock cycle to decode branch instructions, they've defined the instruction set to just execute the instruction after a branch, even if the branch is taken.  This means we can save one clock cycle by moving the add instruction down into the branch delay slot.  This takes 125 ns, one cycle less than above:
li $5,7
li $6,2
jr $31          # return to the caller...
add $2,$5,$6    # ...but this instruction, sitting in the delay slot, executes anyway

(Try this in NetRun now!)

If we add another instruction, like a "li $8,8", anywhere before the branch, the code takes one additional clock cycle.  The extra time is needed even if the new instruction is totally unrelated to the rest of the existing instructions, whereas a superscalar machine would run the new non-dependent instruction in parallel at zero extra cost.

This happens because our MIPS CPU is *not* superscalar!
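
For contrast, here's one way to look for superscalar execution on an ordinary desktop machine, in the spirit of the last lecture.  This is only a hedged sketch: the loop bodies, iteration count, and use of std::chrono are my own choices here (not NetRun code), the volatile just keeps the compiler from deleting the additions, and the exact numbers you get depend on your CPU and compiler flags.  The idea is that a superscalar CPU can overlap the two independent additions, so that loop should run noticeably faster per iteration than the dependent one; a strictly one-instruction-at-a-time machine like our R5000 would take about the same time for both:

#include <chrono>
#include <iostream>

int main(void) {
    const long n=100000000;
    volatile int x=1, y=2, z=3, w=4; /* volatile: keep the adds from being optimized away */

    auto t0=std::chrono::steady_clock::now();
    for (long i=0;i<n;i++) { x+=y; x+=y; } /* dependent: the second add needs the first's result */
    auto t1=std::chrono::steady_clock::now();
    for (long i=0;i<n;i++) { x+=y; z+=w; } /* independent: a superscalar CPU can overlap these */
    auto t2=std::chrono::steady_clock::now();

    std::chrono::duration<double> dep=t1-t0, indep=t2-t1;
    std::cout<<"dependent pair:   "<<dep.count()/n*1.0e9<<" ns/iteration\n";
    std::cout<<"independent pair: "<<indep.count()/n*1.0e9<<" ns/iteration\n";
    return 0;
}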

Figuring Out What's Slow (That is, Performance Analysis)

Your basic tool for figuring out what's slow is a timer.  This is just a function that returns the current real time ("wall-clock" time).  NetRun has a built-in function called "time_in_seconds", which returns a double giving the current time in seconds.  Here's the implementation (from project/lib/inc.c):
/** Return the current time in seconds (since something or other). */
#if defined(WIN32)
# include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
    struct _timeb t;
    _ftime(&t);
    return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
# include <sys/time.h> // For the gettimeofday time implementation
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
    struct timeval tv;
    gettimeofday(&tv,NULL);
    return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif
(As usual, there's one version for Windows, and a different version for everything else.)

The usual way to use a timer like this is to call the timer before your code, call it after your code, and subtract the two times to get the "elapsed" time:
#include <iostream>
#include <fstream>
int foo(void) {
    double t_before=time_in_seconds();
    std::ofstream f("foo.dat"); /* code to be timed goes here */
    double t_after=time_in_seconds();
    double elapsed=t_after - t_before;
    std::cout<<"That took "<<elapsed<<" seconds\n";
    return 0;
}

(executable NetRun link)

The problems are that:
  1. time_in_seconds returns seconds, but it's only accurate to microseconds (not nanoseconds!)
  2. time_in_seconds takes about 800ns to return (which is considered pretty fast!)
  3. Your code might get interrupted by the OS, other processes, hardware events, etc.
Problems 1 and 2 can be cured by running many iterations: "scaling up" to many copies of the problem until the total time registers on your (noisy, quantized) timer.  So instead of doing this, which incorrectly claims "x+=y" takes 700ns:
int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

You'd do this, which shows the real time: about 0.5ns per addition!
int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;

(Try this in NetRun now!)

Problem 3 can be cured by running the above several times, and throwing out the anomalously high values (which were probably interrupted).
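
Here's a hedged sketch of that idea, in the same paste-into-NetRun style as the snippets above (it relies on NetRun's time_in_seconds, and the trial and iteration counts are arbitrary).  Timing the whole loop several times and keeping only the fastest trial is a simple way to throw out the runs that got interrupted:

int x=3, y=2, n=1000000, trials=10;
double best=1.0e30; /* smallest per-iteration time seen so far, in seconds */
for (int t=0;t<trials;t++) {
    double t1=time_in_seconds();
    for (int i=0;i<n;i++) x+=y;
    double t2=time_in_seconds();
    double per_iter=(t2-t1)/n;
    if (per_iter<best) best=per_iter; /* keep the fastest (least-interrupted) trial */
}
std::cout<<best*1.0e9<<" ns\n";
return x;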

Your other main tool for performance analysis is a "profiler".   This is a library that keeps track of which function is running, and totals up the number of function calls and time spent in each one.  The "Profile" checkbox in NetRun will run a profiler, and show you the top functions it found in your code.

Both timing and profiling have some problems, though.

Finally, NetRun has a nice little built-in function called print_time that takes a string and a function.  It times the execution of (many iterations of) the function, then divides by the number of iterations to print out exactly how long each call to the function takes.  The "Time" checkbox in NetRun calls print_time on the default foo function.
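
NetRun's real print_time lives in its support library; here's a rough sketch of the same idea, not NetRun's actual code.  The name my_print_time and the fixed iteration count are made up for illustration, and it assumes the time_in_seconds function shown earlier:

#include <iostream>

double time_in_seconds(void); /* defined as shown earlier in these notes */

/* Rough print_time-style helper: time many calls to fn, report the average per call. */
void my_print_time(const char *name, int (*fn)(void)) {
    int n=1000000; /* fixed count; a more careful version might scale this until the total time is easily measurable */
    double t1=time_in_seconds();
    for (int i=0;i<n;i++) fn();
    double t2=time_in_seconds();
    std::cout<<name<<": "<<(t2-t1)/n*1.0e9<<" ns per call\n";
}

You'd call it as my_print_time("foo",foo); for any int foo(void) you want timed.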