Making Code Faster
    
    CS 301 Lecture,
    Dr. Lawlor
    
    One of the biggest reasons to understand assembly language and
    computer architecture is performance.  
    What is Fast?
    Speed is in the eye of the beholder--in this case, the user. 
    But
    there's a big disconnect between computer speed and human speed:
    
    
      
      
        
          
            
              | Time | 
                 
 | Seconds | Examples | 
            
              | 31
                  years | Gs | 10e9 | Complete
                  life cycle of a successful multi-person software
                  project. North
                    America drifts a few feet. | 
            
              | 2
                  weeks | Ms | 10e6 | Write,
                  debug, and test a simple single-person program. Set
                  up billing for a commercial job. | 
            
              | 15
                  minutes | ks | 10e3 | Sysadmin
                  arrives to reboot server. Write
                  trivial script or one-off program. Shadow
                  of a 30 foot tall building travels one foot. | 
            
              | second | s | 1 | Humans can respond
                  to input. | 
            
              | millisecond | ms | 10e-3 | Send
                  data across a fast network. Start
                  seek on fast hard disk. Bullet
                  travels about 1 foot. | 
            
              | microsecond | us or μs | 10e-6 | Print
                  to the screen (printf or cout). Call
                  the operating system. Blink
                  an LED. | 
            
              | nanosecond | ns | 10e-9 | Execute
                  a few instructions. Call
                  a function. 1
                  clock cycle, at 1GHz clock rate. Light
                  travels about 1 foot. | 
          
        
       
    
    Because computers are so fast, and humans are so slow:
    
      - Often you can make things *seem* faster by starting them
        sooner.  Don't wait until the user clicks the "Run" button,
        you
        can start a preemptive calculation as soon as the mouse is
        *over* the
        button!
 
- For slow programs, often batch processing is more
        user-friendly.  In batch mode you first collect all the
        data
        you'll need, then compute it all; this lets the human wander
        away,
        rather than babysit an interactive program.
 
- Sometimes it's faster to run a slow program right now, than
        spend the time to write a fast program.  This means it
        might be rational to leave most of the program logic in
        Microsoft Excel.
 
- Sometimes it's faster overall to write a program in a slow
        language (like Python), than to write a more complex program in
        a fast language (like C++ or assembly).
Measure First
    The simplest tool for figuring out what's slow is a timer: a
    function
    that returns the current real time ("wall-clock"
    time).   It's a really bad idea to try to optimize code
    without being able to measure its performance, because intuition
    just
    isn't reliable (at least, mine isn't).  At least half of the
    "optimizations" I try either don't help measurably, or actually slow
    the program down!
    
    NetRun has a builtin function called
    "time_in_seconds", which returns a double giving the number of
    seconds.  Here's the implementation (from project/lib/inc.c):
    /** Return the current time in seconds (since something or other). */
#if defined(WIN32)
#  include <sys/timeb.h>
double time_in_seconds(void) { /* This seems to give terrible resolution (60ms!) */
        struct _timeb t;
        _ftime(&t);
        return t.millitm*1.0e-3+t.time*1.0;
}
#else /* UNIX or other system */
#  include <sys/time.h> //For gettimeofday time implementation
double time_in_seconds(void) { /* On Linux, this is microsecond-accurate. */
        struct timeval tv;
        gettimeofday(&tv,NULL);
        return tv.tv_usec*1.0e-6+tv.tv_sec*1.0;
}
#endif
    As usual, there's one version for Windows, and a different version
    for
    everything else.  There are TONS of other ways to get some
    notion
    of time in your programs:
    
      - double seconds=time_in_seconds();  Only exists in
        NetRun.  Return
        the number of seconds since *something*.  This is a double,
        implemented
        as above.
- long seconds=time(0);  
        Returns the number of seconds since January 1, 1970 (the
        notional birthday of UNIX!). 
        It's in <time.h>, and exists everywhere.  Sadly, this
        is an
        integer: good for calling "srand", but almost useless for timing
        fast
        code unless you are *very* patient.
- double
        seconds=clock()*(1.0/CLOCKS_PER_SEC);  Also standard, and
        from
        <time.h>.  CLOCKS_PER_SEC is like a million, but
        clock()
        seems to be quantized to milliseconds or worse on my
        machines. 
        Sadly,on most machines "clock()" returns CPU time, not wall
        clock
        time, so you can't time disk or network or anything non-CPU
        bound.  Also look out!  On a 32-bit machine, clock()
        may
        overflow and wraparound as often as every hour (4000 seconds * 1
        million CLOCKS_PER_SEC = 4 billion clocks...)
 
- The "rdtsc" instruction, like in the homework, returns the
        number
        of CPU clock cycles since the machine started up.  The
        cycle
        count's low bits go into eax, and the high bits go into
        edx.  The
        only problem with this is "clock throttling", where newer CPUs
        vary
        their clock speed (for energy efficiency when on battery power,
        or for
        thermal management, etc).  This variation shows up in
        rdtsc, and may
        vary between cores on a multi-core machine!
 
The usual way to use a timer like this is to call the timer before
    your
    code, call it after your code, and subtract the two times to get the
    "elapsed" time:
    #include <fstream>
int foo(void) {
	double t_before=time_in_seconds();
		std::ofstream f("foo.dat");  /* code to be timed goes here */
	double t_after=time_in_seconds();
	double elapsed=t_after - t_before;
	std::cout<<"That took "<<elapsed<<" seconds\n";
	return 0;
}
    (executable
        NetRun link)
    
    The problems are that:
    
      - time_in_seconds returns seconds, but it's only accurate to
        microseconds (not nanoseconds!)
- time_in_seconds takes about 800ns to return (which is
        considered pretty fast!)
- Your code might get interrupted by the OS, other processes,
        hardware events, etc
Problems 1 and 2 can be cured by running many iterations--"scaling
    up"
    many copies of the problem until it registers on your (noisy,
    quantized) timer.  So instead of doing this, which incorrectly
    claims "x+=y" takes 700ns:
    int x=3, y=2;
double t1=time_in_seconds();
x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)*1.0e9<<" ns\n";
return x;
    (Try
        this in NetRun now!)
    You'd do this, which shows the real time, 0.5ns!
    int x=3, y=2, n=1000000;
double t1=time_in_seconds();
for (int i=0;i<n;i++) x+=y;
double t2=time_in_seconds();
std::cout<<(t2-t1)/n*1.0e9<<" ns\n";
return x;
    (Try
        this in NetRun now!)
    
    Problem 3 can be cured by running the above several times, and
      throwing out the anomalously high values (which were probably
      interrupted).
    
    Your other main tool for performance analysis is a
    "profiler".   This is a library that keeps track of which
    function is running, and totals up the number of function calls and
    time spent in each one.  The "Profile" checkbox in NetRun will
    run
    a profiler, and show you the top functions it found in your code.
    
    Both timing and profiling have some problems, though:
    
      - Timing and profiling both take time themselves, which skews
        your
        measurement of the code.  This is often amusingly refered
        to as a
        Heisenberg effect--stuff behaves differently when you're
        watching it!
- Timing and profiling both fail on very short pieces of
        code.  Sometimes you'll have to repeat the code many times
        for it
        to take a measureable length of time. 
 
NetRun has a nice little built-in function called print_time that
    takes
    a string and a function.  It times the execution of (many
    iterations of) the function, then divides by the number of
    iterations
    to print out exactly how long each call to the function takes. 
    The "Time" checkbox in NetRun calls print_time on the default foo
    function.