Tradeoffs Between Big and Small Variables, and Loop Unrolling

CS 301 Lecture, Dr. Lawlor

How big is each size of variable, and what is it good for?  Each column below is one data size, from a single bit up to a 128-bit "Very Long"; read down a column to see that size measured in smaller units, how many distinct values it can hold, and where you typically run into it.

|           |                        |                                |                 |                   |                 | Very Long (not in C++!) |
|           |                        |                                |                 |                   | long long       | 2   |
|           |                        |                                |                 | int               | 2               | 4   |
|           |                        |                                | short           | 2                 | 4               | 8   |
|           |                        | Bytes (char) ->                | 2               | 4                 | 8               | 16  |
|           | Hex Digits ->          | 2                              | 4               | 8                 | 16              | 32  |
| Bits ->   | 4                      | 8                              | 16              | 32                | 64              | 128 |
| 2 values  | 16 values              | 256                            | 64KB (segments) | 4GB               | 16 EB           | Big: about 3.4 x 10^38 |
| << and >> | Hex constants (0xf00d) | "sizeof", pointers, RAM & disk | MS-DOS "word"   | Normal code; IPv4 | New 64-bit code | IPv6 |
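
To see these sizes on your own machine, here's a quick check (my own sketch, not part of the original lecture) that just prints sizeof for each type:

#include <cstdio>

int main(void) {
    /* These are the usual sizes on a 32- or 64-bit x86 machine;
       the C++ standard only guarantees minimums. */
    printf("char:      %d bytes\n",(int)sizeof(char));
    printf("short:     %d bytes\n",(int)sizeof(short));
    printf("int:       %d bytes\n",(int)sizeof(int));
    printf("long long: %d bytes\n",(int)sizeof(long long));
    printf("pointer:   %d bytes\n",(int)sizeof(void *));
    return 0;
}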

The Intel 8008, an 8-bit machine, was introduced back in 1972.  The 8080 followed in 1974, and its 16-bit successor the 8088 took off in the early 1980's with the IBM PC and MS-DOS.  16 bits is only enough to address 64KB of RAM, so MS-DOS used segmented memory to switch which 64KB you could access at any one time.  This required "near" pointers (just a 16-bit offset) and "far" pointers (a 16-bit segment number plus a separate offset), didn't map onto C very well, and was in general a horrible pain.  So at great expense, during the 1990's we switched to 32-bit addressing, first introduced with the 80386.  That gets you to 4GB, which is now not quite enough, hence we're switching to 64-bit addressing.  It's been about one total rewrite per decade on microcomputers.

IBM's S/360 pointer size was 32 bits.  Back in 1964.  That was Very Big.  They didn't even worry about running out of bits until the late 1990's, thirty years later.

Now, IBM's i OS has a 128-bit pointer mode.  Because 16 exabytes (the 64-bit maximum) might not be enough!

The bottom line: big variables give you lots of headroom, which means you don't have to change things around very often.  That's good.

Performance: Small is Beautiful

For example, here's your typical "loop over array of integers":
enum {n=1000*1000};
int arr[n];

int foo(void) {
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return 0;
}

(Try this in NetRun now!)

This takes about 2.3 million nanoseconds to loop over one million integers, and so takes 2.3 ns/operation.

This isn't bad--an integer is four bytes, so 2.3ns/4 bytes is just over 1/2 ns/byte, or just under 2 GB/sec.
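
NetRun does this timing for you; if you want to reproduce numbers like these outside NetRun, here's a rough harness of my own (not from the original lecture) using std::chrono.  The exact figures will of course depend on your machine and compiler flags.

#include <chrono>
#include <cstdio>

enum {n=1000*1000};
int arr[n];

int foo(void) {
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return 0;
}

int main(void) {
    foo(); /* warm up the cache once before timing */
    const int reps=100;
    auto start=std::chrono::steady_clock::now();
    for (int r=0;r<reps;r++) foo();
    auto stop=std::chrono::steady_clock::now();
    double ns=std::chrono::duration<double,std::nano>(stop-start).count()/reps;
    printf("%.2f ns/operation, %.2f GB/sec\n",
        ns/n,                       /* nanoseconds per element */
        n*(double)sizeof(int)/ns);  /* bytes per nanosecond == GB/sec */
    return 0;
}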

But we can do better.  If you shrink the datatype you're operating on from 4-byte "int"s to 1-byte "char"s, then the CPU needs 1/4 as much memory bandwidth for the same number of operations!
enum {n=1000*1000};
char arr[n];

int foo(void) {
    for (int i=0;i<n;i++) {
        arr[i]++;
    }
    return 0;
}

(Try this in NetRun now!)

This is now 1.3 million nanoseconds for one million bytes, so 1.3 ns/operation.  Of course, a char can't hold numbers as big as an int, but not every problem needs huge numbers, and the performance increase is substantial.

But I claim performance is now limited by the loop overhead, not memory.  You can decrease the loop overhead with a technique called loop unrolling, where you create several copies of the body of the loop, and take fewer iterations:
enum {n=1000*1000};
char arr[n];

int foo(void) {
    for (int i=0;i<n;i+=2) {
        arr[i]++;
        arr[i+1]++;
    }
    return 0;
}

(Try this in NetRun now!)

Because there are fewer compare and jump instructions, we're now down to 1.0ns/operation.  You can unroll more times, for more speed:
enum {n=1000*1000};
char arr[n];

int foo(void) {
    for (int i=0;i<n;i+=4) {
        arr[i]++;
        arr[i+1]++;
        arr[i+2]++;
        arr[i+3]++;
    }
    return 0;
}

(Try this in NetRun now!)

Unrolling four times gets us to 0.9ns/operation.  You can unroll more, but it's tough to get much past 0.9ns/operation with pure unrolling--you quickly hit the point of "diminishing returns".  Most of the code is now pure memory accesses:
  c9:	80 80 00 00 00 00 01 	add    BYTE PTR [eax],0x1
  d0:	80 80 01 00 00 00 01 	add    BYTE PTR [eax+0x1],0x1
  d7:	80 80 02 00 00 00 01 	add    BYTE PTR [eax+0x2],0x1
  de:	80 80 03 00 00 00 01 	add    BYTE PTR [eax+0x3],0x1
  e5:	83 c0 04             	add    eax,0x4
  e8:	3d 3f 42 0f 00       	cmp    eax,0xf423f
  ed:	7e da                	jle    c9 <foo+0x5>
We can improve this by noticing that we're just adding one to each of four bytes.  You can do this efficiently by "packing" all four bytes into one integer, and then adding 0x01010101 (four bytes, each with value one) to the packed version.  This is faster because one 4-byte integer addition takes about as long as one 1-byte char addition.  As usual, overflow would be a problem: if any byte reaches 0xFF, its next increment carries over and corrupts the neighboring byte.
enum {n=1000*1000};
char arr[n];

int foo(void) {
    for (int i=0;i<n;i+=4) {
        *(int *)&arr[i] += 0x01010101; /* treat 4 chars as one int; add 1 to each packed byte */
    }
    return 0;
}

(Try this in NetRun now!)

This adds one to each of one million chars using only 1/4 million integer operations, and it only takes half a million nanoseconds, for a total time of 0.5 ns/operation!
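
The overflow caveat is easy to demonstrate.  Here's a tiny check of my own (not from the lecture): if any packed byte is already at 0xFF, the next increment carries over into its neighbor.

#include <cstdio>

int main(void) {
    unsigned int packed=0x000000FF; /* four packed bytes: 00 00 00 FF */
    packed+=0x01010101;             /* hoping for +1 in each byte... */
    printf("0x%08X\n",packed);      /* prints 0x01010200: the carry corrupted the next byte */
    return 0;
}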

Since we're operating on integers here, it might make more sense to go back to an integer array, as long as we remember that each integer of the array stands for four separate char-sized variables:
enum {n=1000*1000}; /* number of items (chars) */
enum {ni=n/sizeof(int) }; /* number of integers */
int arr[ni];

int foo(void) {
    for (int i=0;i<ni;i++) {
        arr[i] += 0x01010101;
    }
    return 0;
}

(Try this in NetRun now!)
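
If you do need to get at one of the packed char-sized values individually, you can dig it back out with shifts and masks--the "<< and >>" column of the table above.  A small sketch of my own (assuming x86's little-endian byte order):

#include <cstdio>

enum {n=16};              /* number of packed char-sized values */
enum {ni=n/sizeof(int)};  /* number of integers holding them */
int arr[ni];

int main(void) {
    for (int i=0;i<ni;i++) arr[i]+=0x01010101; /* add 1 to every packed byte */
    for (int j=0;j<n;j++) {
        int b=(arr[j/4]>>(8*(j%4)))&0xFF;      /* extract packed byte j */
        printf("%d ",b);
    }
    printf("\n"); /* prints sixteen 1's */
    return 0;
}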


We can improve things even further if we go below byte sizes, to operate on individual hex digits.  As usual, you need to worry about overflow--doing this just sixteen times will cause carry-out from the individual hex fields, and major problems!
enum {n=1000*1000}; /* number of items (hex digits) */
enum {ni=n/(2*sizeof(int)) }; /* number of integers */
int arr[ni];

int foo(void) {
    for (int i=0;i<ni;i++) {
        arr[i] += 0x11111111; /* add 1 to each of the 8 hex digits packed in this int */
    }
    return 0;
}

(Try this in NetRun now!)

This loops over 1 million hex digits in just 1/4 million nanoseconds, or just 0.25 ns/operation!
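
Here the overflow headroom is even smaller--a tiny check of my own (not from the lecture) shows what happens after just sixteen of these increments:

#include <cstdio>

int main(void) {
    unsigned int packed=0; /* eight packed hex digits, all starting at 0 */
    for (int k=0;k<16;k++) packed+=0x11111111;
    /* Each digit "should" now hold 16, but a hex digit only goes up to 15,
       so every digit wraps and carries into its neighbor: */
    printf("0x%08X\n",packed); /* prints 0x11111110, not the answer we wanted */
    return 0;
}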