
Graphics processors have changed considerably over the last 10 years.
Today they can do much more than render vertices on screen. We
will first look at the differences between a CPU and a GPU. Next, we will see how
recent graphics cards are architected. Finally, we will see how to use a graphics card as a co-processor.
The computing power required by computer games, and by
3D modeling in general, has increased drastically over the last few years. This piece of
silicon, once optional, is now one of the major parts of the system.
At first, graphics cards were only there to help the CPU with specific tasks;
now they take on more and more functionality.

This diagram compares the floating-point performance of CPUs
and GPUs.
The main market for
graphics card vendors is video game players. It is the only market large enough to buy
video cards in volume and willing to pay as much for a video card as for a CPU.
This is why the cost
of GPUs stays reasonably low.
| Type | Price |
|------|-------|
| GPU  | $514  |
| GPU  | $420  |
| GPU  | $199  |
| CPU  | $316  |
| CPU  | $224  |
Price is also a
crucial element in this war. Why use SMP (symmetric multiprocessing) if my old graphics card could do better?
A CPU is expected to process a single task as
fast as possible, whereas a GPU must be able to process as many tasks as possible
over a large amount of data. The two do not have the same priorities, and their
respective architectures reflect this.
GPUs increase the number of
processing units, while CPUs develop their control logic and expand their cache.

As we can see, GPUs are highly parallel!
Memory is often a
limiting factor for a system. CPUs try to remove this limitation by expanding the
size of their cache memory.
|                  | Memory type | Bandwidth  |
|------------------|-------------|------------|
| Nvidia           | GDDR3       | 83.2 GB/s  |
| ATI              | GDDR4       | 128.0 GB/s |
| Intel Core 2 Duo | DDR2        | 6.4 GB/s   |
Enlarging the cache
does not help, however, when you work on a large amount of data. GPUs have faster
memory, most likely because a new generation comes out every 6 months.
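To put the bandwidth figures from the table in perspective, the following minimal sketch estimates how long each memory system would need to stream a data set once. The 1 GB data set size and the helper name `transfer_time_ms` are assumptions chosen for illustration; only the bandwidth figures come from the table, and real transfers rarely reach the peak rate.

```python
# Peak memory bandwidth in GB/s, taken from the table above.
BANDWIDTH_GB_S = {
    "Nvidia (GDDR3)": 83.2,
    "ATI (GDDR4)": 128.0,
    "Intel Core 2 Duo (DDR2)": 6.4,
}


def transfer_time_ms(size_gb, bandwidth_gb_s):
    """Time in milliseconds to stream size_gb once at the peak bandwidth."""
    return size_gb / bandwidth_gb_s * 1000.0


if __name__ == "__main__":
    data_set_gb = 1.0  # hypothetical working set, far larger than any CPU cache
    for name, bandwidth in BANDWIDTH_GB_S.items():
        print(f"{name}: {transfer_time_ms(data_set_gb, bandwidth):.1f} ms")
```

Once the data no longer fits in cache, the time is dominated by the raw memory bandwidth, which is where the graphics cards have the advantage.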
Accuracy is
very important for scientific problems.
|             | Floating-point precision (bits)   |
|-------------|-----------------------------------|
| ATI GPUs    | 64                                |
| Nvidia GPUs | 64 (32 on the 8800)               |
| CPU         | 128 (double on 64-bit processors) |
Future generations
of GPUs are expected to exceed the accuracy of the CPU.
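To illustrate why precision matters for scientific work, the sketch below accumulates the same value many times, once rounding every intermediate result to 32 bits and once keeping 64 bits. It is only an illustration: the helper names `to_float32` and `accumulate` and the choice of workload (adding 0.1 a hundred thousand times) are assumptions, and the 32-bit behaviour is emulated on the CPU with Python's `struct` module rather than measured on a GPU.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add `step` to a running total `count` times.

    When `single_precision` is true, the step and every intermediate
    total are rounded to 32 bits to emulate single-precision hardware.
    """
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


if __name__ == "__main__":
    exact = 10000.0  # 0.1 * 100000
    err32 = abs(accumulate(0.1, 100_000, True) - exact)
    err64 = abs(accumulate(0.1, 100_000, False) - exact)
    print(f"32-bit error: {err32:.6g}")
    print(f"64-bit error: {err64:.6g}")
```

The 32-bit total drifts visibly from the exact value, while the 64-bit total stays very close to it.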
The 8800 is composed of
128 stream processors, each running at 1350 MHz. Each processor can
perform one MAD and one MUL
per
clock cycle. Special instructions such as EXP, LOG, RCP, RSQ, SIN and COS take 4
cycles and are handled by
an extra unit.
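These figures give a rough theoretical peak for the 8800. The sketch below assumes the usual convention that a MAD (multiply-add) counts as two floating-point operations and a MUL as one; that convention and the function name `peak_gflops` are assumptions for illustration, not figures from the text.

```python
# Assumption: a MAD counts as 2 floating-point operations, a MUL as 1.
MAD_FLOPS = 2
MUL_FLOPS = 1


def peak_gflops(units, clock_mhz, flops_per_cycle):
    """Theoretical peak in GFLOPS: units * clock * flops issued per cycle."""
    return units * clock_mhz * 1e6 * flops_per_cycle / 1e9


if __name__ == "__main__":
    # 128 stream processors at 1350 MHz, one MAD and one MUL per cycle.
    print(f"{peak_gflops(128, 1350, MAD_FLOPS + MUL_FLOPS):.1f} GFLOPS")
```

Under these assumptions the peak works out to 518.4 GFLOPS. It is an upper bound: it ignores memory latency and the 4-cycle special instructions.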
For the 2900 XT,
ATI chose another architecture. Instead of using SIMD like Nvidia, they
use 5-way MIMD. That means
five instructions that are independent of each other can be executed together. Each group of 5 processors has
a special unit able to handle special instructions. A Radeon HD 2900 can handle
320 simple operations or 256 simple + 64 special ones. The frequency is only 742 MHz.
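The split between 320 simple operations and 256 simple + 64 special ones follows from the 5-way grouping. The sketch below reconstructs it, assuming that exactly one unit in each group of five is the one able to handle special instructions. The text does not state this outright, so treat it as an interpretation; the function name `unit_breakdown` is likewise hypothetical.

```python
def unit_breakdown(total_units, group_size):
    """Split the units into groups and count simple vs. special-capable units.

    Assumption: exactly one unit per group is the special-capable one.
    Returns (groups, simple_units, special_units).
    """
    groups = total_units // group_size
    special = groups
    simple = total_units - special
    return groups, simple, special


if __name__ == "__main__":
    groups, simple, special = unit_breakdown(320, 5)
    print(f"{groups} groups: {simple} simple + {special} special units")
```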


These
two architectures show the current trend among manufacturers: multiply the number of units,
have each one perform simpler calculations, and parallelize the data as much as possible.
GPUs can read and write anywhere
in local memory (on the graphics card) or elsewhere (in other parts of the system).
These memories, however, are not cached, and the latency of a
read or write on the GeForce 8800 ranges between 200 and 300
cycles! This latency can be masked by the extremely long pipeline, as long as the
instructions in flight do not wait on the result of a read.
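A very simplified way to picture this masking: while one thread waits for memory, other threads must have enough independent work to fill the gap. The model below, including the `threads_to_hide_latency` helper and its "cycles of independent work per memory access" parameter, is an assumption made for illustration and not a description of the actual GeForce 8800 scheduler.

```python
import math


def threads_to_hide_latency(latency_cycles, work_cycles_per_access):
    """Threads needed so that some thread always has work to issue.

    Simplified model: each thread alternates between
    `work_cycles_per_access` cycles of computation and one memory access
    lasting `latency_cycles` cycles. While one thread waits, the others
    must together cover the whole latency.
    """
    others = math.ceil(latency_cycles / work_cycles_per_access)
    return others + 1  # the waiting thread plus those covering for it


if __name__ == "__main__":
    # 200 and 300 cycles are the latency bounds quoted for the GeForce 8800;
    # 10 cycles of work per access is a hypothetical value.
    for latency in (200, 300):
        print(latency, "cycles ->", threads_to_hide_latency(latency, 10), "threads")
```

The fewer independent instructions a thread has between two reads, the more threads must be kept in flight to hide the same latency.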


To avoid accessing global memory as much as possible,
each multiprocessor has a small dedicated memory (16 KB). It is called shared
memory because it can be used by the other processors in the same block.
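The benefit of such a small local memory can be sketched on the CPU by counting accesses to "global" memory. The example below computes a 3-point moving sum twice: once reading every operand from the global array, and once copying a tile that fits in 16 KB into a local buffer first. The workload, the 32-bit element size, and the function names are assumptions for illustration; this is a sequential simulation, not GPU code.

```python
SHARED_BYTES = 16 * 1024  # shared memory per multiprocessor (from the text)
FLOAT_BYTES = 4           # assumption: 32-bit elements
# Outputs per tile, leaving room for one extra neighbour value on each side.
TILE = SHARED_BYTES // FLOAT_BYTES - 2


def moving_sum_naive(data):
    """3-point moving sum reading every operand from 'global memory'."""
    reads = 0
    out = []
    for i in range(1, len(data) - 1):
        out.append(data[i - 1] + data[i] + data[i + 1])
        reads += 3
    return out, reads


def moving_sum_tiled(data, tile=TILE):
    """Same result, but each tile is first copied into a small local buffer."""
    reads = 0
    out = []
    for start in range(1, len(data) - 1, tile):
        stop = min(start + tile, len(data) - 1)
        shared = data[start - 1:stop + 1]  # one global read per element
        reads += len(shared)
        for j in range(1, len(shared) - 1):
            out.append(shared[j - 1] + shared[j] + shared[j + 1])
    return out, reads


if __name__ == "__main__":
    data = list(range(10_000))
    _, naive_reads = moving_sum_naive(data)
    _, tiled_reads = moving_sum_tiled(data)
    print("global reads, naive:", naive_reads)
    print("global reads, tiled:", tiled_reads)
```

Both versions produce the same output, but the tiled one touches global memory roughly once per element instead of three times.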
General-purpose computing on graphics
processing units
(GPGPU, also referred to as GPGP and, to a lesser extent, GP²) is a recent
trend focused on using GPUs, rather than the CPU, to perform computations. The
addition of programmable stages and higher-precision arithmetic to the
rendering pipelines allowed software developers to use GPUs for non-graphics
applications. By exploiting the GPU's extremely parallel architecture with
stream processing approaches, many real-time computing problems can be sped up
considerably.
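The stream processing idea can be sketched as follows: a small function (the kernel) is applied independently to every element of one or more input streams. The names `saxpy_kernel` and `run_kernel` are hypothetical, and the plain Python loop is only a sequential stand-in for what a GPU would execute in parallel.

```python
def saxpy_kernel(a, x, y):
    """Kernel: computes one output element from one element of each input stream."""
    return a * x + y


def run_kernel(kernel, a, xs, ys):
    """Apply the kernel to every pair of elements of the two input streams.

    No iteration depends on another, so a GPU could run them all in
    parallel; this list comprehension is only a sequential stand-in.
    """
    if len(xs) != len(ys):
        raise ValueError("input streams must have the same length")
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0]
    ys = [10.0, 20.0, 30.0]
    print(run_kernel(saxpy_kernel, 2.0, xs, ys))
```

Because the kernel never looks at neighbouring outputs, the work can be spread over as many processing units as the hardware offers.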
Vertex and pixel shaders
were added to the graphics pipeline to produce more realistic effects. The
specifications given by Microsoft increase their flexibility and capacity with
each revision.

This is why you
can nowadays run code on a GPU!
Brook for GPUs is a compiler and runtime
implementation of the Brook stream programming language for modern graphics
hardware. Brook is an extension
of standard ANSI C designed to incorporate the ideas of data-parallel
computing and arithmetic intensity. It is a cross-platform language able to run
on ATI and Nvidia hardware, on Linux and Windows, and on DirectX and OpenGL.
The main goal of this language is to make
programming easier: it tries to avoid graphics functions and to simplify common
operations.
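Arithmetic intensity is commonly described as the amount of computation performed per unit of data moved to or from memory. The sketch below compares two hypothetical kernels under that definition; the function name, the kernels, and the operation counts are assumptions chosen only to contrast a low-intensity case with a high-intensity one, and it is written in Python rather than Brook.

```python
def arithmetic_intensity(flops_per_element, words_per_element):
    """Floating-point operations performed per word moved to or from memory."""
    return flops_per_element / words_per_element


if __name__ == "__main__":
    # a*x + y: 2 operations, 2 words read + 1 word written per element.
    print("a*x + y:   ", arithmetic_intensity(2, 3))
    # Degree-8 polynomial evaluated with Horner's rule:
    # 8 multiply-adds = 16 operations, 1 word read + 1 word written.
    print("polynomial:", arithmetic_intensity(16, 2))
```

Kernels with a high ratio keep the many processing units busy, while kernels with a low ratio spend most of their time waiting on memory.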