CUDA float speedup

CUDA is NVIDIA's GPU programming platform; programs are written in an extended version of C/C++ and compiled with "nvcc".

PyCUDA: a way to interface with CUDA from the Python programming language.

Why use CUDA?
    We want to speed up a program that can take advantage of a lot of cores
        for example: a program that calculates the sine of an array of floats
    Branching with floats
        no need to build a bit mask as with SIMD
    Extremely well suited for number crunching (it can do floats and ints fast, if you have a lot of them)
    If you have a large dataset to search or sort through

Why not to use CUDA?
    If a program does a lot of sequential, non-repeating work
        For example: the program needs to do a, then b, then c, where each operation cannot be parallelized or is I/O dominated
    If the program will use a lot of RAM (most cards max out at about 1 GB)
    If the program cannot be parallelized
    If you have no NVIDIA GPU (you can't run CUDA on an ATI GPU, or on a multicore CPU)

Why PyCUDA?

    PyCUDA provides a lot of nice helper functions built in, and includes a lot of wrappers around the plain C-style CUDA functions, which makes things simpler to implement.
    PyCUDA also provides a lot of common CUDA functionality as pre-built functions, see: http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/
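
    For example, here is a minimal sketch of my own (using the gpuarray and cumath modules, which the later examples also import) of taking the cosine of an array without writing any CUDA C by hand:

##code##
    import pycuda.autoinit                  #sets up the CUDA context for us
    import pycuda.gpuarray as gpuarray
    import pycuda.cumath
    import numpy

    a = numpy.random.randn(4096).astype(numpy.float32)
    a_gpu = gpuarray.to_gpu(a)              #copy the array onto the GPU
    a_cos_gpu = pycuda.cumath.cos(a_gpu)    #pre-built element-wise cos kernel from pycuda.cumath
    print a_cos_gpu.get()[:5]               #.get() copies the result back to the host
##endcode##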


##All examples were compiled with CUDA 4.2, gcc 4.6.3, PyCUDA 2012.1, Ubuntu 12.04 using LXDE, and an EVGA 460 1GB ##
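
As a quick sanity check (a small sketch of my own, not one of the numbered examples), PyCUDA can report the card's limits, which matters for the thread-count discussion below:

##code##
    import pycuda.autoinit
    import pycuda.driver as drv

    dev = pycuda.autoinit.device            #the device the context was created on
    print "Device:             ", dev.name()
    print "Compute capability: ", dev.compute_capability()
    print "Max threads / block:", dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)
    print "Total memory (MB):  ", dev.total_memory() // (1024*1024)
##endcode##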

So let's make a simple program in CUDA (ex1.py):
##code##
    ##Robert Avila Hartshorn
    ##PyCuda ex1.py
   
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    import pycuda.autoinit
    import numpy
    import pycuda.driver as drv
    import math
    import sys

    ##CUDA CODE##
    #  This takes a float array and calls 'cos' on every element in it
    #  very limited max size (the number of threads in a single block)
    #
    #
    mod = SourceModule("""
      __global__ void cos_gpu(float *a)
      {
        int idx = threadIdx.x + threadIdx.y*blockDim.x;   /* blockDim.x is the x size of the block, so this works for any square size */
        a[idx] = cos(a[idx]);
      }
      """)

    #size = input("Enter a value where x^2 > 512:  ")
    #print "The matrix size is: ", size*size

    for size in range(1,32):            #loop over all square sizes (same x and y); comment this out and uncomment the input above if you want manual input
        SIZE_OF_X = size
        SIZE_OF_Y = size

        if ((SIZE_OF_X*SIZE_OF_Y) > 1024):    #max number of threads per block is 1024 on compute capability 2.x+ cards; on 1.x cards reduce this to 512
            print "ERROR: too many threads for one block"
            sys.exit()            #this check mainly matters if the size came from user input

        ##These are for timing##
        secgpu = []
        seccpu = []

        #Time to try it on the GPU#

        for i in range(0,100):        ##run this a number of times to average out the timings
            a = numpy.random.randn(SIZE_OF_X,SIZE_OF_Y).astype(numpy.float32)      #an array of random floats, SIZE_OF_X by SIZE_OF_Y
            a_cos = numpy.empty_like(a)                #an (uninitialized) array of the same shape for the result
           
            ##Lets time this!##                #this area is for timing
            start = drv.Event()
            end = drv.Event()
            start.record()
            ##timing set up!#
   
            ##gpu running time!##

            a_gpu = cuda.mem_alloc(a.nbytes)            #allocate size_of(a) bytes of memory on the GPU
            cuda.memcpy_htod(a_gpu,a)                #copy a onto the device
            func = mod.get_function("cos_gpu")            #retrieves the compiled GPU function cos_gpu, and stores it as func in Python
            func(a_gpu, block=(SIZE_OF_X,SIZE_OF_Y,1))        #calls cos_gpu with the float array and a block of SIZE_OF_X by SIZE_OF_Y threads
            cuda.memcpy_dtoh(a_cos, a_gpu)                #copy the result back off of the GPU
           

            ##end time!##
            end.record()
            end.synchronize()
            secs = start.time_till(end)
            secgpu.append(secs)
            ##end timing function##   
   
   
        #Time to try it on the CPU#
        for i in range(0,100):        ##run this a number of times to average out the timings
            b = numpy.random.randn(SIZE_OF_X,SIZE_OF_Y).astype(numpy.float32)      #an array of random floats, SIZE_OF_X by SIZE_OF_Y
            b_cos = numpy.empty_like(b)
           
            ##Lets time this!##
            start = drv.Event()
            end = drv.Event()
            start.record()
            ##timing set up!#

            for i in range(0,SIZE_OF_X):
                for ii in range(0,SIZE_OF_Y):
                    b_cos[i][ii] =  math.cos(b[i][ii])        ##NOTE: I could use the numpy implementation of cos to make this faster (I think it uses the SIMD float trick we learned, but 256 bits wide with AVX, which my CPU supports, so a big speedup there)
            ##the numpy version is faster##
            #b_cos=numpy.cos(b)

            ##end time!##
            end.record()
            end.synchronize()
            secs = start.time_till(end)
            seccpu.append(secs)
            ##end timing##

        avggpu = sum(secgpu) / len(secgpu)
        avgcpu = sum(seccpu) / len(seccpu)
    #    print "Starting array:GPU "               
    #    print a                            #Prints the starting array (if you want to make sure it is doing what I say it is)
    #    print "A cos: "                           
    #    print a_cos                        #Prints the results

    #    print "Starting array:CPU "               
    #    print b                            #Prints the starting array (if you want to make sure it is doing what I say it is)
    #    print "A cos: "                           
    #    print b_cos                        #Prints the results
        print "Total number of calculated values: ",SIZE_OF_X*SIZE_OF_Y
        print "Avg time for on GPU:  ", avggpu
        print "Avg time for on CPU:  ", avgcpu
##endcode##

##OUTPUT## ##(the first and last few)##
$ python ex1.py
Total number of calculated values:  1
Avg time for on GPU:   0.0527260803431
Avg time for on CPU:   0.00774591999128
Total number of calculated values:  4
Avg time for on GPU:   0.0865897596627
Avg time for on CPU:   0.00869727995247
Total number of calculated values:  9
Avg time for on GPU:   0.0542380798608
Avg time for on CPU:   0.0151286399923
Total number of calculated values:  16
Avg time for on GPU:   0.14174496226
Avg time for on CPU:   0.0201264000498
Total number of calculated values:  25
Avg time for on GPU:   0.0525241601095
Avg time for on CPU:   0.0358937600069
Total number of calculated values:  36
Avg time for on GPU:   0.0719817603752
Avg time for on CPU:   0.0387318400666

##Middle cut out for space##


Total number of calculated values:  784
Avg time for on GPU:   0.0605971200019
Avg time for on CPU:   0.63690208259
Total number of calculated values:  841
Avg time for on GPU:   0.0523747199401
Avg time for on CPU:   0.720301761578
Total number of calculated values:  900
Avg time for on GPU:   0.0531523199007
Avg time for on CPU:   0.72342432142
Total number of calculated values:  961
Avg time for on GPU:   0.0670486401767
Avg time for on CPU:   0.815556480202
##end output##

This was a simple example of PyCUDA, but as you can see there is not that much speedup: it ranges from about 0.14x (i.e. slower than the CPU) to about 12x, and it turns into a slowdown if you use the numpy version of cos on the CPU side.

There is one major problem with this code: it can only scale out to 1024 elements, because that is the maximum number of threads in a single CUDA block (the product of the x, y and z dimensions) on compute capability 2.x and higher cards; on 1.x cards the maximum is 512 (see "Version features and specifications" on http://en.wikipedia.org/wiki/CUDA for more limitations if needed).

Being able to calculate only 1024 (or 512) items at once is very limiting, so we need to scale this up. The way to do that is to launch a grid of blocks, each block containing at most that many threads, and let the kernel compute which element it owns from its block and thread indices.
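
A side note before the code: ex2.py always launches an exact multiple of the block size. For element counts that are not exact multiples, the usual trick (sketched here with made-up names like cos_gpu_guarded, assuming the same imports as ex2.py; the numbered examples do not do this) is to round the grid size up and guard the index inside the kernel:

##code##
    mod_guard = SourceModule("""
        __global__ void cos_gpu_guarded(float *out, float *a, int n)
        {
            const int idx = blockDim.x*blockIdx.x + threadIdx.x;
            if (idx < n)                /* threads past the end of the array do nothing */
                out[idx] = cos(a[idx]);
        }
        """)

    n = 100000                                       #any size, not necessarily a multiple of the block size
    block_size = 128
    grid_size = (n + block_size - 1) // block_size   #round up so every element gets a thread

    a = numpy.random.randn(n).astype(numpy.float32)
    a_cos = numpy.empty_like(a)
    func_guard = mod_guard.get_function("cos_gpu_guarded")
    func_guard(cuda.Out(a_cos), cuda.In(a), numpy.int32(n),
               grid=(grid_size,1), block=(block_size,1,1))
##endcode##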

ex2.py:
##code##
    ##Robert Avila Hartshorn
    ##PyCuda ex2.py
    import pycuda.gpuarray as gpuarray
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    import pycuda.autoinit
    import pycuda.tools
    import pycuda.cumath
    from pycuda.elementwise import ElementwiseKernel
    import numpy
    import pycuda.driver as drv
    import math
    import sys


    ##Based off of http://wiki.tiker.net/PyCuda/Examples/SimpleSpeedTest

    ##CUDA CODE##
    #  This takes a float array, and calls 'cos' on every element in it
    #  NOTE THE C CODE
    #
    mod = SourceModule("""
        __global__ void cos_gpu_v2(float *out, float *a)
        {
/***************
* This is the big change in the CUDA code: the index now accounts for which block we are in.
* Also, we have to use C/C++ style comments here, not Python ones.
*************/
            const int idx = blockDim.x*blockIdx.x + threadIdx.x;
            a[idx] = cos(a[idx]);
            out[idx] = a[idx];
        }
      """)



    #size = input("Enter a value for block size:  ")    #for manual input
    block_range = [64,128,256,512,1024,2048,4096]    #number of blocks to test (powers of 2 here, but they don't need to be powers of 2)
    for blocks in block_range:
        block_size = 128            #threads per block; can be 1 up to the max (1024 on compute capability 2.x+, 512 on 1.x)
        nbr_tot = blocks * block_size        #total items we will calculate

        print "Total number of elements: ", nbr_tot
       
        ## for timing ##
        secgpu = []
        seccpu = []


        #Time to try it on the GPU#
        for i in range(0,100):        ##run this a number of times to average out the timings
            a = numpy.random.randn(nbr_tot).astype(numpy.float32)      #a 1-D array of nbr_tot random floats
            a_cos = numpy.empty_like(a)       
            ##Lets time this!##
            start = drv.Event()
            end = drv.Event()
            start.record()
            ##timing set up!#

   
            func = mod.get_function("cos_gpu_v2")            #retrieves the GPU function cos_gpu_v2, and stores it as func in Python
            func(cuda.Out(a_cos),cuda.In(a), grid=(blocks,1),block=(block_size,1,1))        #calls the function; this time we are not making separate calls to move data on and off the GPU, but using the nicer PyCUDA wrappers cuda.In and cuda.Out instead


            ##end time!##
            end.record()
            end.synchronize()
            secs = start.time_till(end)
            secgpu.append(secs)
            ##timer end##

        #Time to try it on the CPU#
        for i in range(0,100):        ##run this a number of times to average out the timings
            b = numpy.random.randn(nbr_tot).astype(numpy.float32)      #a 1-D array of nbr_tot random floats
            b_cos = numpy.empty_like(b)
            ##Lets time this!##
            start = drv.Event()
            end = drv.Event()
            start.record()
            ##timing set up!#
            #for i in range(0,nbr_tot):        ##uncomment for the slow per-element version; at these sizes the numpy version below is used instead
            #    b_cos[i] =  math.cos(b[i])
            b_cos = numpy.cos(b)
       

            ##end time!##
            end.record()
            end.synchronize()
            secs = start.time_till(end)
            seccpu.append(secs)
            ##timer end##

        print "The first four elements in the"
        print "GPU"           
        for i in range(0,5):   
            print "Value is: ",a[i],"  And the cos is: ", a_cos[i]        ##print the first 5 to show that it is truly doing what I say it is
        print "CPU"
        for i in range(0,5):   
            print "Value is: ",b[i],"  And the cos is: ", b_cos[i]        ##print the first 5 to show that it is truly doing what I say it is
        print "Avg time for on GPU"
        avggpu = sum(secgpu) / len(secgpu)
        print avggpu
        print "Avg time for on CPU"
        avgcpu = sum(seccpu) / len(seccpu)
        print avgcpu



##OUTPUT##  ##first and last 2##

$ python ex2.py
Total number of elements:  8192
The first four elements in the
GPU
Value is:  2.11672   And the cos is:  -0.51921
Value is:  0.820839   And the cos is:  0.681608
Value is:  -0.86604   And the cos is:  0.647848
Value is:  -1.19585   And the cos is:  0.366222
Value is:  1.57808   And the cos is:  -0.0072851
CPU
Value is:  -0.587186   And the cos is:  0.832503
Value is:  0.785571   And the cos is:  0.706985
Value is:  -1.76461   And the cos is:  -0.192603
Value is:  -0.405069   And the cos is:  0.919075
Value is:  0.632074   And the cos is:  0.806804
Avg time for on GPU:   0.341869118065
Avg time for on CPU:   0.127962560654
Total number of elements:  16384
The first four elements in the
GPU
Value is:  0.165707   And the cos is:  0.986302
Value is:  1.1886   And the cos is:  0.372963
Value is:  1.08671   And the cos is:  0.465397
Value is:  -0.568446   And the cos is:  0.842739
Value is:  0.909357   And the cos is:  0.614253
CPU
Value is:  1.29302   And the cos is:  0.274219
Value is:  -0.0889364   And the cos is:  0.996048
Value is:  0.724472   And the cos is:  0.74885
Value is:  0.0199612   And the cos is:  0.999801
Value is:  -0.158194   And the cos is:  0.987513
Avg time for on GPU:   0.346524159014
Avg time for on CPU:   0.255263360888

##snip out middle##


Total number of elements:  262144
The first four elements in the
GPU
Value is:  0.563329   And the cos is:  0.845482
Value is:  0.107945   And the cos is:  0.99418
Value is:  0.00656393   And the cos is:  0.999978
Value is:  -1.7701   And the cos is:  -0.197987
Value is:  -0.491755   And the cos is:  0.881505
CPU
Value is:  2.65379   And the cos is:  -0.883363
Value is:  0.571239   And the cos is:  0.841232
Value is:  2.13526   And the cos is:  -0.534959
Value is:  1.69509   And the cos is:  -0.123977
Value is:  -0.0937435   And the cos is:  0.995609
Avg time for on GPU:   1.18923168063
Avg time for on CPU:   3.9944508934
Total number of elements:  524288
The first four elements in the
GPU
Value is:  0.473183   And the cos is:  0.890122
Value is:  0.791383   And the cos is:  0.702862
Value is:  0.478855   And the cos is:  0.887523
Value is:  -0.00359091   And the cos is:  0.999994
Value is:  0.226169   And the cos is:  0.974533
CPU
Value is:  0.138571   And the cos is:  0.990414
Value is:  0.121317   And the cos is:  0.99265
Value is:  0.36968   And the cos is:  0.932443
Value is:  -2.42157   And the cos is:  -0.751791
Value is:  0.667079   And the cos is:  0.785632
Avg time for on GPU:   1.68935103178
Avg time for on CPU:   8.17050975323
##end output##

Speedup ranged from 0.37x (i.e. slower, with very few elements) to 4.8x faster.

As you can see there is some overhead (loading memory onto the GPU can be slow), but the more elements you process, the bigger the speedup. Also, the complexity of the code did not increase much (and some of it got simpler, because I was using better functions to get memory on and off the device).
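
To see where that overhead goes, the host-to-device copy and the kernel can be timed separately. A rough sketch of my own (assuming the same imports and the cos_gpu_v2 kernel/mod from ex2.py; the copy back to the host is left out):

##code##
    nbr_tot = 128 * 1024
    a = numpy.random.randn(nbr_tot).astype(numpy.float32)
    a_cos = numpy.empty_like(a)
    a_gpu = cuda.mem_alloc(a.nbytes)
    out_gpu = cuda.mem_alloc(a_cos.nbytes)
    func = mod.get_function("cos_gpu_v2")

    copy_start, copy_end = drv.Event(), drv.Event()
    copy_start.record()
    cuda.memcpy_htod(a_gpu, a)                  #just the transfer onto the card
    copy_end.record()
    copy_end.synchronize()

    kern_start, kern_end = drv.Event(), drv.Event()
    kern_start.record()
    func(out_gpu, a_gpu, grid=(1024,1), block=(128,1,1))   #just the kernel
    kern_end.record()
    kern_end.synchronize()

    print "Copy time (ms):  ", copy_start.time_till(copy_end)
    print "Kernel time (ms):", kern_start.time_till(kern_end)
##endcode##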

Branching


Now let's try something that is more complex: BRANCHING!

##this is the same code as ex2.py, except that there is branching and the CPU side can't use the nice numpy one-liner, so most of the repeated code is omitted below


mod = SourceModule("""
    __global__ void cos_gpu_v2(float *out, float *a)
    {
        const int idx = blockDim.x*blockIdx.x+threadIdx.x;
        if(a[idx] > 0.5 || a[idx] < -0.5)    /* branch: do cos if the value is outside [-0.5, 0.5], and sin otherwise */
            a[idx] = cos(a[idx]);
        else
            a[idx] = sin(a[idx]);
        out[idx] = a[idx];
      }
  """)

    ##note: I only ran 64 and 128 blocks with the CPU implementation, due to speed issues
    for i in range(0,nbr_tot):        #for(int i = 0; i < nbr_tot; i++) is the C equivalent
        if(b[i] > 0.5 or b[i] < -0.5):
            b_cos[i] = numpy.cos(b[i])
        else:
            b_cos[i] = numpy.sin(b[i])
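
For completeness, here is a rough sketch of how the omitted pieces might fit together, based on ex2.py (the block counts and names here are assumptions, not the exact ex3.py source, and the timing code is left out for brevity):

##code##
    block_size = 128
    for blocks in [64, 128, 256, 512, 1024, 2048, 4096, 8192]:
        nbr_tot = blocks * block_size

        a = numpy.random.randn(nbr_tot).astype(numpy.float32)
        a_out = numpy.empty_like(a)
        func = mod.get_function("cos_gpu_v2")                #the branching kernel above
        func(cuda.Out(a_out), cuda.In(a), grid=(blocks,1), block=(block_size,1,1))

        if blocks <= 128:                                    #CPU loop only for the small sizes
            b = numpy.random.randn(nbr_tot).astype(numpy.float32)
            b_cos = numpy.empty_like(b)
            for i in range(0, nbr_tot):
                if b[i] > 0.5 or b[i] < -0.5:
                    b_cos[i] = numpy.cos(b[i])
                else:
                    b_cos[i] = numpy.sin(b[i])
##endcode##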

##output##

$ python ex3.py
Total number of elements:  8192
The first four elements in the
GPU
Value is:  1.03226   And the sin/cos is:  0.512879
Value is:  1.38142   And the sin/cos is:  0.188243
Value is:  -0.800676   And the sin/cos is:  0.696221
Value is:  0.265616   And the sin/cos is:  0.262504
Value is:  0.888949   And the sin/cos is:  0.630228
CPU
Value is:  1.4849   And the sin/cos is:  0.0857926
Value is:  0.512424   And the sin/cos is:  0.871558
Value is:  1.93362   And the sin/cos is:  -0.354911
Value is:  -2.25762   And the sin/cos is:  -0.634087
Value is:  0.542989   And the sin/cos is:  0.856168
Avg time for on GPU
0.377027838081
Avg time for on CPU
75.8683939362


Total number of elements:  16384
The first four elements in the
GPU
Value is:  0.0619954   And the sin/cos is:  0.0619557
Value is:  0.175449   And the sin/cos is:  0.17455
Value is:  0.0701176   And the sin/cos is:  0.0700602
Value is:  2.40369   And the sin/cos is:  -0.739879
Value is:  -2.93201   And the sin/cos is:  -0.978118
CPU
Value is:  -0.295365   And the sin/cos is:  -0.291089
Value is:  -0.312832   And the sin/cos is:  -0.307755
Value is:  -0.793606   And the sin/cos is:  0.701279
Value is:  -0.0676001   And the sin/cos is:  -0.0675487
Value is:  1.58798   And the sin/cos is:  -0.0171872
Avg time for on GPU
0.359288320541
Avg time for on CPU
154.235956268


Total number of elements:  262144
The first four elements in the
GPU
Value is:  -0.49059   And the sin/cos is:  -0.471146
Value is:  0.601886   And the sin/cos is:  0.824269
Value is:  -0.406498   And the sin/cos is:  -0.395395
Value is:  1.2994   And the sin/cos is:  0.268075
Value is:  0.167874   And the sin/cos is:  0.167086
Avg time for on GPU
1.39188000083


Total number of elements:  524288
The first four elements in the
GPU
Value is:  -2.3225   And the sin/cos is:  -0.682887
Value is:  1.30079   And the sin/cos is:  0.266742
Value is:  0.532119   And the sin/cos is:  0.861734
Value is:  1.03381   And the sin/cos is:  0.511548
Value is:  0.650671   And the sin/cos is:  0.795677
Avg time for on GPU
1.74932351589


Total number of elements:  1048576
The first four elements in the
GPU
Value is:  -0.552435   And the sin/cos is:  0.851249
Value is:  0.383028   And the sin/cos is:  0.373731
Value is:  0.362949   And the sin/cos is:  0.355033
Value is:  -1.17558   And the sin/cos is:  0.385008
Value is:  -0.0931009   And the sin/cos is:  -0.0929664
Avg time for on GPU
2.73544128656

##endoutput##

The speedup here starts to get big, with even the small arrays being about 200x faster! (and... I did not want to wait around for the CPU on the larger ones)


So as we can see, once we introduce branching we can get a BIG speedup, because the GPU is effectively handling one float per core at a time. And even though the amount of data doubles each time, the run time does not double but grows at a much slower (log-like?) rate, so math-intensive workloads like this can get a big speedup without a lot of work.