Comparing GPUs to CPUs in a particular situation

Say I have an embarrassingly parallel problem which can be expressed in the SIMD model. Also, all the data needed fits in the cores' registers, eliminating memory latency issues. I want to compare my GPU to my CPU in terms of computational performance.

I use the following simple program:

int sum = 0;
for (int i = 0; i < count; ++i) {
    sum += i;
}
*result = sum;

The compiler will place i, sum and count in registers, right? I test this on my CPU, which happens to be a Core i7-720QM (quad-core, 1.6 GHz without TurboBoost, 2.8 GHz with TurboBoost). It takes about 2.7 seconds to execute for a given count, so the compiler is not optimising the loop away. TurboBoost has also been turned off in the BIOS. In an ideal world, let's say I can distribute this across my four cores and achieve 2.7/4 = 0.675 seconds execution time.

Now I test the same thing on a single GPU core (<<<1, 1>>>) and it takes about 31 seconds. My GPU is a GTS 250M, so it has 96 cores; dividing by 96 gives 0.32 seconds execution time. That would mean my GPU is about twice as fast as my CPU for integer computations (and about 3.5 times faster for float).
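For concreteness, the kind of harness I have in mind looks roughly like this (the kernel wrapper and the cudaEvent-based timing are illustrative, not necessarily my exact code):

#include <cstdio>

// The loop above wrapped in a kernel, launched with <<<1, 1>>> and timed with CUDA events.
__global__ void sum_kernel(int count, int *result)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += i;
    }
    *result = sum;
}

int main()
{
    int *d_result;
    cudaMalloc((void **)&d_result, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sum_kernel<<<1, 1>>>(1 << 30, d_result);  // count is arbitrary; the int sum wraps, which is fine for timing
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_result);
    return 0;
}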

Am I conducting the experiment correctly, and is there a metric that gives me these numbers? Obviously, the 'GFLOPS' one is not quite right. Assume we don't care about memory latency and bandwidth and that the problem is perfectly parallelisable - computation only.

P.S. May I ask how shared memory compares to CPU’s L1 and L2 caches in terms of cycles? I’m finding contradictory numbers…

Thanks very much,

Yordan

The scale-up arithmetic won't be linear. There is a lot of fixed latency in the GPU architecture. Conventional wisdom says that for a GPU like your GTS 250M, 192 threads per 8-core multiprocessor are required to amortise all of the instruction pipeline latency, for example.
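To unpack that rule of thumb (a back-of-envelope calculation with approximate compute capability 1.x figures, not anything exact):

#include <stdio.h>

int main(void)
{
    /* Approximate figures for a CC 1.x multiprocessor: ~24 cycles of register
       read-after-write latency, and a 32-thread warp issued over 8 SPs takes 4 cycles. */
    int pipeline_latency = 24;
    int cycles_per_issue = 32 / 8;
    int warps_needed     = pipeline_latency / cycles_per_issue;  /* = 6 warps */
    int threads_needed   = warps_needed * 32;                    /* = 192 threads per MP */
    printf("%d warps -> %d threads per MP\n", warps_needed, threads_needed);
    return 0;
}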

@avidday: thank you for your answer. But when I increase the number of threads, the execution time seems to increase linearly (or even worse). How is that explained, and why don't I get extra credit for the better "utilisation"? Here are the results for the same program (with count decreased 4 times):

<<<1, 1>>>   -  7.9 s
<<<12, 1>>>  -  7.9 s (my GPU has 12 SMs)
<<<96, 1>>>  -  9.0 s (my GPU has 96 SPs)
<<<97, 1>>>  - 16.8 s
<<<192, 1>>> - 18.4 s
<<<193, 1>>> - 25.7 s

Threads are run in warps of 32, something like a SIMD machine, which is why 1 and 12 threads have the same runtime (1-32 threads should have the same run time for this reason). It is also why 96/97 threads and 192/193 threads have such different run times: 96 threads = 3 warps, 97 threads = 4 warps, 192 threads = 6 warps, and 193 threads = 7 warps. "Partial" warps get filled with "dummy" threads which are masked off and don't actually do anything.
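For reference, the warp count for a given block size is just the thread count rounded up to the next multiple of 32:

#include <stdio.h>

/* Warps per block: thread count rounded up to the next multiple of 32. */
static int warps_per_block(int threads) { return (threads + 31) / 32; }

int main(void)
{
    printf("%d %d %d %d\n",
           warps_per_block(96),  warps_per_block(97),   /* 3, 4 */
           warps_per_block(192), warps_per_block(193)); /* 6, 7 */
    return 0;
}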

In every case, because you are only launching a single block, the kernel runs on only a single SM, because a block always runs entirely on a single SM. You won't get multiple SMs running until you launch enough blocks to fill the first SM (probably 8 in this case, given the trivial size of the kernel). You should be able to run something like 96 blocks and still get about the same execution time as a single block, for any given number of threads per block.

The behaviour of the GPU is not like a typical SMP scalar processor in this respect.
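To make "enough blocks" concrete for this particular sum, here is a sketch (mine, not code from this thread) that splits the 0..count-1 range across every launched thread with a grid-stride loop and combines the per-thread partial sums with atomicAdd, which your compute capability supports:

#include <cstdio>

__global__ void sum_kernel_parallel(int count, int *result)
{
    // Each thread accumulates a strided slice of 0..count-1 (grid-stride loop),
    // then adds its partial sum into the single result with an atomic.
    int partial = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += gridDim.x * blockDim.x) {
        partial += i;
    }
    atomicAdd(result, partial);
}

int main()
{
    int *d_result, h_result = 0;
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));  // the atomic accumulator must start at zero

    sum_kernel_parallel<<<96, 192>>>(1 << 15, d_result);  // plenty of blocks; small count so the int sum doesn't overflow

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);
    cudaFree(d_result);
    return 0;
}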

Thanks, avidday! I seem to be familiar with most of that, but my question is actually why there is more than a one-second difference between 12 and 96 blocks. 12 would execute in a single warp, 96 would execute in 3 warps, but they should be scheduled on different SPs and hence execute perfectly in parallel.

Wrong: 32 threads of ONE block form a warp and are executed in parallel. At any given time, one instruction of one warp of one block can be executed on a multiprocessor of the GPU.

So if you launch <<<1, 1>>>, that takes as long as <<<1, 32>>> (one warp is executed in either case). But if you launch <<<32, 1>>>, that will take longer than <<<1, 1>>>, because 32 warps have to be executed (and partially sequentially, since there are not enough multiprocessors to run them all simultaneously). The threads of one warp are executed in parallel on the SPs (cores) of one MP - the MP on which that block happens to be executed.

Ceearem

One addition: this means you should try something like the following configurations (see the timing sketch after the list):

<<<1, 1>>>
<<<1, 32>>>
<<<1, 128>>>
<<<1, 256>>>
<<<4, 32>>>
<<<4, 128>>>
<<<8, 32>>>
<<<8, 128>>>
<<<16, 32>>>
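A minimal sketch of such a sweep (illustrative; it reuses the simple sum kernel from the start of the thread and times each configuration with CUDA events):

#include <cstdio>

__global__ void sum_kernel(int count, int *result)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += i;
    *result = sum;
}

int main()
{
    const int configs[][2] = { {1, 1}, {1, 32}, {1, 128}, {1, 256},
                               {4, 32}, {4, 128}, {8, 32}, {8, 128}, {16, 32} };
    int *d_result;
    cudaMalloc((void **)&d_result, sizeof(int));

    for (int c = 0; c < 9; ++c) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        sum_kernel<<<configs[c][0], configs[c][1]>>>(1 << 24, d_result);  // count is arbitrary
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("<<<%2d, %3d>>> : %8.1f ms\n", configs[c][0], configs[c][1], ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_result);
    return 0;
}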

The sum itself has a closed form: sum = (count-1)*count/2, which takes 0 ms :)

Beyond that, it depends on how you write the reduction - see page 34, near the end of:

http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf
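For context, the shared-memory tree reduction those slides build up looks roughly like this (an unoptimised sketch; the kernel name and the per-block partial-sums scheme are just my illustration, and the slides go much further):

// Each block loads a slice of the input into shared memory, reduces it to one
// value with a tree of pairwise additions, and writes one partial sum per block.
// Assumes blockDim.x is a power of two; the partial sums are then reduced again
// (or simply added up on the host).
__global__ void block_sum(const int *in, int *partial, int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}

// Launched with the shared-memory size as the third configuration parameter, e.g.:
//   block_sum<<<numBlocks, 256, 256 * sizeof(int)>>>(d_in, d_partial, n);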

For the CPU, try something like this, because each core can execute three independent operations at the same time:

int sum1 = 0, sum2 = 0, sum3 = 0;
// Note: assumes count is a multiple of 3; handle any remaining iterations separately.
for (int i = 0; i < count; i += 3) {
    sum1 += i;
    sum2 += i + 1;
    sum3 += i + 2;
}
*result = sum1 + sum2 + sum3;
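A quick standalone check of the unrolled loop against the closed form above (the count value is arbitrary and chosen as a multiple of 3; 64-bit accumulators are used here just so the check itself doesn't overflow):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    const long long count = 300000000LL;   /* arbitrary, a multiple of 3 */
    long long sum1 = 0, sum2 = 0, sum3 = 0;

    for (long long i = 0; i < count; i += 3) {
        sum1 += i;
        sum2 += i + 1;
        sum3 += i + 2;
    }

    /* 0 + 1 + ... + (count-1) = (count-1)*count/2 */
    assert(sum1 + sum2 + sum3 == (count - 1) * count / 2);
    printf("sum = %lld\n", sum1 + sum2 + sum3);
    return 0;
}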