Comparing GPUs to CPUs in a particular situation

Say I have an embarrassingly parallel problem which can be expressed in the SIMD model. Also, all the data needed fits in the cores' registers, eliminating memory latency issues. I want to compare my GPU to my CPU in terms of computational performance.

I use the following simple program:

int sum = 0;
for (int i = 0; i < count; ++i) {
    sum += i;
}
*result = sum;

The compiler will place i, sum and count in registers, right? I test this on my CPU, which happens to be a Core i7-720QM (quad-core, 1.6 GHz without TurboBoost, 2.8 GHz with TurboBoost). It takes about 2.7 seconds to execute for a given count, so the compiler is not optimising the loop away. TurboBoost has also been turned off in the BIOS. In an ideal world, let's say I can distribute this across my four cores and achieve 2.7/4 = 0.675 seconds execution time.

Now I test the same thing on a single GPU core (<<<1, 1>>>) and it takes about 31 seconds. My GPU is a GTS 250M, so it has 96 cores; dividing by 96 gives 0.32 seconds execution time. That would mean my GPU is about twice as fast as my CPU for integer computations (and about 3.5 times faster for float).
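For concreteness, the kind of harness I have in mind looks roughly like this (the kernel wrapper and the cudaEvent-based timing are illustrative, not necessarily my exact code):

#include <cstdio>

// The loop above wrapped in a kernel, launched with <<<1, 1>>> and timed with CUDA events.
__global__ void sum_kernel(int count, int *result)
{
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += i;
    }
    *result = sum;
}

int main()
{
    int *d_result;
    cudaMalloc((void **)&d_result, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sum_kernel<<<1, 1>>>(1 << 30, d_result);  // count is arbitrary; the int sum wraps, which is fine for timing
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.1f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_result);
    return 0;
}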

Am I conducting the experiment correctly, and is there a metric that gives me these numbers? Obviously, the 'GFLOPS' one is not quite right. Assume we don't care about memory latency and bandwidth and that the problem is perfectly parallelisable - computation only.

P.S. May I ask how shared memory compares to CPU’s L1 and L2 caches in terms of cycles? I’m finding contradictory numbers…

Thanks very much,

Yordan

The scale-up arithmetic won't be linear. There is a lot of fixed latency in the GPU architecture. Conventional wisdom says that for a GPU like your GTS 250M, 192 threads per 8-core multiprocessor are required to amortise all of the instruction pipeline latency, for example.
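To unpack that rule of thumb (a back-of-envelope calculation with approximate compute capability 1.x figures, not anything exact):

#include <stdio.h>

int main(void)
{
    /* Approximate figures for a CC 1.x multiprocessor: ~24 cycles of register
       read-after-write latency, and a 32-thread warp issued over 8 SPs takes 4 cycles. */
    int pipeline_latency = 24;
    int cycles_per_issue = 32 / 8;
    int warps_needed     = pipeline_latency / cycles_per_issue;  /* = 6 warps */
    int threads_needed   = warps_needed * 32;                    /* = 192 threads per MP */
    printf("%d warps -> %d threads per MP\n", warps_needed, threads_needed);
    return 0;
}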

@avidday: thank you for your answer. But when I increase the number of threads, the execution time seems to increase linearly (or even worse). How is that explained, and why don't I get extra credit for the better "utilisation"? Here are the results for the same program (with count decreased 4 times):

<<<1, 1>>>   -  7.9 s
<<<12, 1>>>  -  7.9 s (my GPU has 12 SMs)
<<<96, 1>>>  -  9.0 s (my GPU has 96 SPs)
<<<97, 1>>>  - 16.8 s
<<<192, 1>>> - 18.4 s
<<<193, 1>>> - 25.7 s

Threads are run in warps of 32, something like a SIMD machine, which is why 1 and 12 threads have the same runtime (1-32 threads should have the same run time for this reason). It is also why 96/97 threads and 192/193 threads have such different run times: 96 threads = 3 warps, 97 threads = 4 warps, 192 threads = 6 warps, and 193 threads = 7 warps. "Partial" warps get filled with "dummy" threads which are masked off and don't actually do anything.
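For reference, the warp count for a given block size is just the thread count rounded up to the next multiple of 32:

#include <stdio.h>

/* Warps per block: thread count rounded up to the next multiple of 32. */
static int warps_per_block(int threads) { return (threads + 31) / 32; }

int main(void)
{
    printf("%d %d %d %d\n",
           warps_per_block(96),  warps_per_block(97),   /* 3, 4 */
           warps_per_block(192), warps_per_block(193)); /* 6, 7 */
    return 0;
}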

In every case, because you are only launching a single block, the kernel runs on only a single SM, because a block always runs entirely on a single SM. You won't get multiple SMs running until you launch enough blocks to fill the first SM (probably 8 in this case, given the trivial size of the kernel). You should be able to run something like 96 blocks and still get about the same execution time as a single block, for any given number of threads per block.

The behaviour of the GPU is not like a typical SMP scalar processor in this respect.
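To make "enough blocks" concrete for this particular sum, here is a sketch (mine, not code from this thread) that splits the 0..count-1 range across every launched thread with a grid-stride loop and combines the per-thread partial sums with atomicAdd, which your compute capability supports:

#include <cstdio>

__global__ void sum_kernel_parallel(int count, int *result)
{
    // Each thread accumulates a strided slice of 0..count-1 (grid-stride loop),
    // then adds its partial sum into the single result with an atomic.
    int partial = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < count; i += gridDim.x * blockDim.x) {
        partial += i;
    }
    atomicAdd(result, partial);
}

int main()
{
    int *d_result, h_result = 0;
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));  // the atomic accumulator must start at zero

    sum_kernel_parallel<<<96, 192>>>(1 << 15, d_result);  // plenty of blocks; small count so the int sum doesn't overflow

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);
    cudaFree(d_result);
    return 0;
}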

Thanks, avidday! I seem to be familiar with most of that, but my question is actually why there is more than a one-second difference between 12 and 96 blocks. 12 would execute in a single warp, 96 would execute in 3 warps, but they should be scheduled on different SPs and hence execute perfectly in parallel.

Wrong: 32 threads of ONE block form a warp and are executed in parallel. At any given time, one instruction of one warp of one block can be executed on a multiprocessor of the GPU.

So if you launch <<<1, 1>>>, that takes as long as <<<1, 32>>> (one warp is executed in either case). But if you launch <<<32, 1>>>, that will take longer than <<<1, 1>>>, because 32 warps have to be executed (and partially sequentially, since there are not enough multiprocessors to run them all simultaneously). The threads of one warp are executed in parallel on the SPs (cores) of one MP - the MP on which that block happens to be executed.

Ceearem

One addition: this means you should try something like the following configurations (see the timing sketch after the list):

<<<1, 1>>>
<<<1, 32>>>
<<<1, 128>>>
<<<1, 256>>>
<<<4, 32>>>
<<<4, 128>>>
<<<8, 32>>>
<<<8, 128>>>
<<<16, 32>>>
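A minimal sketch of such a sweep (illustrative; it reuses the simple sum kernel from the start of the thread and times each configuration with CUDA events):

#include <cstdio>

__global__ void sum_kernel(int count, int *result)
{
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += i;
    *result = sum;
}

int main()
{
    const int configs[][2] = { {1, 1}, {1, 32}, {1, 128}, {1, 256},
                               {4, 32}, {4, 128}, {8, 32}, {8, 128}, {16, 32} };
    int *d_result;
    cudaMalloc((void **)&d_result, sizeof(int));

    for (int c = 0; c < 9; ++c) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        sum_kernel<<<configs[c][0], configs[c][1]>>>(1 << 24, d_result);  // count is arbitrary
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("<<<%2d, %3d>>> : %8.1f ms\n", configs[c][0], configs[c][1], ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d_result);
    return 0;
}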

The sum itself has a closed form: sum = (count-1)*count/2, which takes 0 ms :)

Beyond that, it depends on how you write the reduction - see page 34, near the end of:

http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf
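For context, the shared-memory tree reduction those slides build up looks roughly like this (an unoptimised sketch; the kernel name and the per-block partial-sums scheme are just my illustration, and the slides go much further):

// Each block loads a slice of the input into shared memory, reduces it to one
// value with a tree of pairwise additions, and writes one partial sum per block.
// Assumes blockDim.x is a power of two; the partial sums are then reduced again
// (or simply added up on the host).
__global__ void block_sum(const int *in, int *partial, int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = sdata[0];
}

// Launched with the shared-memory size as the third configuration parameter, e.g.:
//   block_sum<<<numBlocks, 256, 256 * sizeof(int)>>>(d_in, d_partial, n);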

For the CPU, try something like this, because each core can execute three independent operations at the same time:

int sum1 = 0, sum2 = 0, sum3 = 0;
// Note: assumes count is a multiple of 3; handle any remaining iterations separately.
for (int i = 0; i < count; i += 3) {
    sum1 += i;
    sum2 += i + 1;
    sum3 += i + 2;
}
*result = sum1 + sum2 + sum3;
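A quick standalone check of the unrolled loop against the closed form above (the count value is arbitrary and chosen as a multiple of 3; 64-bit accumulators are used here just so the check itself doesn't overflow):

#include <assert.h>
#include <stdio.h>

int main(void)
{
    const long long count = 300000000LL;   /* arbitrary, a multiple of 3 */
    long long sum1 = 0, sum2 = 0, sum3 = 0;

    for (long long i = 0; i < count; i += 3) {
        sum1 += i;
        sum2 += i + 1;
        sum3 += i + 2;
    }

    /* 0 + 1 + ... + (count-1) = (count-1)*count/2 */
    assert(sum1 + sum2 + sum3 == (count - 1) * count / 2);
    printf("sum = %lld\n", sum1 + sum2 + sum3);
    return 0;
}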