Device speed vs. host speed: why is my device program so slow?

I noticed that my application, which involves only integer arithmetic, is running incredibly slow on the graphics card (8800GTS with 320MB RAM). It is running much faster in emulation mode. The speed difference is so huge that it makes no sense at all to use CUDA, even if an almost perfect parallelisation of the application were possible.

What am I doing wrong? This is a very simple test program test.cu:

#include <stdio.h>
#include <cutil.h>

__global__ void run(int *x)
{
    int y = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 100000; j++) { y++; if (y == 2007) y = 0; }

    *x = y;
}

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int x, *dx;
    cudaMalloc((void **) &dx, sizeof(int));

    dim3 grid(1);
    dim3 threads(1);

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    run<<< grid, threads >>>(dx);
    // the blocking cudaMemcpy waits for the kernel to finish, so the
    // timer value read below covers the whole kernel run
    CUDA_SAFE_CALL(cudaMemcpy(&x, dx, sizeof(int), cudaMemcpyDeviceToHost));

    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("x = %d\n", x);
}

Running in emulation mode (the host processor runs at 2.6 GHz) gives this output:

Processing time: 487.514008 (ms)
x = 1225

Running the real thing I get this:

Processing time: 11812.893555 (ms)
x = 1225

What’s going on? Maybe somebody can post their results running this test?

Markus

You are using 1 processor out of the 96 available on your GTS.

That was my intention. I wanted to compare the performance of a single GPU processor to the performance of the host processor. The factor is about 1/24. So 96 GPU processors together would only have 96/24 = 4 times the performance of the single host processor (which is not even a dual-core system, but a very cheap Celeron). And the real-world performance of 96 processors in an application is never as high as 96 times the performance of a single GPU processor, of course, even without all the restrictions imposed by the programming model.

Is the integer performance of a single GPU processor really that poor?

Markus

I’d say the problem is loop performance. I’d expect a single GPU processor to be slower than the CPU on that.

There are several issues here, all tied to the specifics of the hardware implementation: different clock speeds, instruction pre-fetching and dispatch, caching, compiler optimizations, and so on.

If you're curious, compare the CPU assembly with the PTX assembly generated for the GPU code. I'm guessing they'll be fairly different.

If you just want to measure integer performance, I'd write a different kernel. AFAIK, integer performance should be similar to floating-point performance under CUDA.
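A minimal sketch of what such a kernel might look like (the names and constants are illustrative, not code from this thread): each thread runs several independent integer operation chains, so consecutive instructions don't stall on each other, and many threads can be launched to keep the multiprocessors busy.

__global__ void int_throughput(int *out)
{
    // four independent chains of integer multiply-adds per thread;
    // their independence lets the hardware overlap instruction latencies
    int a = threadIdx.x + 1;
    int b = blockIdx.x + 2;
    int c = 3, d = 4;
    for (int i = 0; i < 100000; i++) {
        a = a * 3 + 1;
        b = b * 5 + 2;
        c = c * 7 + 3;
        d = d * 9 + 4;
    }
    // write one result per thread so the work is not optimised away
    // (out must hold gridDim.x * blockDim.x ints)
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

// launched with many threads, e.g.:  int_throughput<<< 96, 192 >>>(d_out);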

I did some playing around and can confirm roughly the same thing you saw, but on an 8800 GTX (128 processors) in a heavily loaded machine with a 2.4 GHz dual-core Athlon64:

CPU

Processing time: 311.145996 (ms)

x = 1225

GPU

Processing time: 10396.617188 (ms)

x = 1225

Then I did some tests where I increased the number of threads and blocks, but decreased the number of iterations in the inner loop proportionately. So the code should perform roughly the same number of integer operations, just spread over all the ALUs. (Note the return value will vary since the inner loop limit is changing.)

16 blocks x 16 threads

Processing time: 40.786999 (ms)

x = 1642

32 blocks x 32 threads

Processing time: 10.318000 (ms)

x = 1664

64 blocks x 64 threads

Processing time: 3.434000 (ms)

x = 916

128 blocks x 128 threads

Processing time: 5.329000 (ms)

x = 979

Based on this, my guess is that your single-threaded kernel is not keeping the pipeline very full. You really can’t extrapolate performance from one thread unfortunately, since the GPU hardware is almost de-optimized for that case.
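The modified kernel wasn't posted; a rough sketch of how the iterations might have been spread over the threads (the exact split is an assumption) could look like this:

__global__ void run_parallel(int *x)
{
    // divide the inner-loop iterations by the total number of threads,
    // so the overall amount of integer work stays roughly constant
    int total = gridDim.x * blockDim.x;
    int y = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 100000 / total; j++) { y++; if (y == 2007) y = 0; }

    // every thread computes the same value and writes it to the same
    // location; that is harmless here and mirrors the original kernel
    *x = y;
}

// e.g.:  dim3 grid(64); dim3 threads(64); run_parallel<<< grid, threads >>>(dx);

As noted above, the final value of x changes with the configuration, because the inner-loop limit changes.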

Many thanks for your efforts; I will repeat these experiments later today.

So my mistake was assuming that the speedup from parallelizing the application would be limited to a factor of about 96, the number of processors. In fact, you achieved a speedup of more than 3000 by using 4096 threads. I cannot expect that much for a real application, but even a factor of 300 would be a nice success.

Markus

Here I am again. I tried to speed up my application (not the above test snippet) by a factor of more than 96 by using many threads. This did indeed work. The best configuration was 24 blocks, each with 24 threads.

Still, in this configuration the program was only slightly faster on the GPU than with a single thread in emulation mode. This was not surprising, because in the 1 block / 1 thread configuration the GPU run takes about 350 times as long as the emulation run.

My first thought was that this might be due to the use of local memory instead of registers. But I was able to strip the program down to a few lines that behave even worse:

#include <stdio.h>
#include <cutil.h>

__global__ void run(int *X)
{
    int x = 1, y, z;

    for (int i = 0; i < 200000000; i++) {
        y = x + 1; z = y + 1; x = z + 1;
    }

    *X = x + y + z;
}

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int *X;
    cudaMalloc((void **) &X, sizeof(int));

    dim3 grid(1);
    dim3 threads(1);

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    run<<< grid, threads >>>(X);

    int x;
    CUDA_SAFE_CALL(cudaMemcpy(&x, X, sizeof(int), cudaMemcpyDeviceToHost));

    printf("Result = %d\n", x);
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
}

Running in emulation mode:

Result = 1800000000

Processing time: 4.223000 (ms)

Running on the GPU:

Result = 1800000000

Processing time: 13504.950195 (ms)

That’s a factor of almost 3200. ???

Markus

It seems that the host compiler is busy as a bee and optimises the loop out of my program (I tried to prevent this with the x=…, y=…, z=… construction, but g++ is just too clever). That explains the absurd 4 ms result. I have to do more experiments…
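One common way to keep the compiler from folding the loop away (a sketch, not necessarily what was tried here) is to declare the variables volatile, which forces every assignment in the loop to actually be executed:

__global__ void run(int *X)
{
    // volatile forces the compiler to perform every read and write
    // instead of collapsing the loop into a closed-form result
    volatile int x = 1, y = 0, z = 0;
    for (int i = 0; i < 200000000; i++) {
        y = x + 1; z = y + 1; x = z + 1;
    }
    *X = x + y + z;
}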

Markus

Some quick notes:
Running only one thread on the GPU is very inefficient.

Instructions have latencies (the Programming Guide states that one should run at least 192 threads in parallel to hide latency due to register dependencies). If you were to hide the latencies, you’d see a very significant speedup factor.

The memory system of a GPU is designed for parallel processing. On current hardware, unless you have a multiple of 16 threads per block, you can't get fully coalesced memory reads/writes (see the Programming Guide for details). But if 16 threads read/write in a coalesced way, you'll see up to a 10x speedup (and that's without latency hiding). For global memory reads alone, if you combine latency hiding (say, 384 or more threads running concurrently on each multiprocessor, i.e. 16 x 384 threads in total on a card with 16 multiprocessors) and coalescing, you should see a speedup of about 40x. All of this ignores parallelizing the actual computation, which would help further.
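As a rough illustration of a coalesced pattern (a sketch with made-up names, not code from this thread): consecutive threads touch consecutive 32-bit words, so the accesses of a half-warp can be combined into a few wide memory transactions.

__global__ void scale_by_two(int *out, const int *in, int n)
{
    // thread i reads and writes element i: neighbouring threads access
    // neighbouring addresses, which is what coalescing requires
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2;
}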

The CPU code likely benefits from loop unrolling as part of the compiler optimizations. The current CUDA compiler does not unroll loops for the GPU. However, since branches add overhead, you would see a benefit from unrolling your loops by hand (say, reduce the number of inner-loop iterations by a factor of 16 while increasing the amount of work per iteration accordingly).
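For the original inner loop, hand-unrolling could look roughly like this (the unroll factor of 4 is just for illustration; a factor of 16 as suggested above works the same way):

// unrolled by 4: the loop branch is taken a quarter as often, while the
// data-dependent "if" still runs for every increment
for (int j = 0; j < 100000; j += 4) {
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
}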

Paulius