Device speed vs. host speed: why is my device program so slow?

I noticed that my application, which involves only integer arithmetic, is running incredibly slow on the graphics card (8800GTS with 320MB RAM). It is running much faster in emulation mode. The speed difference is so huge that it makes no sense at all to use CUDA, even if an almost perfect parallelisation of the application were possible.

What am I doing wrong? This is a very simple test program test.cu:

#include <stdio.h>
#include <cutil.h>

__global__ void run(int *x)
{
    int y = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 100000; j++) { y++; if (y == 2007) y = 0; }

    *x = y;
}

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int x, *dx;
    cudaMalloc((void **) &dx, sizeof(int));

    dim3 grid(1);
    dim3 threads(1);

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    run<<< grid, threads >>>(dx);
    // the blocking cudaMemcpy waits for the kernel to finish, so the
    // timer value read below covers the whole kernel run
    CUDA_SAFE_CALL(cudaMemcpy(&x, dx, sizeof(int), cudaMemcpyDeviceToHost));

    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("x = %d\n", x);
}

Running in emulation mode (the host processor runs at 2.6 GHz) gives this output:

Processing time: 487.514008 (ms)
x = 1225

Running the real thing I get this:

Processing time: 11812.893555 (ms)
x = 1225

What’s going on? Maybe somebody can post their results running this test?

Markus

You are using 1 processor out of the 96 available on your GTS.

That was my intention. I wanted to compare the performance of a single GPU processor to the performance of the host processor. The factor is about 1/24. So 96 GPU processors together would only have 96/24 = 4 times the performance of the single host processor (which is not even a dual-core system, but a very cheap Celeron). And the real-world performance of 96 processors in an application is never as high as 96 times the performance of a single GPU processor, of course, even without all the restrictions imposed by the programming model.

Is the integer performance of a single GPU processor really that poor?

Markus

I’d say the problem is loop performance. I’d expect a single GPU processor to be slower than the CPU on that.

There are several issues here, all tied to the specifics of the hardware implementation: different clock speeds, instruction pre-fetching and dispatch, caching, compiler optimizations, and so on.

If you're curious, compare the CPU assembly with the PTX assembly generated for the GPU code. I'm guessing they'll be fairly different.

If you just want to measure integer performance, I'd write a different kernel. AFAIK, integer performance should be similar to floating-point performance under CUDA.
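A minimal sketch of what such a kernel might look like (the names and constants are illustrative, not code from this thread): each thread runs several independent integer operation chains, so consecutive instructions don't stall on each other, and many threads can be launched to keep the multiprocessors busy.

__global__ void int_throughput(int *out)
{
    // four independent chains of integer multiply-adds per thread;
    // their independence lets the hardware overlap instruction latencies
    int a = threadIdx.x + 1;
    int b = blockIdx.x + 2;
    int c = 3, d = 4;
    for (int i = 0; i < 100000; i++) {
        a = a * 3 + 1;
        b = b * 5 + 2;
        c = c * 7 + 3;
        d = d * 9 + 4;
    }
    // write one result per thread so the work is not optimised away
    // (out must hold gridDim.x * blockDim.x ints)
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

// launched with many threads, e.g.:  int_throughput<<< 96, 192 >>>(d_out);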

I did some playing around and can confirm roughly the same thing you saw, but on an 8800 GTX (128 processors) in a heavily loaded machine with a 2.4 GHz dual-core Athlon64:

CPU

Processing time: 311.145996 (ms)

x = 1225

GPU

Processing time: 10396.617188 (ms)

x = 1225

Then I did some tests where I increased the number of threads and blocks, but decreased the number of iterations in the inner loop proportionately. So the code should perform roughly the same number of integer operations, just spread over all the ALUs. (Note the return value will vary since the inner loop limit is changing.)

16 blocks x 16 threads

Processing time: 40.786999 (ms)

x = 1642

32 blocks x 32 threads

Processing time: 10.318000 (ms)

x = 1664

64 blocks x 64 threads

Processing time: 3.434000 (ms)

x = 916

128 blocks x 128 threads

Processing time: 5.329000 (ms)

x = 979

Based on this, my guess is that your single-threaded kernel is not keeping the pipeline very full. You really can’t extrapolate performance from one thread unfortunately, since the GPU hardware is almost de-optimized for that case.
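The modified kernel wasn't posted; a rough sketch of how the iterations might have been spread over the threads (the exact split is an assumption) could look like this:

__global__ void run_parallel(int *x)
{
    // divide the inner-loop iterations by the total number of threads,
    // so the overall amount of integer work stays roughly constant
    int total = gridDim.x * blockDim.x;
    int y = 0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 100000 / total; j++) { y++; if (y == 2007) y = 0; }

    // every thread computes the same value and writes it to the same
    // location; that is harmless here and mirrors the original kernel
    *x = y;
}

// e.g.:  dim3 grid(64); dim3 threads(64); run_parallel<<< grid, threads >>>(dx);

As noted above, the final value of x changes with the configuration, because the inner-loop limit changes.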

Many thanks for your efforts; I will repeat these experiments later today.

So my mistake was assuming that the speedup from parallelizing the application would be limited to a factor of about 96, the number of processors. In fact, you achieved a speedup of more than 3000 by using 4096 threads. I cannot expect that much for a real application, but even a factor of 300 would be a nice success.

Markus

Here I am again. I tried to speed up my application (not the above test snippet) by a factor of more than 96 by using many threads. This did indeed work. The best configuration was 24 blocks, each with 24 threads.

Still, in this configuration the program was only slightly faster on the GPU than with a single thread in emulation mode. This was not surprising, because in the 1 block / 1 thread configuration the GPU run takes about 350 times as long as the emulation run.

My first thought was that this might be due to the use of local memory instead of registers. But I was able to strip the program down to a few lines that behave even worse:

#include <stdio.h>
#include <cutil.h>

__global__ void run(int *X)
{
    int x = 1, y, z;

    for (int i = 0; i < 200000000; i++) {
        y = x + 1; z = y + 1; x = z + 1;
    }

    *X = x + y + z;
}

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int *X;
    cudaMalloc((void **) &X, sizeof(int));

    dim3 grid(1);
    dim3 threads(1);

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    run<<< grid, threads >>>(X);

    int x;
    CUDA_SAFE_CALL(cudaMemcpy(&x, X, sizeof(int), cudaMemcpyDeviceToHost));

    printf("Result = %d\n", x);
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
}

Running in emulation mode:

Result = 1800000000

Processing time: 4.223000 (ms)

Running on the GPU:

Result = 1800000000

Processing time: 13504.950195 (ms)

That’s a factor of almost 3200. ???

Markus

It seems that the host compiler is busy as a bee and optimises the loop out of my program (I tried to prevent this with the x=…, y=…, z=… construction, but g++ is just too clever). That explains the absurd 4 ms result. I have to do more experiments…
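One common way to keep the compiler from folding the loop away (a sketch, not necessarily what was tried here) is to declare the variables volatile, which forces every assignment in the loop to actually be executed:

__global__ void run(int *X)
{
    // volatile forces the compiler to perform every read and write
    // instead of collapsing the loop into a closed-form result
    volatile int x = 1, y = 0, z = 0;
    for (int i = 0; i < 200000000; i++) {
        y = x + 1; z = y + 1; x = z + 1;
    }
    *X = x + y + z;
}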

Markus

Some quick notes:
Running only one thread on the GPU is very inefficient.

Instructions have latencies (the Programming Guide states that one should run at least 192 threads in parallel to hide latency due to register dependencies). If you were to hide the latencies, you’d see a very significant speedup factor.

The memory system of a GPU is designed for parallel processing. On current hardware, unless you have a multiple of 16 threads per block, you can't get fully coalesced memory reads/writes (see the Programming Guide for details). But if 16 threads read/write in a coalesced way, you'll see up to a 10x speedup (and that's without latency hiding). For global memory reads alone, if you combine latency hiding (say, 384 or more threads running concurrently on each multiprocessor, i.e. 16 x 384 threads in total on a card with 16 multiprocessors) and coalescing, you should see a speedup of about 40x. All of this ignores parallelizing the actual computation, which would help further.
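As a rough illustration of a coalesced pattern (a sketch with made-up names, not code from this thread): consecutive threads touch consecutive 32-bit words, so the accesses of a half-warp can be combined into a few wide memory transactions.

__global__ void scale_by_two(int *out, const int *in, int n)
{
    // thread i reads and writes element i: neighbouring threads access
    // neighbouring addresses, which is what coalescing requires
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2;
}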

The CPU code likely benefits from loop unrolling as part of the compiler optimizations. The current CUDA compiler does not unroll loops for the GPU. However, since branches add overhead, you would see a benefit from unrolling your loops by hand (say, reduce the number of inner-loop iterations by a factor of 16 while increasing the amount of work per iteration accordingly).
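For the original inner loop, hand-unrolling could look roughly like this (the unroll factor of 4 is just for illustration; a factor of 16 as suggested above works the same way):

// unrolled by 4: the loop branch is taken a quarter as often, while the
// data-dependent "if" still runs for every increment
for (int j = 0; j < 100000; j += 4) {
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
    y++; if (y == 2007) y = 0;
}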

Paulius