CUDA performance

Hi all,

I have to show my team GPU performance compared to the CPU's.
It is certainly correct to say that CPU execution time increases linearly with the number of iterations.
I have some problems with the GPU one, though.
I am referring to the variable allocation time PLUS the execution time.
There is a fixed amount of time needed to allocate variables and get the results back. If programmed properly, the total computation time should grow much more slowly than the CPU one (ideally negligibly).
My problem is understanding what happens when the number of operations is very low. I had a case where I used a kernel with only 50 operations and the total computing time was several seconds, whereas when several thousand operations were executed this time was only a few milliseconds.
Is that consistent with your experience?
If not, what could be the trick? And if yes, how can we justify the fact that the computation time decreases?
I hope I have been clear enough. Could someone tell me what the differences should be if we execute the same kernel with very few threads (10 or 50) versus several million (10M or 500M)?
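
For what it's worth, here is a minimal timing sketch of what I mean (the kernel, sizes, and names are placeholders, not my actual code); it separates the allocation/transfer overhead from the kernel execution itself:

```cpp
// Minimal timing sketch: separates allocation/transfer overhead from kernel time.
// dummyKernel and the buffer size are placeholders, not the actual workload.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // a few trivial operations per thread
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMalloc(&d, n * sizeof(float));                             // allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // upload
    cudaEventRecord(t1);

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);                   // computation
    cudaEventRecord(t2);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // download
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float msSetup, msKernel, msDownload;
    cudaEventElapsedTime(&msSetup, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    cudaEventElapsedTime(&msDownload, t2, t3);
    printf("alloc+upload %.3f ms, kernel %.3f ms, download %.3f ms\n",
           msSetup, msKernel, msDownload);

    cudaFree(d);
    free(h);
    return 0;
}
```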

Can you please clarify which grid and block sizes you have used?

Launching a kernel causes some overhead. It has been reported on this forum that for large grids (with dimensions greater than 8192, if I remember correctly) the overhead is much higher than for tiny grids (the documentation says nothing about this dependency, so it may be some kind of bug which will be fixed). If we don't count this overhead, running time should scale linearly with the number of thread blocks.
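
If you want to see the launch overhead in isolation, a rough sketch like this (an empty kernel timed over many launches; the grid size is only an example) should show it:

```cpp
// Rough sketch: estimate kernel launch overhead with an empty kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    emptyKernel<<<1, 1>>>();            // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<4096, 256>>>();   // example grid; vary the dimensions to see the effect
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch time: %.3f us\n", 1000.0f * ms / launches);
    return 0;
}
```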

From a practical point of view you should avoid using grids that are too small (as this will increase overall overhead) and kernels that run too long (as this will freeze the display and may trigger the 5-second watchdog limit).

The startup time on my machine (quad-core with an 8800 GTX) is about 0.15 seconds. This includes allocating (almost all) memory on the device, so this is not really a problem.

Copying memory to and from the device can take a significant amount of time, although it can handle up to 8 GB/s (Dell on PCIe).
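
A rough way to check what your own machine delivers is a sketch like this (arbitrary size, pageable host memory; pinned memory would do better):

```cpp
// Rough host->device bandwidth check (pageable memory; pinned memory is faster).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB test buffer
    char *h = (char *)malloc(bytes);
    char *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host -> device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    free(h);
    return 0;
}
```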

If you perform the calculation directly on global memory, it will be slow. Copy the data to shared memory, perform the calculations there, and copy the result back; this will be much faster.
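
For illustration (not your code, and the operation here is trivial, so the pattern only really pays off when each element is reused several times), the usual shape is:

```cpp
// Stage-into-shared-memory pattern: one coalesced read, work in shared memory,
// one coalesced write back. Assumes the block is launched with 256 threads.
__global__ void scaleWithShared(const float *in, float *out, int n, float factor)
{
    __shared__ float tile[256];                 // one element per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];              // single read from global memory
    __syncthreads();

    if (i < n) {
        float v = tile[threadIdx.x];            // all further work uses shared memory
        v = v * factor + 1.0f;
        out[i] = v;                             // single write back to global memory
    }
}
```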

There are many more factors which will increase or decrease performance, but my rule of thumb is: do as little with global memory as possible, perform as many calculations as possible in shared/local memory, and avoid the PCIe bus ;-)
Then you can look at coalescing, thread divergence, etc.
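
On the coalescing point, the difference is simply whether consecutive threads in a warp touch consecutive addresses; a sketch of the two access patterns (made-up kernels, just to show the idea):

```cpp
// Coalesced: consecutive threads read consecutive elements.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread reads an element 'stride' apart from its neighbour's,
// which breaks coalescing and typically runs far slower on the same data.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}
```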

Oh, and use all the 512 threads in a block with as many blocks as possible!

I tested with the matrixMul example.

Here you can see the speedup and calculation time (logarithmic scale):
[charts not reproduced]

256 may be a better choice (if you do not have a lot of information to share between threads). You can have 768 threads per multiprocessor, so with 512 threads per block you can have only 1 thread block per MP, but with 256 you can have 3 thread blocks per MP (when not using too many registers and too much shared memory). So latency hiding may be more effective with 256 threads per block.
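
Concretely, assuming register and shared-memory usage allow it: 768 / 512 = 1 resident block per MP (only 512 of the 768 thread slots used), while 768 / 256 = 3 resident blocks per MP (all 768 slots used).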

Agreed. I have one kernel which uses fewer than 16 registers, so I could run 512 threads per block, but performance with 256 threads/block is about 25% better.

Thanks guys!

I thought it would be more efficient to use as many threads per block as possible. Some time ago I started with, say, 300 threads (nice blocks of 10x30 ;-) but noticed an increase in performance when stepping up to 512 (16x32). I used this ‘knowledge’ for another program. But I’ve just halved the block for this one (16x16), and the running time went from 0.92 to 0.58 seconds.

Many, many thanks for this insight!

If only the maximum number of threads per multiprocessor had been 1024 :D

Thanks for the answers, guys. But this chart caught my attention.

Comparing the values for N=1 and N=10, the ratio decreases.

Does this mean that for N=10 the CPU is slower, compared to the GPU, than it is for N=1?

Is that logical? And what exactly is N?

Well, check the matrixMul example, maybe? Jordy stated that he tested it with that example, so I guess the N is coming from there.

Block sizes that are not multiples of 32 threads will lead to inefficient utilization of the GPU. That chart shows block sizes from 1 to 16 threads. Since the warp size of the GPU is 32 threads, these sizes will not even keep a single multiprocessor busy or hide pipeline latency. I recommend using blocks of between 128 and 384 threads, in multiples of 32.
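
For example, a minimal launch configuration along those lines (myKernel and the sizes are placeholders):

```cpp
// Launch-configuration sketch: block size a multiple of 32 (here 256),
// grid rounded up so that all n elements are covered. myKernel is a placeholder.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;        // bounds check handles the padded last block
}

void launch(float *d_data, int n)
{
    const int threadsPerBlock = 256;                                 // multiple of 32, in the 128-384 range
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division (round up)
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}
```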

Mark