CUDA performance

Hi all,

I have to show my team GPU performance compared to the CPU's.
It is certainly correct to say that CPU execution time increases linearly with the number of iterations.
I have some problems with the GPU one, though.
I am referring to the variable allocation time PLUS the execution time.
There is a fixed amount of time needed to allocate variables and get the results back. If programmed properly, the total computation time should grow much more slowly than the CPU one (ideally negligibly).
My problem is understanding what happens when the number of operations is very low. I had a case where I used a kernel with only 50 operations and the total computing time was several seconds, whereas when several thousand operations were executed this time was only a few milliseconds.
Is that consistent with your experience?
If not, what could be the trick? And if yes, how can we justify the fact that the computation time decreases?
I hope I have been clear enough. Could someone tell me what the differences should be if we execute the same kernel with very few threads (10 or 50) versus several million (10M or 500M)?
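
For what it's worth, here is a minimal timing sketch of what I mean (the kernel, sizes, and names are placeholders, not my actual code); it separates the allocation/transfer overhead from the kernel execution itself:

```cpp
// Minimal timing sketch: separates allocation/transfer overhead from kernel time.
// dummyKernel and the buffer size are placeholders, not the actual workload.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // a few trivial operations per thread
}

int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    cudaEventRecord(t0);
    cudaMalloc(&d, n * sizeof(float));                             // allocation
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // upload
    cudaEventRecord(t1);

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);                   // computation
    cudaEventRecord(t2);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // download
    cudaEventRecord(t3);
    cudaEventSynchronize(t3);

    float msSetup, msKernel, msDownload;
    cudaEventElapsedTime(&msSetup, t0, t1);
    cudaEventElapsedTime(&msKernel, t1, t2);
    cudaEventElapsedTime(&msDownload, t2, t3);
    printf("alloc+upload %.3f ms, kernel %.3f ms, download %.3f ms\n",
           msSetup, msKernel, msDownload);

    cudaFree(d);
    free(h);
    return 0;
}
```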

Can you please clarify which grid and block sizes you have used?

Launching a kernel causes some overhead. It has been reported on this forum that for large grids (with dimensions greater than 8192, if I remember correctly) the overhead is much higher than for tiny grids (the documentation says nothing about this dependency, so it may be some kind of bug which will be fixed). If we don't count this overhead, running time should scale linearly with the number of thread blocks.
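
If you want to see the launch overhead in isolation, a rough sketch like this (an empty kernel timed over many launches; the grid size is only an example) should show it:

```cpp
// Rough sketch: estimate kernel launch overhead with an empty kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    emptyKernel<<<1, 1>>>();            // warm-up (context creation, module load)
    cudaDeviceSynchronize();

    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<4096, 256>>>();   // example grid; vary the dimensions to see the effect
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch time: %.3f us\n", 1000.0f * ms / launches);
    return 0;
}
```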

From a practical point of view you should avoid using grids that are too small (as this will increase overall overhead) and kernels that run too long (as this will freeze the display and may trigger the 5-second watchdog limit).

The startup time on my machine (quad-core with an 8800 GTX) is about 0.15 seconds. This includes allocating (almost all) memory on the device, so this is not really a problem.

Copying memory to and from the device can take a significant amount of time, although it can handle up to 8 GB/s (Dell on PCIe).
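
A rough way to check what your own machine delivers is a sketch like this (arbitrary size, pageable host memory; pinned memory would do better):

```cpp
// Rough host->device bandwidth check (pageable memory; pinned memory is faster).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB test buffer
    char *h = (char *)malloc(bytes);
    char *d;
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("host -> device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(d);
    free(h);
    return 0;
}
```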

If you perform the calculation directly on global memory, it will be slow. Copy the data to shared memory, perform the calculations there, and copy the result back; this will be much faster.
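
For illustration (not your code, and the operation here is trivial, so the pattern only really pays off when each element is reused several times), the usual shape is:

```cpp
// Stage-into-shared-memory pattern: one coalesced read, work in shared memory,
// one coalesced write back. Assumes the block is launched with 256 threads.
__global__ void scaleWithShared(const float *in, float *out, int n, float factor)
{
    __shared__ float tile[256];                 // one element per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];              // single read from global memory
    __syncthreads();

    if (i < n) {
        float v = tile[threadIdx.x];            // all further work uses shared memory
        v = v * factor + 1.0f;
        out[i] = v;                             // single write back to global memory
    }
}
```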

There are many more factors which will increase or decrease performance, but my rule of thumb is: do as little with global memory as possible, perform as many calculations as possible in shared/local memory, and avoid the PCIe bus ;-)
Then you can look at coalescing, thread divergence, etc.
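
On the coalescing point, the difference is simply whether consecutive threads in a warp touch consecutive addresses; a sketch of the two access patterns (made-up kernels, just to show the idea):

```cpp
// Coalesced: consecutive threads read consecutive elements.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread reads an element 'stride' apart from its neighbour's,
// which breaks coalescing and typically runs far slower on the same data.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];
}
```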

Oh, and use all the 512 threads in a block with as many blocks as possible!

I tested with the matrixMul example.

Here you can see the speedup and calculation time (logarithmic scale):
[charts not reproduced]

256 may be a better choice (if you do not have a lot of information to share between threads). You can have 768 threads per multiprocessor, so with 512 threads per block you can have only 1 thread block per MP, but with 256 you can have 3 thread blocks per MP (when not using too many registers and too much shared memory). So latency hiding may be more effective with 256 threads per block.
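
Concretely, assuming register and shared-memory usage allow it: 768 / 512 = 1 resident block per MP (only 512 of the 768 thread slots used), while 768 / 256 = 3 resident blocks per MP (all 768 slots used).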

Agreed. I have one kernel which uses fewer than 16 registers, so I could run 512 threads per block, but performance with 256 threads/block is about 25% better.

Thanks guys!

I thought it would be more efficient to use as many threads per block as possible. Some time ago I started with, say, 300 threads (nice blocks of 10x30 ;-) but noticed an increase in performance when stepping up to 512 (16x32). I used this ‘knowledge’ for another program. But I’ve just halved the block for this one (16x16), and the running time went from 0.92 to 0.58 seconds.

Many, many thanks for this insight!

If only the maximum number of threads per multiprocessor had been 1024 :D

Thanks for the answers, guys. But this chart caught my attention.

Comparing the values for N=1 and N=10, the ratio decreases.

Does this mean that for N=10 the CPU is slower, compared to the GPU, than it is for N=1?

Is that logical? And what exactly is N?

Well, check the matrixMul example, maybe? Jordy stated that he tested it with that example, so I guess the N is coming from there.

Block sizes that are not multiples of 32 threads will lead to inefficient utilization of the GPU. That chart shows block sizes from 1 to 16 threads. Since the warp size of the GPU is 32 threads, these sizes will not even keep a single multiprocessor busy or hide pipeline latency. I recommend using blocks of between 128 and 384 threads, in multiples of 32.
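
For example, a minimal launch configuration along those lines (myKernel and the sizes are placeholders):

```cpp
// Launch-configuration sketch: block size a multiple of 32 (here 256),
// grid rounded up so that all n elements are covered. myKernel is a placeholder.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;        // bounds check handles the padded last block
}

void launch(float *d_data, int n)
{
    const int threadsPerBlock = 256;                                 // multiple of 32, in the 128-384 range
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division (round up)
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
}
```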

Mark