Ideal number of threads per block

tom74 · March 23, 2010, 8:22pm

Hi everybody,
I’m working with a Tesla C1060 and I get different computing times with different numbers of threads per block. Can anyone tell me the ideal number of threads per block for this device, or in general?
Thanks

zeus13i · March 23, 2010, 11:10pm

It depends on your problem/algorithm… experiment with different configuration arguments.

tom74 · March 24, 2010, 10:43am

I just don’t understand why increasing the number of threads per block makes the execution time worse!

I always use the same parameters:

32 threads per block: 8669.04 ms
64 threads per block: 14941.82 ms
128 threads per block: 29258.06 ms
256 threads per block: 40197.43 ms

I think the execution time should behave in exactly the opposite way: if I increase the number of threads per block, the execution time should decrease, and vice versa.

Thanks a lot

jjp · March 24, 2010, 12:37pm

As already said: it depends on what you are doing. Without more information it is impossible to tell why your code behaves the way it does.

tom74 · March 24, 2010, 2:18pm

I’m trying to simulate the collective motion of a set of particles in a fluid.

eyalhir74 · March 24, 2010, 2:25pm

I think jjp was thinking more of some sample code and register usage/shared mem usage in your kernel…

laughingrice · March 24, 2010, 10:24pm

Could be that you don’t have enough thread blocks, so you are using fewer SMs and thus not fully utilizing the card. Could be partition camping, warp serialization, or a bunch of other things.

We need to know more about the algorithm and problem size to be able to tell more.
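
To make the "not enough thread blocks" point concrete, here is a minimal host-side sketch. It assumes one thread per particle (the thread does not say how the work is mapped), and the helper names `blocks_for` and `sms_with_work` are made up for illustration. The 30 SMs are the Tesla C1060's count.

```cpp
#include <algorithm>
#include <cassert>

// The Tesla C1060 has 30 streaming multiprocessors (SMs).
constexpr int kNumSMs = 30;

// Blocks needed so that each of `n` work items gets its own thread
// (assumption: one thread per item).
int blocks_for(int n, int threads_per_block) {
    assert(n > 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;  // ceiling division
}

// Upper bound on the number of SMs that can receive at least one block.
int sms_with_work(int n, int threads_per_block) {
    return std::min(blocks_for(n, threads_per_block), kNumSMs);
}
```

For a hypothetical 4096 particles, 32 threads per block gives 128 blocks, enough to put work on all 30 SMs, while 256 threads per block gives only 16 blocks, so at most 16 SMs have anything to do.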

tom74 · March 24, 2010, 10:24pm

The kernel with these execution times uses a shared memory array with one element per thread in the block. Each thread of each block writes one element into shared memory, then I synchronize to make sure that all the threads have contributed before the array is used. After that I apply a particular equation to simulate the model.

I tested this program with different numbers of threads per block, from 2 up to 256 (powers of 2 only), and I noticed that I get the best performance with 32 threads per block… but 32 isn’t the maximum number of threads per block, is it? Can anyone explain the difference between the maximum number of threads per block and the ideal number of threads per block? I think they should be the same.

zeus13i · March 24, 2010, 11:10pm

Hmm, not sure… But if you aren’t going to provide us with code, then tell us what the configuration arguments are, the occupancy, the kernel register usage, whether you access global memory, etc.

But, something that comes to mind – are you SURE that you aren’t compiling/running in device emulation mode? :P
[ That is, are you passing the ‘-deviceemu’ argument to nvcc? ]

Jimmy_Pettersson · March 24, 2010, 11:27pm

There are only 8 SPs on each SM executing these threads, so it’s not like you’re getting more computing resources just because there are more threads in the block.

One thing that is quite common is that if you are able to fit many blocks on each SM, you will be able to hide more and more of the off-chip latency via the context switching that the hardware does.

If your threads are heavy, for example consuming many registers or a lot of shared memory, the number of active blocks will be lowered. Also, in your example, using just 32 threads means the block is a single warp, so the syncing is no longer required.

“2 up to 256 (only powers of 2)” - try to keep it to multiples of 32 since that is the warp size.
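
A small sketch of why multiples of 32 matter: the hardware schedules whole warps, so a block whose size is not a multiple of 32 still occupies a full warp for its leftover threads. The function names below are made up for illustration.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;

// Warps the hardware must schedule for one block (partial warps round up).
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Fraction of the scheduled warp lanes that carry a real thread.
double lane_utilization(int threads_per_block) {
    const int lanes = warps_per_block(threads_per_block) * kWarpSize;
    return static_cast<double>(threads_per_block) / lanes;
}
```

With 2 threads per block, one warp is still scheduled and only 2 of its 32 lanes do useful work. A block of 48 threads needs 2 warps and fills 75% of their lanes, while any multiple of 32 fills them all.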

eyalhir74 · March 25, 2010, 7:21am

Take a look at Jimmy’s answer.

Also, can you compile the .cu file with the following params: --ptxas-options="-v"

That should show you the resources your kernel uses, it should be something like this:

1>ptxas info : Used 17 registers, 3272+16 bytes smem, 27328 bytes cmem[0], 40 bytes cmem[1]

Putting this info, plus the number of threads per block, into the occupancy calculator should give you the occupancy your kernel achieves for each thread count.

Maybe that could shed more light on what you see.
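
For reference, a rough host-side sketch of the kind of arithmetic the occupancy calculator does. It uses the per-SM limits of compute capability 1.3 (the C1060) and deliberately ignores the register and shared memory allocation granularity that the real calculator accounts for, so treat its results as approximate. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits for compute capability 1.3 (Tesla C1060).
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 32;      // i.e. 1024 resident threads
constexpr int kMaxBlocksPerSM = 8;
constexpr int kRegistersPerSM = 16384;
constexpr int kSharedBytesPerSM = 16384;

int warps_in_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Resident blocks per SM: the tightest of the four resource limits.
// Simplification: allocation granularity is ignored.
int active_blocks_per_sm(int threads_per_block, int regs_per_thread,
                         int smem_bytes_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0 &&
           smem_bytes_per_block >= 0);
    const int warps = warps_in_block(threads_per_block);
    const int by_warps = kMaxWarpsPerSM / warps;
    const int by_regs = kRegistersPerSM / (regs_per_thread * warps * kWarpSize);
    const int by_smem = smem_bytes_per_block > 0
                            ? kSharedBytesPerSM / smem_bytes_per_block
                            : kMaxBlocksPerSM;
    return std::min({kMaxBlocksPerSM, by_warps, by_regs, by_smem});
}

// Occupancy = resident warps / maximum resident warps per SM.
double occupancy(int threads_per_block, int regs_per_thread,
                 int smem_bytes_per_block) {
    const int blocks = active_blocks_per_sm(threads_per_block, regs_per_thread,
                                            smem_bytes_per_block);
    return static_cast<double>(blocks * warps_in_block(threads_per_block)) /
           kMaxWarpsPerSM;
}
```

Feeding in the sample ptxas line above (17 registers, and treating 3272+16 as 3288 bytes of shared memory per block, which is an assumption), this simplified model gives 12.5%, 25%, 50% and 75% occupancy at 32, 64, 128 and 256 threads per block. Those numbers belong to the sample output, not to the kernel in question, whose shared memory use also grows with the block size.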

The only limit on the number of threads per block is what you see in the output for deviceQuery from the SDK. For example:

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

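Within those limits you can use any number of threads, as long as the total per block stays within the “Maximum number of threads per block” value that deviceQuery also reports (512 on the C1060). As Jimmy explained in his post, a multiple of 32 would probably be best.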
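
As a quick illustration of checking a block shape against these limits, here is a host-side sketch. The per-dimension values are the ones from the deviceQuery lines above. The 512-thread total is the compute capability 1.x limit that deviceQuery reports separately, and it is added here as an assumption since it is not in the quoted output. The function names are made up.

```cpp
#include <cassert>

// Per-dimension limits from the deviceQuery output quoted above.
constexpr int kMaxBlockX = 512;
constexpr int kMaxBlockY = 512;
constexpr int kMaxBlockZ = 64;
// Total threads per block on compute capability 1.x devices (assumption:
// taken from deviceQuery's "Maximum number of threads per block" field).
constexpr int kMaxThreadsPerBlock = 512;

// True if an x*y*z block respects both the per-dimension and total limits.
bool block_fits(int x, int y, int z) {
    if (x < 1 || y < 1 || z < 1) return false;
    if (x > kMaxBlockX || y > kMaxBlockY || z > kMaxBlockZ) return false;
    return static_cast<long long>(x) * y * z <= kMaxThreadsPerBlock;
}

// Total thread count of a block shape that is known to fit.
int threads_in_block(int x, int y, int z) {
    assert(block_fits(x, y, z));
    return x * y * z;
}
```

So a 16 x 16 x 2 block (512 threads) is fine, but 32 x 32 x 1 is rejected even though each dimension is individually within range.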

You need to find the balance between resource pressure, latency hiding, and the logic flow in each of your kernels. This mostly comes down to running tests to find the best number of threads per block. You just need to play with it a bit…

Anyway, as I said in the previous post, if you can post the kernel code, that might also shed some light on potential problems in the kernel itself causing this behaviour.

eyal
