Ideal number of threads per block

Hi everybody,
I'm working with a Tesla C1060, and I get different computing times with different numbers of threads per block. Can anyone tell me what the ideal number of threads per block is for this device, or in general?

It depends on your problem/algorithm… experiment with different configuration arguments.

I just don't understand why the execution time gets worse as I increase the number of threads per block!

I always use the same parameters:

with 32 threads per block I obtain an execution time of 8669.04 ms
with 64 threads per block I obtain an execution time of 14941.82 ms
with 128 threads per block I obtain an execution time of 29258.06 ms
with 256 threads per block I obtain an execution time of 40197.43 ms

I would expect exactly the opposite: if I increase the number of threads per block, the execution time should decrease, and vice versa.

Thanks a lot

As already said: it depends on what you are doing. Without more information it is impossible to tell why your code behaves the way it does.

I’m trying to simulate the collective motion of a set of particles in a fluid.

I think jjp was thinking more of some sample code and register usage/shared mem usage in your kernel…

It could be that you don't have enough thread blocks, so you are using fewer SMs and thus not fully utilizing the card. It could also be partition camping, warp serialization, or a bunch of other things.

We need to know more about the algorithm and the problem size to be able to tell more.

The kernel with these execution times uses a shared-memory array containing one element per thread in the block; each thread of each block inserts an element into shared memory, then I synchronize everything, making sure that all threads have contributed to the construction of the shared array. After that I apply a particular equation to simulate the model.

I tested this program with different numbers of threads per block, from 2 up to 256 (only powers of 2), and I noticed that I obtain the best performance with 32 threads per block. But 32 isn't the maximum number of threads per block, is it? Can anyone explain the difference between the maximum number of threads per block and the ideal number of threads per block? I thought they should be the same.
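A minimal sketch of the pattern described above, with hypothetical names (the actual kernel and the model's equation are not shown in the thread, so the inner loop is only a placeholder):

```cuda
// Hypothetical sketch: each thread stages one element into shared memory,
// the block synchronizes, then every thread reads its blockmates' data.
__global__ void updateParticles(const float *posIn, float *posOut, int n)
{
    extern __shared__ float tile[];          // one element per thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n)
        tile[threadIdx.x] = posIn[tid];      // stage into shared memory
    __syncthreads();                         // wait for the whole block

    if (tid < n) {
        float acc = 0.0f;
        for (int j = 0; j < blockDim.x; ++j) // interact with all blockmates
            acc += tile[j];                  // placeholder for the real equation
        posOut[tid] = acc / blockDim.x;
    }
}
// Launched as: updateParticles<<<blocks, threads, threads * sizeof(float)>>>(...)
```

One thing worth noting about this shape of kernel: if the inner loop really runs over all blockDim.x elements, then the work done by every thread grows with the block size, which by itself would make larger blocks slower, just as in the timings above.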

Hmm, not sure… But if you aren't going to provide us with code, then tell us what the configuration arguments are, the occupancy, the kernel register usage, whether you read from global memory, etc.

But, something that comes to mind – are you SURE that you aren’t compiling/running in device emulation mode? :P
[ That is, are you passing the ‘-deviceemu’ argument to nvcc? ]

There are only 8 SPs executing these threads, so it's not like you're getting more computing resources just because there are more threads in the block.

One thing that is quite common: if you are able to fit many blocks on each SM, you will be able to hide more and more of the off-chip memory latency via the context switching the hardware does.

If your threads are heavy, for example consuming many registers or a lot of shared memory, the number of active blocks will be lowered. Also, in your example, using just 32 threads means the syncing is no longer required, since 32 threads form exactly one warp and execute in lockstep.

“2 up to 256 (only powers of 2)” - try to keep it to multiples of 32 since that is the warp size.

Take a look at Jimmy’s answer.

Also, can you compile the .cu file with the following params: --ptxas-options="-v"

That should show you the resources your kernel uses; the output should look something like this:

1>ptxas info : Used 17 registers, 3272+16 bytes smem, 27328 bytes cmem[0], 40 bytes cmem[1]

Putting this info, together with the number of threads, into the occupancy calculator should give you your kernel's occupancy for each thread count.

Maybe that could shed more light on what you see.

The only limit on the number of threads per block is what you see in the output for deviceQuery from the SDK. For example:

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Within those limits you can use any number of threads; however, as Jimmy explained in his post, a multiple of 32 would probably be best.

You need to find the balance between resource pressure, latency hiding, and the logic flow in each one of your kernels; this is mostly a matter of tests you can run to find the best number of threads per block. You just need to play with it a bit…

Anyway, as I said in the previous post, if you can post the kernel code, that might also shed some light on potential problems in the kernel itself causing this behaviour.