Performance in different thread-block schemes

Hi all,

I have a fixed amount of data to process in my GPU program. I changed the number of threads per block to compare performance. Performance was better with fewer threads per block, although the difference was not dramatic.

I am confused by this result, since the amount of data is constant. Can anyone explain it to me?

 Thank you so much.


The hardware tries to maximize the load on each multiprocessor: while some threads are waiting for a global memory read or write, other threads are picked for computation. The occupancy of a multiprocessor depends, however, on how much shared memory each block uses, on the register usage, and on the number of threads in a block, as there can be only 752 threads running at the same time. With blocks of 512 threads it can schedule one block, whereas with 256 it can schedule 3. So more threads per block does not necessarily mean better performance. I've seen a dramatic speedup when switching from 512 to 256 threads per block, even though this gives more global memory accesses overall.


@S.Warris: I think you meant 768 active threads per multiprocessor. :)

To be precise, this number is for devices with compute capability 1.0 and 1.1 (G8X and G9X). With compute capability 1.2 and 1.3 (GT200 is 1.3), the number of active threads per multiprocessor increases to 1024 (Programming Guide 2.0, p. 78f).
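As a rough illustration of these limits, the sketch below computes how many blocks can be resident on one multiprocessor when only the thread limit and the per-multiprocessor block limit matter. The function name `blocksByThreadLimit` is made up for this example, the default of 8 blocks per multiprocessor is the compute capability 1.x limit as I understand it from the programming guide, and register and shared memory usage are ignored here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: number of blocks that fit on one multiprocessor
// when only the thread limit and the block limit are considered.
// Register and shared memory usage are deliberately ignored.
int blocksByThreadLimit(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM = 8)
{
    assert(threadsPerBlock > 0 && maxThreadsPerSM > 0 && maxBlocksPerSM > 0);
    // Only whole blocks can be scheduled, so the division rounds down.
    return std::min(maxThreadsPerSM / threadsPerBlock, maxBlocksPerSM);
}
```

With 768 threads per multiprocessor, `blocksByThreadLimit(512, 768)` gives 1 and `blocksByThreadLimit(256, 768)` gives 3. With 1024 threads the same block sizes give 2 and 4.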


@recharge: of course! Thanks.

Thanks all.

But it is still not very clear to me. I found this in the programming guide: each block is split into SIMD groups of threads called warps; each of these warps contains the same number of threads, called the warp size, and is executed by the multiprocessor in a SIMD fashion; a thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources. G80 hardware has 8 ALUs per multiprocessor. So how many processors work on each block or on each SIMD group?
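To make the quoted passage a bit more concrete, here is a small sketch of the arithmetic, assuming a warp size of 32 and the 8 ALUs per multiprocessor mentioned above. The names `kWarpSize`, `kAlusPerSM`, `warpsPerBlock` and `cyclesPerWarpInstruction` are made up for this example, and the cycle count assumes a simple instruction where each ALU handles one thread per clock.

```cpp
#include <cassert>

const int kWarpSize  = 32;  // assumed warp size
const int kAlusPerSM = 8;   // ALUs per G80 multiprocessor, as stated above

// Number of warps a block is split into (the last warp may be partly filled).
int warpsPerBlock(int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // round up
}

// Clock cycles needed to issue one simple instruction for a whole warp,
// assuming each ALU processes one thread per cycle.
int cyclesPerWarpInstruction()
{
    return kWarpSize / kAlusPerSM;
}
```

Under these assumptions a block of 512 threads is 16 warps, a block of 256 threads is 8 warps, and all 8 ALUs work on a single warp for 4 cycles per instruction before the scheduler can switch to another warp.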

My understanding is that the threads in the blocks processed by one multiprocessor run in parallel, while the blocks are scheduled onto the multiprocessors. So if I want to increase performance, I need to maximize the number of blocks that can be processed by one multiprocessor (up to 8). Is this correct?

If this is correct, 3 blocks of 256 threads each should perform better than 1 block of 512 threads, right? I have two versions of my application: one uses shared memory, the other doesn't. I changed the number of threads per block for the latter from 512 to 256, and the version with 256 threads per block performs worse than the one with 512. In contrast, the version that uses shared memory performs better with 256 than with 512 threads per block.

Can anyone help? Thank you!!

In general, yes. But register and shared memory usage can also change occupancy. Download the occupancy calculator spreadsheet from the CUDA download links and see the forum sticky on this topic for more info.
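A simplified sketch of what the occupancy calculator works out is below. It takes the tightest of the thread, block, register and shared memory limits. The `SmLimits` defaults are the compute capability 1.0/1.1 values as I recall them from the programming guide (768 threads, 8 blocks, 8192 registers, 16 KB shared memory), the names are made up for this example, and the allocation granularity that the real spreadsheet applies is ignored, so treat the results as approximate.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Defaults are assumed compute capability 1.0/1.1
// values; check the programming guide for your device.
struct SmLimits {
    int maxThreads     = 768;
    int maxBlocks      = 8;
    int registers      = 8192;
    int sharedMemBytes = 16384;
    int warpSize       = 32;
};

// Hypothetical helper: resident blocks per multiprocessor, taken as the
// tightest of the four limits. Allocation granularity is ignored.
int residentBlocks(int threadsPerBlock, int regsPerThread, int sharedBytesPerBlock,
                   const SmLimits& sm = SmLimits())
{
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);
    int blocks = std::min(sm.maxBlocks, sm.maxThreads / threadsPerBlock);
    blocks = std::min(blocks, sm.registers / (threadsPerBlock * regsPerThread));
    if (sharedBytesPerBlock > 0)  // no shared memory means no shared memory limit
        blocks = std::min(blocks, sm.sharedMemBytes / sharedBytesPerBlock);
    return blocks;
}

// Occupancy = resident warps / maximum resident warps per multiprocessor.
double occupancy(int threadsPerBlock, int regsPerThread, int sharedBytesPerBlock,
                 const SmLimits& sm = SmLimits())
{
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int maxWarps = sm.maxThreads / sm.warpSize;
    int blocks = residentBlocks(threadsPerBlock, regsPerThread, sharedBytesPerBlock, sm);
    return static_cast<double>(blocks * warpsPerBlock) / maxWarps;
}
```

For example, with 10 registers per thread and 4096 bytes of shared memory per block, 256 threads per block allows 3 resident blocks, while 512 threads per block allows only 1. Raising the shared memory use to 8192 bytes per block cuts the 256-thread case to 2 blocks, which is the kind of effect that can make the shared memory version of a kernel respond differently to the block size.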