efficiency of block/thread ratios

I am processing a large chunk of data that cannot fit into shared memory if I try to handle it all in a single kernel call. However, the data can be broken down into smaller pieces for independent processing. I have determined the maximum number of data elements that can fit into shared memory per multiprocessor.

I am wondering if I would get better performance by using a larger number of blocks per multiprocessor with fewer threads each (8 blocks with 32 threads each) or a smaller number of blocks per multiprocessor with more threads each (1 block with 256 threads). Or does the answer lie somewhere in the middle?

Also, do I take any sort of performance hit for making many (>64) independent calls to the kernel? My feeling is that any performance lost would be more than made up for by being able to use shared memory instead of global memory (which is what I have been using).

Thanks for any help.

There isn’t a quick answer to your question. My suggestion would be to try multiple configurations to determine what works best for your particular code. Some things to consider:

  • a block with many threads will incur a slightly higher overhead when threads synchronize. So, if your code requires many syncs, you may save some clock cycles if you use fewer threads.

  • the register file of a multiprocessor is shared by all threads resident on it. So, increasing the number of threads reduces the number of registers available per thread. This may reduce occupancy or, in the worst case, cause a launch failure. The latter is not that frequent, and it is predictable, since you can check how many registers each thread requires (for example, by compiling with nvcc --ptxas-options=-v).

  • switching between warps of a block is fast. Switching between different blocks is slower. So, if thread switching were the only concern (which it never is; see the points above and memory latency), a higher number of threads per block would lead to a shorter execution time simply due to lower switching overhead.
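
To make the register and shared-memory trade-off a bit more concrete, here is a rough sketch in plain C++ (host-side arithmetic only, no CUDA calls) that estimates how many blocks can be resident on one multiprocessor for a given configuration. The names SmLimits, resident_blocks and resident_threads are made up for this example, and the limits you pass in are assumptions you would replace with the values reported by cudaGetDeviceProperties for your card. It also ignores register and shared-memory allocation granularity, so treat the result as an upper bound rather than an exact figure.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Hypothetical container for this sketch; fill it
// with the values your device actually reports.
struct SmLimits {
    int max_threads;       // resident threads per multiprocessor
    int max_blocks;        // resident blocks per multiprocessor
    int registers;         // 32-bit registers per multiprocessor
    int shared_mem_bytes;  // shared memory per multiprocessor
};

// Upper bound on the number of blocks that can be resident at the same time
// on one multiprocessor. Simplified model: allocation granularity is ignored.
// A result of 0 means a single block already exceeds a limit, which is the
// launch-failure case.
int resident_blocks(const SmLimits& sm, int threads_per_block,
                    int regs_per_thread, int shared_bytes_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    assert(shared_bytes_per_block >= 0);

    const int by_threads = sm.max_threads / threads_per_block;
    const int by_regs = sm.registers / (regs_per_thread * threads_per_block);
    // A block that uses no shared memory is not limited by it.
    const int by_smem = shared_bytes_per_block > 0
                            ? sm.shared_mem_bytes / shared_bytes_per_block
                            : sm.max_blocks;

    return std::min({sm.max_blocks, by_threads, by_regs, by_smem});
}

// Resident threads per multiprocessor for the same configuration.
int resident_threads(const SmLimits& sm, int threads_per_block,
                     int regs_per_thread, int shared_bytes_per_block) {
    return threads_per_block *
           resident_blocks(sm, threads_per_block, regs_per_thread,
                           shared_bytes_per_block);
}
```

As an illustration, take assumed limits of 768 threads, 8 blocks, 8192 registers and 16384 bytes of shared memory per multiprocessor, and a kernel that needs 10 registers per thread. Then 8 blocks of 32 threads using 2048 bytes each and 1 block of 256 threads using all 16384 bytes both come out at 256 resident threads, so this estimate alone does not pick a winner and you still have to time both. If the kernel needed 40 registers per thread instead, a 256-thread block would ask for 10240 registers and the function returns 0, which is the launch-failure case mentioned above.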


Thank you for the quick reply. I think my answer may lie somewhere in the middle, mostly due to the issue of register availability.