CUDA kernel block size tuning with maximum theoretical occupancy


Assume a kernel uses 24 registers per thread and no shared memory, running on a Kepler GPU (GTX 680, compute capability 3.0) with 8 multiprocessors.

I am tuning the kernel block size for “good” occupancy (I understand this is different from execution time). After experimenting with the CUDA Occupancy Calculator spreadsheet, I found multiple block sizes that achieve 100% theoretical occupancy, such as 128, 256, or 512. All three configurations reach the maximum threads and warps per SM; the only difference is the number of active thread blocks per SM, which is 16, 8, and 4 respectively.

Now I assume all three configurations give me good hardware utilization, and I can pick whichever is fastest. In that case, should I use a large thread block (512) or a small one (128)? What should I consider now? Is it the hardware overhead of block switching? Does the rule of thumb that “larger blocks generally hide latency better” still hold here? After all, every configuration yields the same number of warps being scheduled.

After some benchmarking (a simple windowing operator on a grayscale image), I found the following results:

Execution time: 128 (1.049 ms), 256 (1.063 ms), 512 (1.101 ms). Each value is the median of 50 runs.

Achieved occupancy: 128 (82.1%), 256 (79.1%), 512 (77.6%). From nvprof.

The results point me toward smaller blocks: many small blocks seem to achieve better occupancy than fewer large blocks. But my initial questions are still largely unanswered. What am I missing here?


People who are going after the last 5% of performance often do “tuning” like you have already done. That is, test different launch configurations (and perhaps other parameters) to see what works best. Performance tuning generalizations are harder to make, and harder to generally apply, when going after the last ounce of performance. Sometimes the best way to find optimality is not to attempt to derive it from first principles, but instead just study the landscape exhaustively until you find it.

You may wish to study the “cuda tail effect” (just google that please). It is often a factor in this kind of ninja-level tuning.

Suppose you have a threadblock of 512 threads, i.e. 16 warps. Now suppose the execution duration of each warp differs, so one warp will have the longest execution duration. While that warp runs, the other warps execute and then retire, meaning they are no longer schedulable. If this effect is prolonged, it can lead to “starvation” in the SM and lower performance: until the entire block retires, the block scheduler may not have enough free resources to schedule a new block.

Now suppose we replace that 512-thread block with 4 blocks of 128 threads, i.e. 4 blocks of 4 warps each. The tail effect is still present, but its ramifications are reduced. A given long-duration warp now only holds up itself and the three other warps in its block, and retirement of a threadblock only requires 4 warps to retire. This can result in higher performance: on average, more warps are available to be scheduled.

Whether or not the tail effect is relevant for your code, I cannot say. And there may be other ninja-level tuning topics that matter for someone going after the last 5%. This presentation covers several ninja-level tuning topics, not all of which are pertinent to your observation:

I would take a look at the waves and tails discussion in that presentation. The tail effect (i.e. underutilization of the GPU at the tail end of block or grid execution due to grid scheduling characteristics taking into account work size and work characteristics) has various considerations beyond just the one I discussed.

This blog article may also be of interest:

Thanks Robert, that is very useful information. I will look into that.

Hi Robert,

That presentation is pretty interesting, thanks for digging it up and sharing it.

I understand that both hardware and software have made quite some progress in the past 6-7 years… do you know if there is a similar talk targeting Volta and Turing?

Thank you,