CUDA kernel block size tuning with maximum theoretical occupancy

Hi,

Assume a kernel uses 24 registers per thread and no shared memory. A Kepler GPU (GTX 680, compute capability 3.0) has 8 multiprocessors.

I am tuning the kernel block size for “good” occupancy (I understand this is not the same thing as good execution time). After playing with the Excel occupancy calculator, I found multiple options that achieve 100% theoretical occupancy, such as block sizes of 128, 256, or 512. All three configurations reach the maximum number of threads and warps per SM. The only difference is the number of active thread blocks per SM, which is 16, 8, and 4, respectively.
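For what it's worth, the same check can be done programmatically with the occupancy API instead of the spreadsheet. A minimal sketch (the kernel below is just a placeholder standing in for mine, so its register count will differ from the assumed 24/thread):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one; the actual register count
// depends on the compiled kernel, not on this toy body.
__global__ void myKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int sizes[] = {128, 256, 512};
    for (int blockSize : sizes) {
        int blocksPerSM = 0;
        // Max resident blocks per SM for this kernel at this block size (0 bytes dynamic smem)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0);
        float occ = (blocksPerSM * blockSize) /
                    (float)prop.maxThreadsPerMultiProcessor;
        printf("block size %3d: %2d blocks/SM, theoretical occupancy %.0f%%\n",
               blockSize, blocksPerSM, occ * 100.0f);
    }
    return 0;
}
```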

Now I assume these three configurations all give me good hardware utilization, and I can pick whichever one is fastest. In this case, should I use a large thread block of 512 or a small block of 128? What should I consider now? Is it the hardware overhead of block switching? Does “larger blocks generally hide latency better” still hold here? After all, every one of these configurations yields the same number of warps being scheduled.

After some benchmarking (a simple windowing operator on a grayscale image), I found the following results:

Execution time: 128 (1.049 ms), 256 (1.063 ms), 512 (1.101 ms). Each value is the median of 50 runs.

Achieved occupancy: 128 (82.1%), 256 (79.1%), 512 (77.6%), as reported by nvprof.
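For reference, here is a minimal sketch of the kind of cudaEvent-based timing harness that produces numbers like these (the `windowKernel` below is a trivial 3x3 mean standing in for my actual windowing operator, and error checking is omitted):

```
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial 3x3 mean filter, standing in for the actual windowing operator.
__global__ void windowKernel(unsigned char *dst, const unsigned char *src,
                             int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= width * height) return;
    int x = i % width, y = i / width;
    int sum = 0, count = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx >= 0 && xx < width && yy >= 0 && yy < height) {
                sum += src[yy * width + xx];
                ++count;
            }
        }
    dst[i] = (unsigned char)(sum / count);
}

// Median kernel time (ms) over 'runs' launches for one block size.
float medianTimeMs(unsigned char *dst, const unsigned char *src,
                   int width, int height, int blockSize, int runs)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int n = width * height;
    int grid = (n + blockSize - 1) / blockSize;

    std::vector<float> times(runs);
    for (int r = 0; r < runs; ++r) {
        cudaEventRecord(start);
        windowKernel<<<grid, blockSize>>>(dst, src, width, height);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&times[r], start, stop);
    }
    std::sort(times.begin(), times.end());

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return times[runs / 2];
}

int main()
{
    const int width = 4096, height = 4096;
    unsigned char *src, *dst;
    cudaMalloc(&src, width * height);
    cudaMalloc(&dst, width * height);
    cudaMemset(src, 128, width * height);

    int sizes[] = {128, 256, 512};
    for (int bs : sizes)
        printf("block size %3d: %.3f ms\n", bs,
               medianTimeMs(dst, src, width, height, bs, 50));

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```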

The results suggest I should go for smaller blocks. It seems that many small blocks achieve better occupancy than fewer large blocks. But my initial questions are still largely unanswered. What am I missing here?

Thanks

People who are going after the last 5% of performance often do “tuning” like you have already done: test different launch configurations (and perhaps other parameters) to see what works best. Performance-tuning generalizations are harder to make, and harder to apply generally, when going after that last ounce of performance. Sometimes the best way to find the optimum is not to derive it from first principles, but simply to study the landscape exhaustively until you find it.

You may wish to study the “CUDA tail effect” (just google that phrase). It is often a factor in this kind of ninja-level tuning.

Suppose you have a threadblock of 512 threads, i.e. 16 warps. Now suppose the execution duration of each warp is different; one of the warps will therefore have the longest execution duration. Over the execution duration of that warp, the other warps will execute and then retire, meaning they are no longer schedulable. If this effect is prolonged, it can lead to “starvation” in the SM and lower performance. Effectively, until the entire block retires, it may be impossible (i.e. there are not enough free resources) for the block scheduler to schedule a new block.

Now suppose we replace that 512-thread block with 4 blocks of 128 threads, i.e. 4 blocks of 4 warps each. The tail effect is still present, but its ramifications are reduced. A given long-duration warp only holds up itself and three other warps, and retirement of a threadblock now only requires 4 warps to retire. This can result in higher performance, i.e. on average, more warps are available to be scheduled.
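As a toy illustration of that point (this is not your kernel; the uneven per-warp work here is deliberately artificial), you could compare the same total work split into 512-thread blocks versus 128-thread blocks:

```
#include <cstdio>
#include <cuda_runtime.h>

// Artificial workload: every 16th warp (by global warp index) does much more
// work than the others. With 512-thread blocks, each block contains one such
// "heavy" warp and cannot retire until it finishes; with 128-thread blocks,
// only one block in four is held up, so the others retire early and new
// blocks can be scheduled in their place.
__global__ void unevenWork(float *out, int heavyIters)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int gwarp = tid / warpSize;
    int iters = (gwarp % 16 == 0) ? heavyIters : heavyIters / 16;

    float v = (float)tid;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;   // dependent arithmetic chain to burn time
    out[tid] = v;
}

static float timeLaunch(float *out, int totalThreads, int blockSize, int heavyIters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    unevenWork<<<totalThreads / blockSize, blockSize>>>(out, heavyIters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int totalThreads = 1 << 20;
    float *out;
    cudaMalloc(&out, totalThreads * sizeof(float));

    int sizes[] = {512, 128};
    for (int bs : sizes)
        printf("block size %3d: %.3f ms\n", bs,
               timeLaunch(out, totalThreads, bs, 100000));

    cudaFree(out);
    return 0;
}
```

On hardware where the effect matters, the 128-thread configuration would be expected to finish sooner, since fewer fast warps sit idle waiting for their block's slowest warp to retire.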

Whether or not the tail effect is relevant for your code, I cannot say. And there may be other ninja-level tuning topics that are relevant for someone going after the last 5%. This presentation covers several ninja-level tuning topics, not all of which are pertinent to your observation:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

I would take a look at the waves-and-tails discussion in that presentation. The tail effect (i.e. underutilization of the GPU at the tail end of block or grid execution, due to grid-scheduling characteristics interacting with the size and nature of the work) has various considerations beyond the one I discussed.

This blog article may also be of interest:

https://devblogs.nvidia.com/cuda-pro-tip-minimize-the-tail-effect/

Thanks Robert, that is very useful information. I will look into that.

Hi Robert,

That presentation is pretty interesting, thanks for digging it up and sharing it.

I understand that both hardware and software have made quite a lot of progress in the past 6-7 years… do you know if there is a similar talk targeting Volta and Turing?

Thank you,
.Andrea