I have been optimizing a kernel function for several days now, and I believe there is no more room for improvement on the mathematical and algorithmic side. It’s time to focus on CUDA-level optimization.
According to Nsight Compute, the tail effect is the biggest bottleneck.
My device is the AGX Orin, with 16 SMs. Currently, gridDim = dim3(121, 3), blockDim = 128 (4 warps).
I understand that the tail effect is caused by the workload imbalance of the last wave, and that I may need to fill the remaining block slots in that last wave.
Tail Effect: Est. Speedup: 50%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 171 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 32.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid.
You have two waves, one with 192 blocks, one with 171 blocks. That is quite balanced.
Not sure where the 32.6% is coming from.
I doubt that you will achieve a speed-up of 50% without giving the GPU more work per invocation (e.g. putting two iterative kernel launches into one kernel launch).
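As a cross-check on those numbers, here is a minimal sketch (not from the original posts) of how the blocks-per-wave figure can be queried with the occupancy API; myKernel is only a placeholder for the actual kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for the real kernel */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // AGX Orin reports 16 SMs

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, /*blockSize=*/128, /*dynSmemBytes=*/0);

    int blocksPerWave = blocksPerSM * prop.multiProcessorCount;
    int totalBlocks   = 121 * 3;              // gridDim = dim3(121, 3)
    printf("blocks/SM = %d, blocks/wave = %d -> %d full wave(s) + %d blocks in the partial wave\n",
           blocksPerSM, blocksPerWave,
           totalBlocks / blocksPerWave, totalBlocks % blocksPerWave);
    return 0;
}
```

If the occupancy limit for the real kernel is 12 blocks per SM (48 resident warps / 4 warps per block), this reproduces the 192 + 171 split reported above.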
I was thinking more of combining it with other launches to make the invocation more efficient, or of running kernels in parallel on other streams, etc.
With only two waves, the relative impact of the tail effect is higher than with more waves.
If you want to explore it further, the usual suggestion (I think) would be to rewrite the kernel as a grid-stride loop, and choose a grid size to exactly fill your GPU. That will reduce your kernel to a single wave. The 50% number is based on the idea that that will be the most efficient launch (with respect to occupancy and the tail effect), and “could” result in a kernel duration equivalent to a single wave of your current realization.
That could be viewed as an “upper bound” or best case outcome. So if you wish, interpret the nsight compute suggestions that way. The percentage speedup is “the most you could possibly get from this change, if all other conditions were perfect for the effect.”
Since Nsight Compute can’t (currently) do the level of analysis needed to accurately predict the actual speedup from a complex refactoring, it gives the output in that sense.
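For concreteness, here is a minimal grid-stride sketch along those lines (hypothetical names and kernel body; the real kernel’s per-element work and its 2D grid indexing are not shown):

```cpp
#include <cuda_runtime.h>

// Grid-stride version: every launched block loops until all n elements
// are processed, so the grid size no longer has to match the problem size.
__global__ void myKernelGridStride(float* data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= 2.0f;  // stand-in for the real per-element work
    }
}

void launch(float* d_data, int n) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernelGridStride, /*blockSize=*/128, /*dynSmemBytes=*/0);

    // Exactly one full wave: numSMs * blocksPerSM resident blocks, no partial wave.
    myKernelGridStride<<<numSMs * blocksPerSM, 128>>>(d_data, n);
}
```

Sizing the grid to numSMs * blocksPerSM is what makes the partial wave disappear; whether it actually shortens the kernel is exactly the question discussed below.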
After switching to a grid-stride loop and reusing threads, the duration dropped to 18 microseconds, a 25% improvement. Additionally, the “Tail Effect: Est. Speedup: 50%” suggestion has disappeared from the optimization opportunities section.
I’m curious whether the 50% improvement mentioned in the NCU report can really be achieved.
Hi,
with just the original two waves, one should in general take special care, when using grid-stride loops, over how the work is distributed across the different SMs and SM Partitions, so that one SM Partition does not finish early while others still have several blocks to do (which is like an implicit tail effect). That is especially relevant if one SM runs more than 4 warps (i.e. one SM Partition has more than 1 warp).
In your case, each SM Partition has one warp, so it is less relevant.
I wouldn’t really expect much gain from the grid-stride refactoring by itself, for a case like this (2 waves). As curefab points out, you may not be changing the work breakdown structure very much, and even though you have eliminated the “wave effect”, you probably haven’t done much to actually make the imbalance go away. Naturally this would depend to some degree on how much work you are actually doing per element or per thread. If the work per element or per thread is “large”, then the refactoring is likely to make little difference.
To see something approaching a large difference in performance (like 50%) you would need a situation where the work per thread or per element was almost zero, such that the overhead of work scheduling (e.g. deposition of blocks, traversal of loops, etc.) was dominating your performance. Such characteristics are not a hallmark of good CUDA code anyway, although people sometimes wrestle with those cases as well.