I have been optimizing a kernel function for several days now, and I believe there is no more room for improvement on the mathematical and algorithmic side. It’s time to focus on CUDA-level optimization.
According to Nsight Compute, the tail effect is the biggest bottleneck.
My device is the AGX Orin, with 16 SMs. Currently, gridDim = dim3(121, 3), blockDim = 128 (4 warps).
I understand that the tail effect is caused by the workload imbalance of the last wave, and that I may need to fill the remaining block slots in that last wave.
Tail Effect: Est. Speedup: 50%
A wave of thread blocks is defined as the maximum number of blocks that can be executed in parallel on the target GPU. The number of blocks in a wave depends on the number of multiprocessors and the theoretical occupancy of the kernel. This kernel launch results in 1 full waves and a partial wave of 171 thread blocks. Under the assumption of a uniform execution duration of all thread blocks, the partial wave may account for up to 50.0% of the total kernel runtime with a lower occupancy of 32.6%. Try launching a grid with no partial wave. The overall impact of this tail effect also lessens with the number of full waves executed for a grid.
You have two waves, one with 192 blocks, one with 171 blocks. That is quite balanced.
Not sure where the 32.6% is coming from.
I doubt that you will achieve a speed-up of 50% without giving the GPU more work per invocation (e.g. putting two iterative kernel launches into one kernel launch).
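As a cross-check on those numbers, here is a minimal sketch (not from the original posts) of how the blocks-per-wave figure can be queried with the occupancy API; myKernel is only a placeholder for the actual kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder for the real kernel */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // AGX Orin reports 16 SMs

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, /*blockSize=*/128, /*dynSmemBytes=*/0);

    int blocksPerWave = blocksPerSM * prop.multiProcessorCount;
    int totalBlocks   = 121 * 3;              // gridDim = dim3(121, 3)
    printf("blocks/SM = %d, blocks/wave = %d -> %d full wave(s) + %d blocks in the partial wave\n",
           blocksPerSM, blocksPerWave,
           totalBlocks / blocksPerWave, totalBlocks % blocksPerWave);
    return 0;
}
```

If the occupancy limit for the real kernel is 12 blocks per SM (48 resident warps / 4 warps per block), this reproduces the 192 + 171 split reported above.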
I was thinking more of combining it with other launches to make the invocation more efficient, or of running kernels in parallel on other streams, etc.
With only two waves, the relative impact of the tail effect is higher than with more waves.
If you want to explore it further, the usual suggestion (I think) would be to rewrite the kernel as a grid-stride loop, and choose a grid size to exactly fill your GPU. That will reduce your kernel to a single wave. The 50% number is based on the idea that that will be the most efficient launch (with respect to occupancy and the tail effect), and “could” result in a kernel duration equivalent to a single wave of your current realization.
That could be viewed as an “upper bound” or best case outcome. So if you wish, interpret the nsight compute suggestions that way. The percentage speedup is “the most you could possibly get from this change, if all other conditions were perfect for the effect.”
Since Nsight Compute can’t (currently) do the level of analysis needed to accurately predict the actual speedup from a complex refactoring, it gives the output in that sense.
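For concreteness, here is a minimal grid-stride sketch along those lines (hypothetical names and kernel body; the real kernel’s per-element work and its 2D grid indexing are not shown):

```cpp
#include <cuda_runtime.h>

// Grid-stride version: every launched block loops until all n elements
// are processed, so the grid size no longer has to match the problem size.
__global__ void myKernelGridStride(float* data, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= 2.0f;  // stand-in for the real per-element work
    }
}

void launch(float* d_data, int n) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernelGridStride, /*blockSize=*/128, /*dynSmemBytes=*/0);

    // Exactly one full wave: numSMs * blocksPerSM resident blocks, no partial wave.
    myKernelGridStride<<<numSMs * blocksPerSM, 128>>>(d_data, n);
}
```

Sizing the grid to numSMs * blocksPerSM is what makes the partial wave disappear; whether it actually shortens the kernel is exactly the question discussed below.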
After switching to a grid-stride loop and reusing threads, the duration dropped to 18 microseconds, a 25% improvement. Additionally, the “Tail Effect: Est. Speedup: 50%” suggestion has disappeared from the optimization opportunities section.
I’m curious whether the 50% improvement mentioned in the NCU report can really be achieved.
Hi,
with just the original two waves, one should in general take special care, when using grid-stride loops, over how the work is distributed across the different SMs and SM Partitions, so that one SM Partition does not finish early while others still have several blocks to do (which is like an implicit tail effect). That is especially relevant if one SM runs more than 4 warps (i.e. one SM Partition has more than 1 warp).
In your case, each SM Partition has one warp, so it is less relevant.
I wouldn’t really expect much gain from the grid-stride refactoring by itself, for a case like this (2 waves). As curefab points out, you may not be changing the work breakdown structure very much, and even though you have eliminated the “wave effect”, you probably haven’t done much to actually make the imbalance go away. Naturally this would depend to some degree on how much work you are actually doing per element or per thread. If the work per element or per thread is “large”, then the refactoring is likely to make little difference.
To see something approaching a large difference in performance (like 50%) you would need a situation where the work per thread or per element was almost zero, such that the overhead of work scheduling (e.g. deposition of blocks, traversal of loops, etc.) was dominating your performance. Such characteristics are not a hallmark of good CUDA code anyway, although people sometimes wrestle with those cases as well.