Effect of launch bounds on register usage and spillage

Hello,

I have a kernel that uses 93 registers.

ptxas info    : 218125 bytes gmem, 920 bytes cmem[3]
ptxas info    : Compiling entry function '_ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_' for 'sm_80'
ptxas info    : Function properties for _ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 93 registers, 7136 bytes smem, 432 bytes cmem[0], 64 bytes cmem[2]

If I add launch bounds (1024,1), as expected, the register usage goes down to 64. However, I don’t see any spillage.

ptxas info    : Overriding global maxrregcount 255 with entry-specific value 64 computed using thread count
ptxas info    : 218125 bytes gmem, 920 bytes cmem[3]
ptxas info    : Compiling entry function '_ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_' for 'sm_80'
ptxas info    : Function properties for _ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 7136 bytes smem, 432 bytes cmem[0], 64 bytes cmem[2]

If I push the launch bounds to (1024,2) to restrict register usage to 32 per thread, I then see register spillage.

I was under the impression that any time I restrict register usage using launch bounds, registers will spill, but it appears the compiler can sometimes find an optimization that reduces register usage without spillage. Could someone help me understand what may be happening under the hood? I am also attaching a snapshot comparing the live registers as seen in Nsight Compute.
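
For reference, this is roughly how I am applying the bounds in the source (the kernel signature below is just a placeholder, not my actual kernel):

    // Placeholder kernel illustrating only the decoration; the real kernel differs.
    // With __launch_bounds__(1024, 2), ptxas has to fit 1024 * 2 = 2048 threads per SM,
    // so on sm_80 (65536 registers per SM) it caps usage at 65536 / 2048 = 32 registers
    // per thread; with (1024, 1) the cap is 65536 / 1024 = 64, which is consistent with
    // the 64- and 32-register results I observed.
    __global__ void __launch_bounds__(1024, 2)
    fKernelSketch(const double* in, double* out)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];  // actual kernel body omitted
    }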

That’s clearly not the case (restricting register usage via launch bounds does not always lead to spilling), so you should drop that notion. This is not the only example you might find.

You can find out what is happening under the hood using the CUDA binary utilities.
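
For example, to dump and compare the generated SASS for the two compilations (the file names here are placeholders):

    cuobjdump -sass kernel_no_bounds.o > sass_93reg.txt
    cuobjdump -sass kernel_bounds.o    > sass_64reg.txt
    diff sass_93reg.txt sass_64reg.txt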

If you are looking for rationale, I would assume that the compiler is a complex entity, and it might find more than one way to reach a particular position, given different control inputs. The compiler generally seeks to optimize performance, as a primary (although perhaps not absolute) objective. Since the compiler does not have a way to perfectly measure performance, but undoubtedly has a heuristic, it might be that internally the heuristic told the compiler that the 93 register realization of your code would run faster than the 64 register realization, even though neither of them created spill activity. And of course you might actually measure it and determine, in your particular setting, that that is not the case. But the compiler might see things differently.

An example of how extra register utilization might come about:

Suppose I have entities A and B that I am using frequently. The compiler chooses to put those in registers. Now suppose I have entity C that I need occasionally. C=AxB. Since I am keeping A and B resident in registers, if I need C, I can compute it on the fly. If I compute it once, and need it later, then depending on register pressure, I might choose to keep the computed value around, until it is used later, or I might choose to recompute it later, to free up a register for other use. Neither of these approaches involve spill activity, but they end up having a different register footprint.
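
A contrived source-level sketch of that idea (the names are made up, and in practice the compiler makes this choice on its own regardless of how the source is written):

    __global__ void sketch(const double* in, double* out, int n)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        double a = in[i];       // A: used frequently, kept resident in a register
        double b = in[i + n];   // B: used frequently, kept resident in a register
        double c = a * b;       // C = A x B, needed only occasionally
        out[i] = c + a;
        // ... many instructions later, C is needed again. The compiler can either have
        // kept c alive in a third register the whole time (larger register footprint),
        // or recompute a * b here (one extra multiply, smaller footprint). Neither
        // choice involves any spill to local memory.
        out[i + n] = a * b + b;
    }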


Another important example of a performance tweak that needs registers is the assembler (ptxas) scheduling memory reads as early as possible. That is not necessarily fair to other warps, but at least the current warp will not have to block and wait for the memory contents to arrive.

If, on the other hand, you limit the number of registers, the memory read and the use of its result are moved closer together, leading to the warp stalling for longer (showing up as long scoreboard stalls). To compensate, you need higher occupancy (more resident warps), which leaves fewer registers per thread.

You have to decide, for each of your kernels, which direction (more or fewer registers per thread) leads to better performance after optimization. Using more registers and fewer warps is difficult to optimize: the kernel has to use the additional registers very effectively to reduce stalling (this happens at assembly time, and it is difficult to control the specific register assignment and the order of the generated instructions). With fewer registers and more warps, the extra warps simply hide and average out the latencies at run time; that is easier, because there are not many registers per thread anyway, so there are not many variations to optimize over, as long as you avoid spilling, which can be a challenge with few registers. The optimal sweet spot typically lies somewhere in the middle. For some lucky kernels that are not register-bound, it does not matter at all.

Each thread needs a certain minimum number of registers for basic functioning. So running few warps leaves a lot of extra registers for buffering memory loads, while running many warps leaves nearly no registers for memory buffering.
For very memory-bound kernels, it is often better to have fewer warps and more registers per thread, so that as many registers as possible can be used for memory loading.
For more compute-bound kernels, you also want to hide the fixed latencies of compute instructions and keep at least a medium number of warps resident, sacrificing the extra buffer space for memory loading.
By specializing warps (some doing memory loading, others doing computation) and synchronizing between them, you can combine the advantages of both approaches: the memory-loading warps can use nearly all of their registers as buffer space, since they need fewer registers for their basic operation.
Or, since Ampere, you can asynchronously copy global memory directly into shared memory (cp.async), bypassing registers entirely.
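
A minimal sketch of that last option, using the cooperative groups memcpy_async API (requires compiling for sm_80 or newer to get the cp.async hardware path; the tile size and types are just for illustration):

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    __global__ void stage_through_smem(const double* __restrict__ gin, double* __restrict__ gout)
    {
        __shared__ double tile[128];            // assumes blockDim.x == 128
        auto block = cg::this_thread_block();

        // Global -> shared copy issued asynchronously; on Ampere this maps to cp.async,
        // so the data is not staged through registers at all.
        cg::memcpy_async(block, tile, gin + blockIdx.x * 128, sizeof(double) * 128);

        // ... independent computation can overlap with the copy here ...

        cg::wait(block);                        // wait until the asynchronous copy has completed
        gout[blockIdx.x * 128 + threadIdx.x] = 2.0 * tile[threadIdx.x];
    }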


@Robert_Crovella I have been running my kernel with 128 threads per block. So I reran the kernel after setting launch bounds of (128,1), and that resulted in register usage similar to that obtained without setting launch bounds. So it is almost as if I had to lie to the compiler (tell it that I intend to run with 1024 max threads per block when I really intend to run it with 128 threads) to get better performance. Would you say this behavior is by design? Shouldn’t telling the compiler how I intend to run the code result in the most performant possible code?

I haven’t seen you mention anything about actual performance in this thread, until now. I cannot comment on anything that you didn’t ask about and have provided basically (still) no information about.


No. The compiler’s heuristics are well developed at this point, and in most cases the compiler finds a near optimal (from a performance perspective) trade-off between occupancy and registers used per thread. The programmer interfering with this automagical process by adding additional constraints via the __launch_bounds__() attribute or the -maxrregcount compiler switch usually leads to worse performance.

As indicated by the qualifiers “most” and “usually” there can certainly be situations in which the compiler heuristics pick a sub-optimal balance between occupancy and register use per thread. If the resulting application-level performance difference versus a manually tweaked solution is large (by some definition of large, at least 5%), I would suggest filing an enhancement request with NVIDIA.

Not even the most sophisticated set of heuristics, however, is capable of producing near optimal results for 100% of cases. Profiler-feedback-directed optimization could potentially address some of the cases not handled well by heuristics, but I don’t think something like that has been incorporated into nvcc yet.


@Robert_Crovella Here are some performance details. Please let me know if any more performance details could help, and I can share relevant screenshots from Nsight Compute.

  1. My kernel, without any launch bounds, is launched using 128 threads per block. It uses 93 registers and has an occupancy of 20 warps/SM. The time reported by Nsight Compute is about 420 microseconds. The reported compute throughput is about 33% and the memory throughput is about 77%.

  2. The same kernel, launched with 128 threads/block but this time with launch bounds of (1024,1), uses 64 registers and has an occupancy of 32 warps/SM. The time reported by Nsight Compute is about 355 microseconds. The reported compute throughput is about 39% and the memory throughput is about 93%.

I also tried setting launch bounds of (128,1), which results in the same performance as scenario 1 above. I am curious why I obtain better performance when setting launch bounds of (1024,1) vs. (128,1), considering that in both cases I am launching the kernel with 128 threads/block.

My guess as to the proximal answer is the higher occupancy. I can say that I have worked on a number of codes where I tried to “force” higher occupancy by limiting registers per thread, and in every case I can remember, the performance got worse. Obviously the case you are describing seems to be the opposite of that, so my confidence in the idea that higher occupancy can lead to higher performance is somewhat restored. Yay!

When I read the launch bounds section of the programming guide, I note this:

The right value for minBlocksPerMultiprocessor should be determined using a detailed per kernel analysis.

Have you done that?

Assuming you haven’t, I suspect that by specifying (128,8) instead of (128,1), you might get better performance, also. I won’t be able to give you a precise recipe for what that detailed per-kernel analysis might look like, but from a static perspective it could involve a register usage study, and from a dynamic perspective you could certainly just shmoo (i.e. try) a bunch of values, to see what the performance profile looks like as you vary that parameter. Given that you have chosen to use launch bounds in both cases that you asked about (the stuff I excerpted at the beginning of this entry), it seems to me like you need additional work to find the best perf.
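
As one piece of that dynamic study, you can also query at run time how many of your 128-thread blocks a given compilation actually allows to be resident per SM (a sketch; the kernel here is a placeholder, not your code):

    #include <cstdio>

    __global__ void __launch_bounds__(128, 8) myKernel(double* x)
    {
        x[blockIdx.x * blockDim.x + threadIdx.x] += 1.0;
    }

    int main()
    {
        int blocksPerSM = 0;
        // How many 128-thread blocks fit per SM, given this kernel's register and
        // shared memory footprint?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 128, 0);
        printf("resident blocks/SM: %d (= %d warps/SM)\n", blocksPerSM, blocksPerSM * 128 / 32);
        return 0;
    }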

A slightly different question you could ask, but I would say you did not ask, exactly, is:

“Why did I get better performance when I decorated with launch bounds, than in the undecorated case?” In that case, my response is, “somehow, the launch bounds aided the compiler heuristics”. Since the compiler heuristics are not published, I cannot provide any more description than that.

There’s still information you have not provided that might matter.

  1. Bugs are always possible. Are you using the latest CUDA version and latest GPU driver?
  2. Other compilation settings may matter. If you are compiling for the sm_70 arch but running on a cc 7.5 device, for example, that could be a factor. It looks like you are on an A100 (from the profiler picture), so compiling for sm_75 but running on an sm_80 device could have some bearing.
  3. There might be other compilation settings that matter. For example, presumably you are not specifying (also) -maxrregcount which would further cloud things.

If you want to provide a complete test case, you can always file a bug.

And if you have not yet done so, you might want to read the launch bounds section I referred to again. There are other statements in there that might be worth testing, such as the suggestion to omit the minBlocksPerMultiprocessor argument altogether. I think depending on how you interpret the definition of that argument, as well as the recommendations given, you might conclude that you are not really correctly informing the compiler when you specify 1 there.

The way I interpret minBlocksPerMultiprocessor is that you are making a statement to the compiler “do whatever is necessary so that I can have at least this many blocks resident per SM.” In that light, saying “I am OK with just one block per SM” vs. “I would like at least 8 blocks per SM” are two very different statements.
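
In code, the two statements look like this (placeholder kernels, not your code):

    // "Block size will not exceed 128 threads; a single resident block per SM is acceptable."
    __global__ void __launch_bounds__(128, 1) kernelA(double* x) { x[threadIdx.x] += 1.0; }

    // "Block size will not exceed 128 threads; restrict resources so that at least
    // 8 such blocks (1024 threads) can be resident per SM."
    __global__ void __launch_bounds__(128, 8) kernelB(double* x) { x[threadIdx.x] += 1.0; }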


It might be useful to note that the historical reasons for first introducing the -maxrregcount compiler switch and then the __launch_bounds__() attribute were

(1) Register-starved GPU architectures still primarily designed around 3D graphics
(2) Immature toolchains (home-brew → Open64 → LLVM based) with mediocre machine-specific heuristics

Both of these problem areas had essentially been addressed by about 2014, roughly ten years ago, at which point these options for manual intervention by programmers became largely useless and their use often counterproductive. I would put them into the “things to try if everything else fails” category and personally cannot recall having used them over the past decade.


Assuming you haven’t, I suspect that by specifying (128,8) instead of (128,1), you might get better performance, also.

Tightening the launch bounds to (128,8) is more of an apples-to-apples comparison. I tried that, and got the same performance as (1024,1).

  1. Bugs are always possible. Are you using the latest CUDA version and latest GPU driver?

I am on CUDA 12.2, but considering the above finding I doubt there is a bug.

  2. Other compilation settings may matter. If you are compiling for the sm_70 arch but running on a cc 7.5 device, for example, that could be a factor. It looks like you are on an A100 (from the profiler picture), so compiling for sm_75 but running on an sm_80 device could have some bearing.

I am on an A100, and did compile for sm_80.

  3. There might be other compilation settings that matter. For example, presumably you are not specifying (also) -maxrregcount which would further cloud things.

I am aware of this, but didn’t try it because I have been focusing on a single kernel in a massive codebase. I don’t have fine-grained control over the compilation of individual kernels, and didn’t want to set the flag for the entire application.

Thank you for your detailed response. It all makes more sense now.

Aren’t they necessary to make sure the kernel runs at all with certain block sizes? E.g. if the kernel needs 1024 threads/block cooperating in some way, each thread cannot use 255 registers at the same time.

In my experience it is rare to have a hard lock-in to a very specific block size; rather, block size is a variable design parameter. My recommendation would be to aim for the middle of the available block-size range when designing, not the extreme ends. That said, there can be reasons to choose very small or very large block sizes. With today’s copious register file sizes it does not happen often that one has to take recourse to __launch_bounds__() even in those cases.

It is entirely possible that there are application areas where use of __launch_bounds__() is somewhat common, and I just haven’t encountered them.
