Let’s say I have a kernel that takes one of several paths, A, B, C, or D. It’s getting close to the register limit with the number of threads I’m keen to run it on. Now, let’s say I want to add a fifth path that the kernel can take, E, and that finally starts to generate some register spills.
Will the kernel continue to run as fast as it used to if I launch it in such a way that it takes only paths A, B, C, or D? Can I count on the register spills being confined to path E? Or, will the spills be anywhere and everywhere?
Impossible to predict based on the high-level description. Generally speaking, the compiler tries to be smart by spilling where it hurts performance the least, e.g. not inside an innermost loop. Given that this is driven by heuristics, imperfect solutions are bound to occur. In my experience, spilling of 2 to 3 registers is often harmless from a performance perspective. Profiling after every incremental code change will help avoid unpleasant surprises.
You may want to experiment with the [[likely]] and [[unlikely]] attributes for branches in your code. This is a C++20 feature and is at least syntactically supported by the CUDA compiler. From experiments with recent versions of Clang I can see that use of these attributes can make a difference in the generated code. Since Clang and the CUDA compiler are built around the same LLVM framework, the same may be true of the CUDA compiler, although I have yet to see concrete evidence of it.
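For illustration, here is a minimal sketch of where the attributes would go; the kernel, the helper functions, and the path encoding are made up, and whether nvcc's code generation actually changes is something to verify by inspecting the generated SASS. Compiling requires a dialect that includes the attributes, e.g. -std=c++20.

```
// Hypothetical stand-ins for the real per-path work; names are illustrative only.
__device__ float common_work(float x) { return x * 2.0f + 1.0f; }
__device__ float rare_work(float x)   { return x * x * x; }

// Sketch: annotating the branch so the compiler treats path E as cold.
__global__ void pathKernel(const int *path, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (path[i] != 4) [[likely]] {      // paths A..D (0..3)
        data[i] = common_work(data[i]);
    } else [[unlikely]] {               // newly added path E (4)
        data[i] = rare_work(data[i]);
    }
}
```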
Other things you could try are swapping limited amounts of floating-point computation for integer computation or vice versa; use of some mild ad-hoc data compression, such as squeezing two integers of limited range into a single 32-bit quantity; or trying various sub-components of -use_fast_math individually if not already in use. The sub-components are: (1) FTZ mode, (2) approximate square root, (3) approximate division, and (4) use of device function intrinsics in place of certain math functions. Obviously some of these techniques, while achieving a slight reduction in register use, could drive up the dynamic instruction count, which in turn could hurt performance.
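As a sketch of the "two integers in one 32-bit word" idea, assuming both values fit into 16 bits (the helper names are made up for illustration):

```
__device__ __forceinline__ unsigned int pack2(unsigned int lo, unsigned int hi)
{
    return (hi << 16) | (lo & 0xffffu);   // hi in upper half, lo in lower half
}

__device__ __forceinline__ void unpack2(unsigned int p,
                                        unsigned int &lo, unsigned int &hi)
{
    lo = p & 0xffffu;
    hi = p >> 16;
}
```

The individual nvcc switches for the first three fast-math sub-components are --ftz=true, --prec-sqrt=false, and --prec-div=false; the intrinsic substitutions (e.g. __sinf() in place of sinf()) have to be applied by hand in the source if you want to try them separately.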
If there are loops in this code, you could also experiment with specifying various amounts of loop unrolling for them via #pragma unroll, and if there are device functions that normally get inlined, you could experiment with marking some of them __noinline__. The outcome of those experiments could go either way, that is, register pressure could increase or decrease, and performance could improve or worsen.
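A minimal sketch of the two knobs, with a placeholder helper and loop; the unroll factor and the decision to suppress inlining are exactly the sort of thing to sweep while profiling, since either can move register pressure in either direction:

```
// Placeholder helper; __noinline__ forces an ABI call instead of inlining,
// which may raise or lower register pressure in the caller.
__device__ __noinline__ float heavy_helper(const float *v, int k)
{
    return v[k] * v[k] + 0.5f;
}

__global__ void loopKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    #pragma unroll 4                 // try 1, 2, 4, ... and profile each variant
    for (int k = 0; k < 16; ++k) {
        acc += heavy_helper(in, (i + k) % n);
    }
    out[i] = acc;
}
```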
I’ve rejiggered my approach to spread the work over four threads as opposed to one. In the case of the newly added “path E” that means four threads working with three atoms (the fourth is largely idle) instead of one thread working with three atoms. Since there can be at most 1088 atoms, and each group of three is necessarily unique, the earlier approach would only have engaged at most 360 threads when it went down path E. That reduced my register spills, but did not eliminate them, so I also backed off the number of threads per block (2 x 448 = 896, down from 2 x 512 = 1024), and that has eliminated the register spills. My hunch is that it’s better to do it this way, with four threads working instead of one, because even if I do eventually make a separate kernel for path E, this technique I’ve added is pretty register-heavy in the single-threaded implementation. The modestly parallel approach takes advantage of a relatively clean way to slice up the work, and may be worth a mention in the next paper I write.
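To show the indexing idea in simplified form (this is only a hypothetical sketch of the four-threads-per-group mapping; the names and the per-atom work are placeholders, not the actual kernel):

```
__global__ void pathESketch(const float3 *atoms, float *result, int nGroups)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int group = tid / 4;     // one three-atom group per set of four threads
    int lane  = tid % 4;     // lanes 0..2 each take one atom, lane 3 mostly idles
    if (group >= nGroups) return;

    float partial = 0.0f;
    if (lane < 3) {
        float3 a = atoms[3 * group + lane];
        partial = a.x * a.x + a.y * a.y + a.z * a.z;   // placeholder per-atom work
    }
    // ... combine the three partial results, e.g. via shared memory or shuffles ...
    result[4 * group + lane] = partial;
}
```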