The CUDA compiler has two main components that transform code: the front end, called NVVM, which is based on the widely used LLVM framework, and the back end, ptxas, which performs some general and all machine-specific optimizations. The interface between these two components is PTX, which is both a virtual instruction set and a compiler intermediate format. PTX never interacts directly with the hardware registers. It uses only virtual registers in SSA (static single assignment) fashion, meaning the result of every operation is assigned to a new virtual register. ptxas is responsible for the allocation of physical registers, as this is specific to each GPU architecture.
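As a minimal illustration (file and kernel names are my own), one can dump the PTX for a trivial kernel and inspect the virtual registers:

```
// axpy.cu -- emit PTX with: nvcc -arch=sm_80 -ptx axpy.cu
// In the resulting axpy.ptx, floating-point results land in virtual
// registers (%f1, %f2, ...); physical register assignment happens
// only later, inside ptxas.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];  // typically an fma.rn.f32 in the PTX
    }
}
```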
It is difficult to form a good hypothesis based on the scant information provided, but I think it is possible that using tons of inline PTX code is part of the problem, in that it impedes many powerful high-level optimizations otherwise performed by NVVM. Generally speaking, inline PTX code should be used sparingly, for example to access functionality not efficiently expressible at the C++ level or via intrinsics.
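One legitimate narrow use is reading a special register that has no direct C++-level equivalent; a minimal sketch (the helper name is my own):

```
// Narrowly-scoped inline PTX: reading the %laneid special register.
// The asm statement is tiny and self-contained, so it leaves NVVM's
// high-level optimizations of the surrounding code intact.
__device__ unsigned int lane_id(void)
{
    unsigned int r;
    asm("mov.u32 %0, %%laneid;" : "=r"(r));
    return r;
}
```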
All thread-local variables are by default assigned to local memory. ptxas decides, as part of its optimizations, which of them should be pulled into registers. Scalars and small arrays with addressing resolvable at compile time are the usual candidates. The compiler tries to achieve the highest possible performance and considers trade-offs; for example, heavy register use reduces occupancy and potentially lowers performance. So if there are nested loops, it may assign variables from the innermost loops to registers while variables from the outermost loops remain in local memory.
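A sketch of the usual pattern (variable names are my own): an array accessed only with compile-time-constant indices can be promoted to registers, whereas dynamically indexed arrays typically stay in local memory:

```
__global__ void kernel(const float *in, float *out, int k)
{
    float a[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)  // fully unrolled: all indices are
        a[i] = in[i] * 2.0f;     // compile-time constants, so 'a' is
                                 // a good candidate for registers
    float b[4];
    for (int i = 0; i < 4; i++)
        b[i] = in[i] + 1.0f;

    out[0] = a[0] + a[3];
    out[1] = b[k & 3];           // index unknown at compile time:
                                 // 'b' likely remains in local memory
}
```

Compiling with -Xptxas -v makes the outcome visible: ptxas reports per-kernel register usage along with spill loads/stores and stack frame size, so one can see directly which variables were spilled to local memory.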
The CUDA compiler is mature at this point and usually makes good decisions about register allocation. An observation that some variables are assigned to registers and others to local memory is, by itself, largely irrelevant. What is relevant is whether the choices made by the compiler negatively impact performance, and by how much. No specific performance data was mentioned in the starting post.
It is possible that optimization quality suffers for very large code. I cannot provide a crisp definition of “very large”, but in general this would likely apply to single kernels comprising several tens of thousands of lines of PTX, e.g. 40 KLOC. Programmers expect reasonable compilation times, and since the number of possible arrangements (instruction selection and ordering, register allocation) can grow rapidly with increasing code size, some optimization phases may apply shortcuts if the compiler’s resource usage (time, memory) increases too much. This could result in less thoroughly optimized machine code.
It is also possible that your code is affected by a particular shortcoming of the compiler (inefficiency or bug) that has been addressed in the latest version of the compiler. Compiler engineers are primarily interested in issues observable with the latest shipping toolchain, CUDA 12.1 at present. So if possible, I would suggest trying that first.
I would further suggest using the CUDA profiler to identify the bottlenecks in the code and to observe whether and how those bottlenecks and other important performance statistics fluctuate as source code changes are made. This should yield better insight into which parts of a very large kernel are involved in particular performance regressions.
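Alongside the profiler, even coarse per-kernel timing with CUDA events is useful for tracking whether a given source change helps or hurts; a minimal sketch (the kernel and its launch configuration are placeholders):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel under investigation
__global__ void big_kernel(float *data) { /* ... */ }

int main(void)
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    big_kernel<<<256, 256>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```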