Nvcc 13.2 spilling many registers for cc 120

Hello,

I have two fairly large kernels for which nvcc 13.2 fails to produce satisfactory assembly when targeting compute capability 12.0 (sm_120). I’m observing high register usage and register spilling that do not occur for earlier compute capabilities (8.6 and 9.0).

You can have a look at the code here:

To keep the discussion short, let’s focus on the smaller kernel, named “kernel_mlp_forward”.

Compiling with __launch_bounds__(1024, 1):

nvcc Test.cu -arch=sm_90 --use_fast_math --ptxas-options=-v
ptxas info : Compiling entry function '_Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb' for 'sm_90'
ptxas info : Function properties for _Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 64 registers, used 1 barriers, 192 bytes smem
ptxas info : Compile time = 0.000 ms

nvcc Test.cu -arch=sm_120 --use_fast_math --ptxas-options=-v

ptxas info : Compiling entry function '_Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb' for 'sm_120'
ptxas info : Function properties for _Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb
480 bytes stack frame, 1368 bytes spill stores, 1488 bytes spill loads
ptxas info : Used 64 registers, used 1 barriers, 480 bytes cumulative stack size, 192 bytes smem
ptxas info : Compile time = 0.000 ms

When targeting cc 90, the kernel fits in 64 registers without spilling. With cc 120, however, ptxas stays at 64 registers but needs a 480-byte stack frame with 1368 bytes of spill stores and 1488 bytes of spill loads. The kernel is about 8x slower on an RTX 5090 than on an RTX 3090, which is unacceptable for my use case (real-time rendering).
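For anyone who wants to cross-check the ptxas numbers at runtime, here is a minimal sketch that applies the same launch bounds to a kernel and reads back its resource usage with cudaFuncGetAttributes. The kernel `demo_kernel` is a hypothetical stand-in and not my actual code; `numRegs` and `localSizeBytes` are the runtime fields that should roughly mirror the "Used N registers" and stack/spill figures above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernel_mlp_forward is not reproduced here.
// __launch_bounds__(1024, 1) promises at most 1024 threads per block and asks for
// at least 1 resident block per SM, which limits the registers available per thread.
__global__ void __launch_bounds__(1024, 1) demo_kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // Per-thread register count chosen for the code that will run on this device.
    std::printf("registers per thread : %d\n", attr.numRegs);
    // Per-thread local memory; a non-zero value usually points to a stack frame or spills.
    std::printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    std::printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Building this once with -arch=sm_90 and once with -arch=sm_120 and comparing the output on the matching GPU would be one way to confirm that the spilling seen in the ptxas log is what actually ends up running.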

Compiling without __launch_bounds__(…):

nvcc Test.cu -arch=sm_90 --use_fast_math --ptxas-options=-v

ptxas info : Compiling entry function '_Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb' for 'sm_90'
ptxas info : Function properties for _Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 62 registers, used 1 barriers, 192 bytes smem
ptxas info : Compile time = 0.000 ms

nvcc Test.cu -arch=sm_120 --use_fast_math --ptxas-options=-v

ptxas info : Compiling entry function '_Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb' for 'sm_120'
ptxas info : Function properties for _Z18kernel_mlp_forward22Cuda_PerBatchInputData20Cuda_PerCamInputDatabbbb
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 144 registers, used 1 barriers, 192 bytes smem

In that case, the kernel uses 62 registers for cc 90 by default, which indicates that a single block of 1024 threads is a good fit. However, the kernel uses 144 registers for cc 120, which is counter-intuitive.
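To make the register arithmetic explicit, here is a small host-only sketch. It assumes 64K 32-bit registers per SM for all three targets and ignores the hardware's finer allocation granularity, so treat the numbers as approximate. The helper names are hypothetical and only serve the illustration.

```cuda
#include <cstdio>

// Assumption: 64K 32-bit registers per SM on cc 8.6, 9.0 and 12.0.
constexpr int kRegsPerSM = 64 * 1024;
constexpr int kWarpSize  = 32;

// Register budget per thread if one block of `threads_per_block` threads must fit on an SM.
constexpr int regs_per_thread_budget(int regs_per_sm, int threads_per_block)
{
    return regs_per_sm / threads_per_block;
}

// True if a block with this many threads and registers per thread fits in the register file.
constexpr bool block_fits(int regs_per_sm, int threads_per_block, int regs_per_thread)
{
    return regs_per_thread * threads_per_block <= regs_per_sm;
}

// Largest block size (a whole number of warps) that fits at a given register count.
constexpr int max_block_threads(int regs_per_sm, int regs_per_thread)
{
    return (regs_per_sm / (regs_per_thread * kWarpSize)) * kWarpSize;
}

static_assert(regs_per_thread_budget(kRegsPerSM, 1024) == 64, "1024 threads leave 64 registers each");
static_assert(block_fits(kRegsPerSM, 1024, 62), "the cc 90 build fits a 1024-thread block");
static_assert(!block_fits(kRegsPerSM, 1024, 144), "the cc 120 build does not");

int main()
{
    std::printf("budget at 1024 threads      : %d registers\n",
                regs_per_thread_budget(kRegsPerSM, 1024));
    std::printf("max block size at 144 regs  : %d threads\n",
                max_block_threads(kRegsPerSM, 144));
    return 0;
}
```

Under these assumptions, the 144-register build could not even launch a 1024-thread block, which is why the launch bound forces the cc 120 build down to 64 registers and into spilling.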

As for the second kernel, nvcc uses 167 registers with cc 90 (no spilling) but needs all 255 registers plus a lot of spilling with cc 120.

Why does cc 120 require more than 2x the number of registers compared to cc 90 and cc 86?

Lastly, nvcc 12.8 seems to spill to a lower extent (though still significantly) compared to nvcc 13.2, which indicates something has changed recently in the compiler internals.

I tried the workaround suggested here, which seems to work:

I compiled my code for the Ada architecture with set(CMAKE_CUDA_ARCHITECTURES "89-virtual"). This way nvcc only generates PTX, which the driver JIT-compiles to SASS.
I tested with the latest Windows driver 595.79, and the kernel now takes 0.5 ms instead of 8 ms.
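In case it helps others verify that the driver JIT path is really being used, here is a minimal sketch that prints the PTX and binary versions the runtime reports for a kernel. `probe_kernel` is a hypothetical placeholder. My assumption is that a PTX-only build for compute_89 would report a PTX version of 89 and a binary version matching the installed GPU, but I have not confirmed the exact values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the kernel you want to inspect.
__global__ void probe_kernel(float* out)
{
    out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

int main()
{
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFuncAttributes attr{};
    err = cudaFuncGetAttributes(&attr, probe_kernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ptxVersion and binaryVersion are encoded as major * 10 + minor.
    std::printf("device compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("PTX version of kernel    : %d\n", attr.ptxVersion);
    std::printf("binary version of kernel : %d\n", attr.binaryVersion);
    std::printf("registers per thread     : %d\n", attr.numRegs);
    std::printf("local memory (bytes)     : %zu\n", attr.localSizeBytes);
    return 0;
}
```

Outside CMake, the same PTX-only build should be obtainable with nvcc's -gencode arch=compute_89,code=compute_89 option. Comparing `numRegs` and `localSizeBytes` between this build and the native sm_120 build would show whether the driver's JIT makes different register allocation choices than the offline ptxas.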