Hello,
I have a kernel that uses 93 registers.
ptxas info : 218125 bytes gmem, 920 bytes cmem[3]
ptxas info : Compiling entry function '_ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_' for 'sm_80'
ptxas info : Function properties for _ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 93 registers, 7136 bytes smem, 432 bytes cmem[0], 64 bytes cmem[2]
If I add launch bounds (1024,1), the register usage goes down to 64, as expected. However, I don’t see any spilling.
ptxas info : Overriding global maxrregcount 255 with entry-specific value 64 computed using thread count
ptxas info : 218125 bytes gmem, 920 bytes cmem[3]
ptxas info : Compiling entry function '_ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_' for 'sm_80'
ptxas info : Function properties for _ZN4pele7physics9reactions5utils19fKernelSpecOpt_CUDAINS2_7CYOrderEEEvidPKdPdS6_S6_S6_S6_S6_S6_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 64 registers, 7136 bytes smem, 432 bytes cmem[0], 64 bytes cmem[2]
If I tighten the launch bounds to (1024,2), which restricts register usage to 32 per thread, I then see register spilling.
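For reference, this is roughly how I arrive at the 64 and 32 figures. It is only a sketch of the arithmetic, assuming 65536 32-bit registers per SM on sm_80 and a ceiling of 255 registers per thread. The function name regCapFromLaunchBounds and the rounding down to a multiple of 8 are my own assumptions for illustration, not something ptxas reports.

```cpp
#include <algorithm>

// Assumed hardware limits for sm_80 (A100-class parts).
constexpr int kRegsPerSM = 65536;        // 32-bit registers available per SM
constexpr int kMaxRegsPerThread = 255;   // architectural per-thread ceiling
constexpr int kRegGranularity = 8;       // assumption: cap is rounded down to a multiple of 8

// Hypothetical helper: estimate the per-thread register cap implied by
// launch bounds (maxThreadsPerBlock, minBlocksPerSM).
constexpr int regCapFromLaunchBounds(int maxThreadsPerBlock, int minBlocksPerSM) {
    // All resident threads on the SM must share the register file.
    const int raw = kRegsPerSM / (maxThreadsPerBlock * minBlocksPerSM);
    // Round down to the assumed allocation granularity.
    const int rounded = (raw / kRegGranularity) * kRegGranularity;
    // Never exceed the per-thread ceiling.
    return std::min(rounded, kMaxRegsPerThread);
}

// Compile-time checks against the two cases from my build logs.
static_assert(regCapFromLaunchBounds(1024, 1) == 64, "one block of 1024 threads per SM");
static_assert(regCapFromLaunchBounds(1024, 2) == 32, "two blocks of 1024 threads per SM");
```

Under these assumptions, (1024,1) gives the cap of 64 that ptxas reports as the entry-specific value, and (1024,2) halves it to 32.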
I was under the impression that any time I restrict register usage with launch bounds, registers would spill. Instead, it appears that the compiler may be able to find an optimization that reduces register usage without spilling. Could someone help me understand what may be happening under the hood? I am also attaching a snapshot comparing the live registers as seen in Nsight Compute.