Hello,
I am profiling a kernel with Nsight Compute. This kernel is heavily latency bound as it requires a lot of registers. I am trying to use the Live register count to optimize it.
However, some of the per-line register counts don’t make sense to me. For instance, see the attached image, where:
On line 551 the register count is 48 and a brace opens. Inside the braces the register count increases, which makes sense, and the brace closes on line 559. Then on line 561, which is a comment, the register count suddenly jumps to 112.
This does not make sense to me, and I wonder whether the live register counts are misaligned with the source because of the call to the inlined function “calcul_xg_tetra”. Could that be the case?
Is there anything specific I need to do to take these inlined functions into account?
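To make the pattern concrete, here is a reduced sketch of the structure described above. It is not the actual kernel: `Vec3`, `process_cell`, and the signature and body of `calcul_xg_tetra` are assumptions made up for illustration. A braced scope with short-lived temporaries is followed by a comment line and then a call to an inlined function, which is where one would expect any register increase from the inlined body to be attributed.

```cpp
// Hypothetical reduced pattern; only the name calcul_xg_tetra comes from the post.
#ifdef __CUDACC__
#define SKETCH_INLINE __host__ __device__ __forceinline__
#else
#define SKETCH_INLINE inline  // lets the sketch also build as plain C++
#endif

struct Vec3 { double x, y, z; };

// Assumed behavior: centroid of a tetrahedron given its four vertices.
SKETCH_INLINE Vec3 calcul_xg_tetra(const Vec3& a, const Vec3& b,
                                   const Vec3& c, const Vec3& d) {
    return { 0.25 * (a.x + b.x + c.x + d.x),
             0.25 * (a.y + b.y + c.y + d.y),
             0.25 * (a.z + b.z + c.z + d.z) };
}

SKETCH_INLINE double process_cell(const Vec3 v[4]) {
    double acc = 0.0;
    {   // braced scope: these temporaries are only needed in here
        const double ex = v[1].x - v[0].x;
        const double ey = v[1].y - v[0].y;
        const double ez = v[1].z - v[0].z;
        acc += ex * ex + ey * ey + ez * ez;
    }

    // A comment line: no instructions should map to this line.
    const Vec3 xg = calcul_xg_tetra(v[0], v[1], v[2], v[3]);
    return acc + xg.x + xg.y + xg.z;
}
```

A small case like this, profiled the same way, might help show whether the count for the inlined body lands on the call line, on the lines inside `calcul_xg_tetra`, or on a neighboring line.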
Thanks in advance,
Rémi
I am compiling the code with nvc++ from the NVIDIA HPC SDK 25.3, with CUDA 12.8 and driver 570.133.07, on an Ubuntu 24.04 machine with an NVIDIA RTX 6000 Ada GPU. The kernel is written using the Kokkos framework. My compilation flags include -lineinfo and my ncu run uses -import-sources. The ncu version is 2025.1.0.0 (build 35237751) (public-release).
