I have a fairly simple kernel that uses CUTLASS CollectiveMma to do a CTA-tiled 128x64x8 GEMM, with 128 threads per CTA. If I compile this kernel as is, NVCC reports an 8-byte stack frame, implying that my register allocations are indeed placed in registers, which I also confirmed from the SASS.
Now, my workload involves running this GEMM inside a dynamic loop, i.e., one whose trip count is only known at runtime. The loop index is not used for any memory addressing; the loop is strictly control flow. When I compile the same kernel wrapped in this loop, NVCC demands 1184 bytes of stack frame. WHY?
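To make the structure concrete, here is a stripped-down sketch of the two variants. This is not my actual code: gemm_tile is a hypothetical stand-in for the CollectiveMma call, and the real kernel carries the usual CUTLASS tensors and accumulators.

    // Stand-in for the CollectiveMma tile computation; the real body is the
    // CUTLASS collective and is omitted here.
    __device__ void gemm_tile(float &acc) {
        acc += 1.0f;  // placeholder work
    }

    // Variant 1: no loop. This is the shape that gets the 8-byte stack frame
    // in my real kernel.
    __global__ void kernel_static(float *out) {
        float acc = 0.0f;
        gemm_tile(acc);
        out[threadIdx.x] = acc;
    }

    // Variant 2: same body inside a dynamic loop. n_iters is a runtime value
    // used only for control flow, never for addressing. This is the shape
    // that triggers the 1184-byte stack frame in my real kernel.
    __global__ void kernel_dynamic(float *out, int n_iters) {
        float acc = 0.0f;
        for (int i = 0; i < n_iters; ++i) {
            gemm_tile(acc);
        }
        out[threadIdx.x] = acc;
    }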
The SASS for this looped kernel is fairly incomprehensible: the STL and LDL instructions do not correlate with any meaningful memory access in my code. Again, I confirmed from the SASS that my register allocations are indeed placed in registers. So the crucial question is: what storage requires that much stack memory?
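For reference, the stack-frame numbers come from ptxas's resource report and the SASS from cuobjdump; the -arch value below is illustrative, not necessarily my actual target:

    nvcc -arch=sm_90 -Xptxas -v -c kernel.cu
    cuobjdump -sass kernel.o | grep -E 'STL|LDL'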
The above kernel was capped at a maximum of 128 registers per thread and runs in 39 us.
I removed the register cap, and NVCC now uses all 255 registers, with the stack frame at about 500 bytes. Performance is broadly similar at 33 us.
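For concreteness, a per-thread register cap like this is typically applied with nvcc's -maxrregcount flag (or per kernel via __launch_bounds__); the exact mechanism does not change the question:

    nvcc -maxrregcount=128 -Xptxas -v -c kernel.cu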
Help me understand, please: why does NVCC require so much stack space when that space:
- seems to be unnecessary, given that without the loop it demands only 8 bytes (see the arithmetic below for scale)
- does not impact performance much? Typically, more spill loads translate to lower performance, but here that trend does not apply at all.
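For scale: 1184 bytes/thread x 128 threads/CTA is about 148 KiB of local-memory backing store per CTA, which is why the number surprised me.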
I have not attached the actual code, as it may be hard to follow for those unfamiliar with CUTLASS, but if you would like to see it, let me know and I will edit this post to include it.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0