Overblown Stack Frame

I have a fairly simple kernel that uses CUTLASS collectiveMMA to do a CTA-tiled 128x64x8 GEMM, where a CTA has 128 threads. If I compile this kernel “as is”, NVCC reports an 8-byte stack frame, implying that my register arrays are indeed placed in registers, which I also confirmed from the SASS.

Now, my workload involves doing this GEMM within a dynamic loop, i.e. one whose trip count is only known at run time. So I compile the same kernel body wrapped in such a loop. Note that the index of this loop is not used for any memory addressing; the loop is strictly for control flow. NVCC now demands 1184 bytes of stack frame. WHY?

The SASS for this looped kernel is fairly incomprehensible, as the STL and LDL instructions do not easily correlate with any meaningful memory access. Here too, I confirmed from the SASS that my register arrays are indeed placed in registers. So the crucial question is: “what storage requires that much stack memory?”

The above kernel was capped at 128 registers per thread and runs in 39 us.

I then removed the register constraint. NVCC now uses all 255 registers, and the stack frame drops to about 500 bytes. With this change, performance is largely similar at 33 us.

Help me understand, please, why NVCC requires so much stack frame when it:

  1. seems to be unnecessary, given that without the loop it demands only 8 bytes
  2. also does not impact performance by much. Typically, more spill loads and stores translate to lower performance, but here this trend does not apply at all.

I have not attached the code as it may be hard to understand for those unfamiliar with CUTLASS, but if you desire to see the code, let me know and I will edit this post to include it.

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

A stack frame can be needed for calling device functions (if they are not inlined), for dynamic addressing (e.g. a local array indexed with a run-time value), or for register spills.
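To make the dynamic-addressing case concrete, here is a minimal sketch. The kernels and the host helper (static_index, dynamic_index, run_dynamic_index) are hypothetical and have nothing to do with the CUTLASS code in question; the file name in the comment is also just an assumption. The idea is to compile with -Xptxas -v and compare the stack frame reported for each kernel. The exact numbers depend on the compiler version and target architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels for illustration; not taken from the original post.
// Build with:  nvcc -c -Xptxas -v stack_demo.cu
// and compare the "bytes stack frame" figure ptxas prints for each kernel.

// After unrolling, every index into acc[] is a compile-time constant,
// so the array can normally be kept in registers.
__global__ void static_index(const float* in, float* out) {
    float acc[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i) acc[i] = in[threadIdx.x * 8 + i] * (i + 1);
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i) sum += acc[i];
    out[threadIdx.x] = sum;
}

// sel is only known at run time, so acc[] usually has to be addressable
// and ends up in local memory, which shows up as stack frame.
__global__ void dynamic_index(const float* in, float* out, int sel) {
    float acc[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i) acc[i] = in[threadIdx.x * 8 + i] * (i + 1);
    out[threadIdx.x] = acc[sel & 7];
}

// Host helper: runs dynamic_index on one block with all inputs set to 1
// and returns the value written by thread 0 (or -1 on a CUDA error).
float run_dynamic_index(int sel) {
    const int threads = 32, n = threads * 8;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    float *d_in = nullptr, *d_out = nullptr;
    float h_out = -1.0f;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    dynamic_index<<<1, threads>>>(d_in, d_out, sel);
    if (cudaMemcpy(&h_out, d_out, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) h_out = -1.0f;
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

If the SASS of the real kernel shows LDL/STL with computed offsets rather than fixed ones, that would point to this case; fixed offsets are more typical of spills.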

It seems (as you have brought it down from 1184 to 500 bytes) that a large part comes from register spill.

Perhaps too many loops are unrolled and too much memory is prefetched into registers?

Hey, thanks for responding so quickly!

All functions are inlined and more importantly there is no dynamic addressing in the code, which is why NVCC only uses 8 bytes when the outer loop is removed.

As for the reduction from 1184 to 500 bytes, this is itself anomalous: the code allocates two register arrays, and in both the 1184-byte and the 500-byte case the SASS confirms that both arrays reside in registers. So it is very unclear what this stack frame is used for.

Also, 255 registers is way above my kernel’s budget and will significantly degrade its occupancy. As mentioned above, restricting to 128 registers does not change performance by much, which again reinforces the point of this stack frame allocation being overblown.

The unrolling you mention may be the culprit, as, indeed, most loops within the GEMM code have static trip counts, can be fully unrolled, and have been annotated as such with #pragma unroll.

Logically following your point here, it may also explain the anomaly of NVCC using only 8 bytes without the outer loop but 1184 bytes with it.

Yet I fail to see why adding the outer, dynamic loop would make the compiler increase the stack frame by so much. Specifically, this dynamic loop is not unrollable, so there should, ideally, be no stack frame difference between this and the loop-free code, which demands 8 bytes. Ideally, both should be the same!

You can always use #pragma unroll 1 to prevent unrolling specific loops or use an unroll factor, e.g. #pragma unroll 2 for doubling the internals of the loop.

In this way, you can see the effect of loop unrolling on registers and stack frame.
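As a rough sketch of how such an experiment could be set up: the code below is a hypothetical stand-in, where tile_body is only a placeholder for your fully unrolled GEMM body, and looped_kernel and run_looped are made-up names. The unroll factor of the dynamic outer loop is a template parameter, so ptxas reports registers and stack frame for each variant separately. This toy body is far too small to spill on its own; it only shows the mechanics.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the looped kernel; not the poster's CUTLASS code.
// Build with:  nvcc -c -Xptxas -v unroll_demo.cu
// ptxas then prints registers and stack frame once per instantiation.

// Placeholder for the fully unrolled inner work (the GEMM tile in the post).
__device__ __forceinline__ void tile_body(const float* in, float (&acc)[16]) {
    #pragma unroll
    for (int i = 0; i < 16; ++i) acc[i] += in[threadIdx.x * 16 + i];
}

// U is the unroll factor applied to the dynamic outer loop.
// U = 1 forbids unrolling; U = 2, 4, ... duplicates the body that many times.
template <int U>
__global__ void looped_kernel(const float* in, float* out, int trips) {
    float acc[16] = {};
    #pragma unroll U
    for (int t = 0; t < trips; ++t) tile_body(in, acc);  // trips known only at run time
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 16; ++i) sum += acc[i];
    out[threadIdx.x] = sum;
}

// Explicit instantiations so that every variant is compiled and reported.
template __global__ void looped_kernel<1>(const float*, float*, int);
template __global__ void looped_kernel<2>(const float*, float*, int);
template __global__ void looped_kernel<4>(const float*, float*, int);

// Host helper: runs the U = 2 variant with all inputs set to 1 and returns
// the value written by thread 0 (or -1 on a CUDA error).
float run_looped(int trips) {
    const int threads = 32, n = threads * 16;
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    float *d_in = nullptr, *d_out = nullptr;
    float h_out = -1.0f;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    looped_kernel<2><<<1, threads>>>(d_in, d_out, trips);
    if (cudaMemcpy(&h_out, d_out, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) h_out = -1.0f;
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With the real loop body in place of tile_body, comparing the ptxas output for U = 1 against the loop-free kernel would show whether the stack frame comes from the loop itself or from the compiler duplicating the body.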


@Curefab Sorry to belabor the point, but the code whose unrolling you recommend I reduce is identical in both compiled versions, yet NVCC demands 8 bytes for one and over 1000 bytes for the other. The 8-byte code is the loop body of the other. No changes whatsoever.

Logically, wouldn’t you expect the SASS for the dynamic-looped code to compile to the same loop body but simply with a branch instruction at the end? To clarify, the dynamic loop is not unrolled.

I am more than willing to share my code if you are interested.

With #pragma unroll 1 you can make sure that a loop is not unrolled. That gives you more control. It may well have no effect here.

#pragma unroll 2 (and other unroll factors) is the interesting case. You currently only see the difference between no unrolling and full unrolling. With intermediate unroll factors you can test which effect comes from the dynamic loop itself and which comes from longer code that, for example, needs more registers.