How Should maxrregcount Be Properly Set?

I recently came across a comment in another post (Differences Between Stack Frame, Spill Stores, and Spill Loads - #5 by Curefab) stating that if maxrregcount is set too low, excess registers will automatically spill to global memory. This raises two questions for me:

  1. How should maxrregcount be scientifically determined?
  2. Does excess register usage really spill automatically to global memory?

For the first question, I had previously considered using NVIDIA Nsight Compute (ncu) with its source-level analysis feature to examine register usage during execution. However, this approach seems quite complex, and I haven't actually tried it myself. By the way, I recently noticed that DeepSeek V3 uses a different approach for quantization: it employs one warpgroup (WG) as a producer, another WG for WGMMA, and yet another WG for quantization scaling. This appears to differ from the CUTLASS warp-specialized (WASP) GEMM approach, and I wonder how they determine the value for maxrregcount.
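For what it's worth, there is a lighter-weight way to check register usage than a full ncu source-level analysis. The sketch below (with a placeholder `my_kernel`) queries the standard runtime API; alternatively, compiling with `nvcc -Xptxas -v` prints registers per thread and spill bytes at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* out) { /* kernel under study */ }

int main() {
    // Query the attributes ptxas baked into the compiled kernel:
    // numRegs is registers/thread; localSizeBytes includes spill space.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers/thread: %d, local memory bytes: %zu\n",
           attr.numRegs, attr.localSizeBytes);
    return 0;
}
```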

For the second question, I think automatic spilling is not always guaranteed! Here’s a potential counterexample:
FlashAttention Example.

In this case, an incorrectly chosen maxrregcount value results in a CUDA error, rather than the excess values smoothly spilling to global memory.
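As a minimal sketch of one way such a failure can surface at launch time (an assumption about the linked example: that the per-thread register demand times the block size exceeds the SM's register file, rather than a compile-time ptxas error):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void big_kernel(float* out) { /* register-hungry body */ }

int main() {
    big_kernel<<<1, 1024>>>(nullptr);      // large block size
    cudaError_t err = cudaGetLastError();  // check the launch itself
    if (err == cudaErrorLaunchOutOfResources)
        printf("too many resources requested for launch\n");
    return 0;
}
```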

The scientific method is to look at it empirically: try out different values and profile which works best (fastest).

If more registers are available, ptxas (the assembler) keeps more intermediate values; if fewer are available, it recomputes some values. (E.g., if you use both 2.0f * a and a several times, in one case it keeps both values in registers; in the other it keeps only a and recomputes the product.)
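A minimal sketch of that trade-off (hypothetical kernel; whether ptxas keeps or recomputes `b` depends on the register budget and architecture):

```cuda
__global__ void reuse_example(const float* __restrict__ in,
                              float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];
    float b = 2.0f * a;    // candidate for rematerialization
    out[i]     = b + a;    // both values used here ...
    out[i + n] = b - a;    // ... and here (assumes out holds 2*n floats)
}
```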

If more registers are available, then for an unrolled loop ptxas issues memory loads earlier and keeps more loads in flight to better hide latency; if fewer registers are available, memory loads are issued only once previously used registers are free again.
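For example, in a fully unrolled loop like the sketch below (hypothetical kernel), a generous register budget lets ptxas hoist all four loads to the top and overlap their latencies; a tight budget forces it to issue each load only once a register frees up:

```cuda
__global__ void unrolled_sum(const float* __restrict__ in,
                             float* __restrict__ out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 >= n) return;
    float acc = 0.0f;
#pragma unroll
    for (int k = 0; k < 4; ++k)
        acc += in[base + k];  // four independent loads ptxas may overlap
    out[base / 4] = acc;
}
```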

So not only do the spills change; the demand for registers itself adapts to what is available.

If the additional values are really needed, they are spilled to local memory (which resides in global memory).

Only certain "round" register counts are supported on each architecture, since registers are allocated at a fixed granularity. So you cannot specify just any value; at the least, the value you specify is rounded to that granularity.
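Also worth noting: instead of the translation-unit-wide --maxrregcount flag, you can give ptxas a per-kernel budget with __launch_bounds__, from which it derives the register cap itself. A minimal sketch (hypothetical kernel):

```cuda
// At most 256 threads per block, and aim for at least 2 resident
// blocks per SM; ptxas derives the register cap from these targets.
__global__ void __launch_bounds__(256, 2) bounded_kernel(float* out)
{
    // ... kernel body ...
}
```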


To avoid misunderstandings: Despite its name, ptxas is an optimizing compiler.

With a sufficient number of empirical observations, one can create a rough model and, as an extension of that, develop heuristics for picking the "optimal" register count. Very conveniently for CUDA programmers, the results of such efforts have already been incorporated into ptxas. So (with apologies to D. Knuth) the rules for manually picking a register count are:

(1) Don’t do it.
(2) [Experts only!] Don’t do it yet.

Is it possible to find exceptions to the above rules, by successfully devising a custom strategy for picking the register count that benefits app-level performance across more than one generation of GPUs? Yes. Is it likely? Not in my experience.


haha, actually, I have been writing GEMM kernels for several years. My job is to optimize them, so I have to set a proper number. From what I know, the producer can use 24 or 32 registers, and the consumer can use 224, 240, or 256 registers (learnt from FA3).
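For readers wondering where those per-warpgroup budgets come from: on Hopper, warp-specialized kernels such as FA3 rebalance registers between warpgroups at runtime via the PTX setmaxnreg instruction, which CUTLASS wraps in a pattern like the sketch below (counts must be multiples of 8 in [24, 256]; requires sm_90a):

```cuda
// Producer warpgroup releases registers (e.g. down to 24 or 32).
template <int RegCount>
__device__ void warpgroup_reg_dealloc() {
    asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}

// Consumer warpgroup claims a larger budget (e.g. 224/240/256).
template <int RegCount>
__device__ void warpgroup_reg_alloc() {
    asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}
```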