Register management: 8.9 vs 9.0+

Hello community,
I think I understand the conceptual changes in register file organization between 8.9 and 9.0 and I understand that different allocation granularity can cause more leaks. But… Is there a way to fix this or at least try to avoid this problem? The same problem also occurs in Blackwell.
The same code compiled for 8.9 with a register limit of 128 does not spill, but compiled for 120 it leaks about 120 bytes; it spills 112 bytes using ccap 100a and 180 bytes when compiled for ccap 90.
Of course, even when I prepare a binary compiled for ccap 89 but with PTX embedded, so that JIT compilation is possible, I still see spills in the profiler. The end result is that the same kernel is slower on a 5080 than on a 4070 Ti because of these unexpected spills. I can always increase the number of allowed registers (with no limit, compiling for 120 requires 164 registers), but that in turn reduces the maximum number of resident threads, which makes the whole thing even slower in my case. We can of course say that newer cards are faster overall, but that is just a “vulgar display of power”: the final solution with spilling is less elegant and introduces unnecessary bottlenecks.
My questions are: have you encountered this problem? How did you solve it? Or did you simply stay on the older architecture?
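For anyone who wants to reproduce this kind of per-architecture comparison: below is a sketch of how one might compile the same kernel for both targets and have ptxas report register usage and spill traffic. The file and output names are placeholders, not from the original post; the flags themselves (`-maxrregcount`, `-Xptxas -v`, `__launch_bounds__`) are standard nvcc/ptxas options.

```shell
# Ada (sm_89), capped at 128 registers per thread.
# -Xptxas -v makes ptxas print, per kernel, the register count and the
# number of bytes of spill stores / spill loads.
nvcc -arch=sm_89 -maxrregcount=128 -Xptxas -v -cubin kernel.cu -o kernel_sm89.cubin

# Hopper (sm_90), same cap, for a side-by-side comparison of the ptxas report:
nvcc -arch=sm_90 -maxrregcount=128 -Xptxas -v -cubin kernel.cu -o kernel_sm90.cubin
```

Instead of the global `-maxrregcount` flag, the limit can also be expressed per kernel in source with `__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)`, which lets the compiler derive a register budget from the desired occupancy rather than from a hard cap; that is often the cleaner way to explore the registers-vs-threads trade-off described above.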

Commenting on register pressure issues in code I know nothing about would be pure speculation and a waste of everybody’s time. I see two constructive approaches here:

(1) Post the code in question in this forum so others can analyze it and experiment with it. I realize you may not be able to do that because the code is confidential or otherwise access-restricted.

(2) File a bug with NVIDIA, attaching your code and describing the (estimated) performance loss due to the changes you observe. If it is more than 5% it will probably be considered actionable. Access to the contents of a bug report is limited to relevant NVIDIA personnel.

Yes, of course.
The question was rather, “What architectural changes between versions 8.9 and 9.0 necessitate a 50% increase in register allocation?”

In all likelihood there is a small number of people who can answer that question with any degree of authority after analyzing your code. These people are unlikely to be active in these forums. Based on historical observation, the compiler engineers responsible for register allocation would be highly unlikely to discuss the “why” here, even if they were active in this forum.

If you insist on speculation (not that this solves anything or benefits anyone), you could consider, for example:

(1) Design changes in the microarchitecture and / or the compiler lead to higher register pressure {on average | in a few cases}; however, across the large universe of code running on GPUs they provide better performance than would be possible without these changes.

(2) There are bugs or suboptimally tuned heuristics in the { register allocation | instruction scheduling } portion of the compiler.

(3) There is a hardware bug in initial Blackwell GPUs that needs to be worked around and the workaround requires that a higher than desired number of registers be used.

etc. etc. etc.

What do you mean by leaking?

Just that more registers are needed?
That they are non-accessible or non-allocatable like leaked memory?
Or are you talking about spilling to local memory?

Look at the SASS code to see how registers are used.
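To act on the suggestion above, one can dump the machine code from the compiled binary with the CUDA binary utilities. A sketch (the cubin name is a placeholder carried over from the earlier compile commands; `-plr` is nvdisasm's live-range option, to the best of my knowledge):

```shell
# Dump SASS for all kernels in a cubin/fatbin; register operands (R0, R1, ...)
# and local loads/stores (LDL/STL, i.e. spill traffic) are visible directly:
cuobjdump --dump-sass kernel_sm90.cubin

# nvdisasm can additionally annotate register live ranges, which helps
# pinpoint where pressure peaks:
nvdisasm -plr kernel_sm90.cubin

# Count spill instructions as a rough proxy for spill traffic:
cuobjdump --dump-sass kernel_sm90.cubin | grep -c -E 'LDL|STL'
```

Comparing the instruction counts and the LDL/STL sites between the sm_89 and sm_90 builds usually narrows down which part of the kernel the allocator treats differently.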

Is this true for one specific kernel or a general trend?

I’m sorry for the confusion, I meant ‘spills’, not ‘leaks’. Lost in translation. The question is not about the spills themselves, but rather about how register allocation works, which seems to be arcane knowledge.
It seems I am not the only one who asked about that (see below).
I will try to post an example later today or after the weekend.
In the meantime, could someone point me to a document where I can learn a bit more about the Hopper architecture, especially the changes to register allocation?

We can either wait for an updated toolkit, or create a minimal example to find workarounds or to file a bug report.

As the basic SM architecture (as far as we know about it) is the same, I would not believe that Nvidia generally increased the required register count on purpose without any benefit.

@njuffa listed some possible reasons, and they hint that it could happen only for certain code: either because of some specific instructions or because some heuristic kicks in.