Hopper __launch_bounds__ and setmaxnreg conflicts?

Description

I have a GeMM kernel with 12 warps with specilization, and 8 warps as two consumers. The tile size is 192x192.

For the producer warpgroup, I use

cutlass::arch::warpgroup_reg_dealloc<40>();

For the consumer warpgroups, I use

cutlass::arch::warpgroup_reg_alloc<232>();

My NVCC version is 12.6.77. When I compile the kernel, it reports 170 register counts and the kernel cannot be launched.

It can be launched if I use __launch_bounds__(384, 1) for the reg usage hint, which consumes 168 registers then.

However, the kernel seems to hang at the first wgmma instruction after launching. My GPU is H800.

Another clue is that if I remove the reg_alloc and reg_dealloc lines and only keep the __launch_bounds__, the compiler reports severe register spilling.

My question is, are these two things can work with each other? What happend in this process? Why the kernel hangs if I use both of them?

there does appear to be an issue if you use dynamic register allocation on Hopper and also force the maximum register count to be below a certain level. I don’t have further information.

TBH, I’m not sure why providing a dynamic register allocation of 232 makes sense if you are independently causing the compiler to limit register allocation to e.g. 168.

You haven’t provided a compilable test case. But if you want to do that, you may wish to file a bug.

It appears that some waprs hang at the setmaxregcount point. That’s why warpgroup mma cannot be issued. __launch_bounds__ and setmaxregcount seem not to do the same thing. The former sets a limit for the whole cuda block and the latter gives a hint of register usage for each warpgroup. I then found that FA3 also adopts similar method: flash-attention/hopper/flash_fwd_kernel.h at 0dfb28174333d9eefb7c1dd4292690a8458d1e89 · Dao-AILab/flash-attention · GitHub

There are some other clues. If I decrease the setmaxregcount in the code, such as 160, then the kernel can be launched with much register spilling. But if increasing to 168, the kernel will hang.

Are there any debug methods to handle this? I also found that cuBLAS can handle this, the register count is 168 given 192x192 tile which is the exact hardware limit. Is there some magic?