Description
I have a GeMM kernel with 12 warps with specilization, and 8 warps as two consumers. The tile size is 192x192.
For the producer warpgroup, I use
cutlass::arch::warpgroup_reg_dealloc<40>();
For the consumer warpgroups, I use
cutlass::arch::warpgroup_reg_alloc<232>();
My NVCC version is 12.6.77. When I compile the kernel, it reports 170 register counts and the kernel cannot be launched.
It can be launched if I use __launch_bounds__(384, 1)
for the reg usage hint, which consumes 168 registers then.
However, the kernel seems to hang at the first wgmma
instruction after launching. My GPU is H800.
Another clue is that if I remove the reg_alloc
and reg_dealloc
lines and only keep the __launch_bounds__
, the compiler reports severe register spilling.
My question is, are these two things can work with each other? What happend in this process? Why the kernel hangs if I use both of them?
there does appear to be an issue if you use dynamic register allocation on Hopper and also force the maximum register count to be below a certain level. I don’t have further information.
TBH, I’m not sure why providing a dynamic register allocation of 232 makes sense if you are independently causing the compiler to limit register allocation to e.g. 168.
You haven’t provided a compilable test case. But if you want to do that, you may wish to file a bug.
It appears that some waprs hang at the setmaxregcount point. That’s why warpgroup mma cannot be issued. __launch_bounds__
and setmaxregcount
seem not to do the same thing. The former sets a limit for the whole cuda block and the latter gives a hint of register usage for each warpgroup. I then found that FA3 also adopts similar method: flash-attention/hopper/flash_fwd_kernel.h at 0dfb28174333d9eefb7c1dd4292690a8458d1e89 · Dao-AILab/flash-attention · GitHub
There are some other clues. If I decrease the setmaxregcount
in the code, such as 160, then the kernel can be launched with much register spilling. But if increasing to 168, the kernel will hang.
Are there any debug methods to handle this? I also found that cuBLAS can handle this, the register count is 168 given 192x192 tile which is the exact hardware limit. Is there some magic?