I want to learn more about the details of the PTX instruction setmaxnreg, since when I was programming a warp-specialized kernel on a Hopper GPU I ran into deadlocks.
I checked every detail of the memory barriers I used, but that didn't help. Then I accidentally changed some settings on setmaxnreg and the problem went away.
In my program I use 12 warps (3 warp groups). I used TileLang for prototyping; the pseudocode looked like this:
if (thread_id < 128) {
    setmaxnreg.inc.sync.aligned.u32 224;
} else if (thread_id < 256) {
    setmaxnreg.inc.sync.aligned.u32 224;
} else {
    setmaxnreg.dec.sync.aligned.u32 80;
}
and the compiled kernel uses 168 registers per thread (as ptxas shows). I think this is the maximum number of registers for a block with 384 threads, since 65536 / 384 ≈ 170 and the register count must be a multiple of 8.
Then two warp groups each request 224 registers, while the remaining warp group decreases to 80 registers. That is (224 * 2 + 80) * 128 = 67584 registers in total, which exceeds the 65536-register file, so warp group 0 and warp group 1 race for registers, and that causes the deadlock.
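To double-check my own arithmetic, here is a small sketch of the accounting (the 65536 register-file size and multiple-of-8 rounding are my assumptions about Hopper, taken from the calculation above; `block_budget` is just a helper name I made up):

```python
# Sanity check of the register accounting in the question.
REG_FILE = 65536          # assumed 32-bit registers per SM on Hopper
THREADS_PER_WG = 128      # one warp group = 4 warps of 32 threads

def block_budget(threads_per_block):
    """Max per-thread registers for a block, rounded down to a multiple of 8."""
    return (REG_FILE // threads_per_block) // 8 * 8

# 384 threads per block: 65536 / 384 = 170.67 -> rounds down to 168
print(block_budget(384))  # -> 168

# After warp specialization: two groups raised to 224, one lowered to 80.
total = (224 * 2 + 80) * THREADS_PER_WG
print(total)              # -> 67584, more than 65536, so inc can block forever
```

So under these assumptions the requested allocation really is over budget, which would explain why setmaxnreg.inc never unblocks.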
So my question is: the documentation says "The setmaxnreg.inc instruction blocks the execution until enough registers are available in the CTA's register pool." What happens while the warp is waiting? Will it spill its in-use registers, or will other warps or warp groups spill their registers? Or should programmers issue setmaxnreg.dec.sync.aligned.u32 first to make more registers available?
Also, I want to confirm that deadlock is possible when setting an inappropriate maxnreg via setmaxnreg.inc.
Thank you.
https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg