About setmaxnreg

I want to learn more about the details of the PTX instruction setmaxnreg since wen I was programming a warp-specialized kernel on a Hopper GPU I got some dead-locks.

I looked at every details of memory barriers I used but it didn’t work. I then accidentally change some settings on setmaxnreg then the problem is solved.

In my program I use 12warps (3warp groups). I used tilelang for prototying, the pseudocode looked like this:

if(thread_id < 128) {
    setmaxnreg.inc.sync.aligned.u32 224;
} else if (thread_id < 256) {
    setmaxnreg.inc.sync.aligned.u32 224;
else {
    setmaxnreg.dec.sync.aligned.u32 80;
}

and the compiled kernel uses 168 registers per thread.(as ptx-as shows, and I think this is the max number of registers for a block with 384 threads, since 65536/384=170, and the number of registers should be a multiple of 8.

Then when 2 warp groups both request 224 registers, and the other warp group will decrease registers to 80. It is (80 + 224 * 2)* 128=67584 registers in total, so warp group0 and warp group1 have a race for registers which causes the deadlock.

So my question is , as the documentation says`The setmaxnreg.inc instruction blocks the execution until enough registers are available in the CTA’s register pool. What will happen while the warp is waiting. Would it spill its registers in-use; or will other warp or warp groups spill there registers(or should programmers use setmaxnreg.dec.sync.aligned.u32 to make more available registers?

Also, I want to confirm that dead lock is possible when setting inappropriate maxnreg via setmaxnreg.inc.

Thank you .

https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg

I don’t think there is any spilling while waiting.

Do you just want to get a working configuration? Then reduce either the 224 or the 80 or increase the available registers or reduce the number of warps.

Do you want to reuse registers, i.e. one warp group uses them first, while the other blocks and then they transfer?

Or do you just want to understand the workings better?

I reduce the 224 & 80 to 200 & 64 after I compute the available registers.

And yes, I want to understand the workings better. Since setnmaxreg.inc would blocks the execution while setnmaxreg.dec would not, maybe blocking some warp groups with setmaxnreg.inc while other warp groups with setmaxnreg.dec are decreasing its registers is the way to reuse registers ?

Yes, they have to be used in combination.

Your pseudocode seemed to do that.

Is 200 a valid number, if not try 192. I don’t remember the step size

Thanks!

The step size is 8, as documented.