SASS trivia -- max size of immediate in [reg+immediate] global load/store?

I have some kernels that perform loads/stores using a signed compile-time constant stride.

Inspection of SASS shows that the immediate appears to be capped at ~25 bits (sign + 6 nybbles)…

I ask because PTXAS starts gobbling registers when it transitions from [reg+immediate] to basic [reg] addressing.

Out of curiosity, can anyone confirm that the number of bits in the immediate is capped to less than 32 bits?

This is on sm_52.

If I read the source code for Scott Gray’s MaxAS correctly, the immediate offset is allotted 24 bits. I would expect Scott to take note of this thread and either confirm or refute these findings.

Great, thanks. That confirms it!

njuffa beat me to it. It’s a 24-bit two’s complement value (so it can be negative). I’ve noticed in the past that the compiler will sometimes do the offset with IADDs for no particularly good reason… but this was a while ago and I don’t think I’ve seen it since.
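For anyone who wants to check a stride before compiling, here is a minimal host-side sketch of the range a 24-bit two's complement field can hold. The 24-bit width is taken from the posts above; the helper name `fits_in_imm24` is made up for illustration, and treating the immediate as a byte offset is my assumption.

```cpp
#include <cstdint>

// Width of the [reg+imm] offset field as reported in this thread (sm_52).
constexpr int kImmBits = 24;

// Range of a signed two's complement value of that width:
// -2^23 .. 2^23 - 1, i.e. -8388608 .. 8388607.
constexpr std::int64_t kImmMin = -(std::int64_t{1} << (kImmBits - 1));
constexpr std::int64_t kImmMax = (std::int64_t{1} << (kImmBits - 1)) - 1;

// Hypothetical helper: true if the offset (assumed to be in bytes)
// can be encoded directly in the immediate field.
constexpr bool fits_in_imm24(std::int64_t byte_offset) {
    return byte_offset >= kImmMin && byte_offset <= kImmMax;
}

static_assert(fits_in_imm24(0x7FFFFF), "largest positive offset fits");
static_assert(!fits_in_imm24(0x800000), "one past the maximum does not fit");
static_assert(fits_in_imm24(-0x800000), "most negative offset fits");
```

Under these assumptions, any access whose constant offset from the base pointer falls outside that range needs some other way of forming the address, such as an extra add into a register.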

Ah, that explains it.

You can see the compiler degrading from pure [reg+immediate] to pure IADD + [reg] ops as the stride increases.

Pure [reg+imm]:

Pure [reg] with a long sequence of register-gobbling IADDs that precalculate the [reg] addresses:

In the intermediate sequences the compiler performs [reg+imm] ops around “center points”. Very clever:
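To get a feel for why the "center point" approach helps, here is a toy model, not ptxas's actual algorithm, which I don't know. It assumes the accesses sit at byte offsets `i * stride_bytes` from one pointer and greedily counts how many base registers are needed if each base sits at the center of a window reachable by a signed 24-bit immediate. The function name and the greedy grouping are my own assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Signed 24-bit immediate range, as discussed in this thread.
constexpr std::int64_t kLo = -(std::int64_t{1} << 23);
constexpr std::int64_t kHi = (std::int64_t{1} << 23) - 1;

// Toy model (hypothetical): number of "center point" base registers needed
// to reach n_accesses offsets of the form i * stride_bytes using [reg+imm].
int count_base_pointers(std::int64_t stride_bytes, int n_accesses) {
    std::vector<std::int64_t> offsets;
    for (int i = 0; i < n_accesses; ++i) {
        offsets.push_back(i * stride_bytes);
    }
    std::sort(offsets.begin(), offsets.end());  // also handles negative strides

    int bases = 0;
    std::size_t i = 0;
    while (i < offsets.size()) {
        // Place the center so the lowest uncovered offset uses the most
        // negative immediate; the window then extends as far up as possible.
        const std::int64_t center = offsets[i] - kLo;
        ++bases;
        while (i < offsets.size() && offsets[i] - center <= kHi) {
            ++i;
        }
    }
    return bases;
}
```

In this model a 4 KiB stride with 16 accesses needs a single base, while a 16 MiB stride needs one base per access, which is at least consistent with the register growth described above.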

I really don’t like how ptxas precomputes so many [reg] pointers even when there are launch bounds in place.

This eats up a huge number of registers. The reduction in occupancy is probably a net loss in performance.

I’m seeing the exact same kernel with different compile-time constant strides balloon from 158 to 255 registers.

Perhaps this observation explains some of the unexplained spilling issues others have seen elsewhere?