This is still true in Ampere. IIRC, some tools report or store the total number of registers used rather than just the regular registers.
I just looked at the paper that accompanies the talk:
"We found that the cuobjdump -dump-resource-usage command (that prints a kernel’s register usage) reports a count that includes both regular and uniform registers. The upper limit of total registers used in any CUDA kernel is 256, unchanged from Volta.
We confirmed this result by patching the register count in the section header of a CUDA kernel to values above 256, and determining that cuobjdump only recognizes 256 registers at most."
(https://arxiv.org/pdf/1903.07486.pdf)
If this is true (and not just a limitation of the tools), the reason could be (just guessing) that, at the hardware level, the regular registers of each lane hold a copy of all uniform registers, giving faster access to instructions that use both regular and uniform registers.
But I don't believe so. If that were the case, many more SASS instructions would be able to use uniform registers, and reads would use the same encoding as reads of regular registers. In recent CUDA architectures the SM does not have to check instruction latencies and dependencies, because this information is stored in the instruction words, so the SM would not need to know whether a register being read is a general or a uniform register. So the shadow register theory is unlikely.
Another theory would be that the compiler uses several intermediate representations, and the conversion of some instructions to the uniform datapath was added very late in the pipeline, at a point where the assignment of data to registers, and thus the number of registers, was already fixed.
To prevent the compiled program from using more than 255 general registers, the maximum total passed to the compiler would then have been limited to 255. But in this case, -maxrregcount would also limit the total number of registers.
A third theory is that the number of registers put into the object file was limited to 256 (not 255?) for compatibility reasons and it has nothing to do with the capabilities of the compiler or the GPUs.
We could hack together a few SASS instructions that fill and read back all 255+63 registers and try it out.
Does anybody know the maximum number of uniform registers per SM? Or per block? We know the limit of 64 per warp due to instruction encoding (presumably 63 usable ones plus a zero register, by analogy with the 255 regular registers). What if several warps run in each partition? Then the uniform registers would have to be stored in a register file.
In the Hot Chips 31 presentation of the Turing architecture, one slide showed a regular register file of 64 kB (i.e. 4 bytes * 32 lanes * 512) per partition and a uniform register file of 2 kB (i.e. 4 bytes * 1 lane * 512) per partition. Going by this slide, it should actually be possible to use as many uniform registers as regular registers per thread (they are shared within the warp, of course, so the file is 32x smaller), up to a maximum of 63 due to instruction encoding.
("RTX on—The NVIDIA Turing GPU", listed on Semantic Scholar; the presentation is also on YouTube, and there is a PDF version.)
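The slide's numbers can be checked with a few lines of arithmetic. This is only a sketch of my reading of the slide: the constant names are made up for illustration, and "512" is taken to be the number of register entries per partition in both files.

```python
# Values as read from the Hot Chips 31 Turing slide (per SM partition).
# The names are illustrative; "ENTRIES" is my interpretation of the 512.
BYTES_PER_REGISTER = 4
LANES_PER_WARP = 32
ENTRIES = 512

# Regular file: one 4-byte value per lane and entry.
regular_file_bytes = BYTES_PER_REGISTER * LANES_PER_WARP * ENTRIES
# Uniform file: one 4-byte value per entry, shared by the whole warp.
uniform_file_bytes = BYTES_PER_REGISTER * 1 * ENTRIES

print(regular_file_bytes // 1024, "KiB regular")            # 64 KiB regular
print(uniform_file_bytes // 1024, "KiB uniform")            # 2 KiB uniform
print(regular_file_bytes // uniform_file_bytes, "x ratio")  # 32 x ratio
```

Both files have the same number of entries; the uniform file is smaller only by the factor of 32 lanes.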
Having the same overall register file size (after dividing by 32) means that there is no additional criterion for the occupancy calculation and no separate logic for dividing the register file among warps.
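To illustrate the occupancy point, here is a rough sketch. It assumes 512 entries per partition in both register files (my reading of the slide), and it ignores allocation granularity and every other occupancy limit. The function names are hypothetical, not part of any NVIDIA tool.

```python
ENTRIES_PER_PARTITION = 512  # assumed for both register files (from the slide)


def warp_limit(registers_needed: int, entries: int = ENTRIES_PER_PARTITION) -> int:
    """Warps that fit in one file if each warp needs `registers_needed` entries."""
    if registers_needed <= 0:
        raise ValueError("registers_needed must be positive")
    return entries // registers_needed


def resident_warps(regular_per_thread: int, uniform_per_warp: int) -> int:
    """Warps per partition when both files are considered (simplified model)."""
    return min(warp_limit(regular_per_thread), warp_limit(uniform_per_warp))


# Kernel with 64 regular registers per thread and 20 uniform registers per warp:
print(warp_limit(64), warp_limit(20), resident_warps(64, 20))  # 8 25 8
# Extreme cases: 255 regular registers, or 63 uniform registers.
print(warp_limit(255), warp_limit(63))                          # 2 8
```

In this simplified model the uniform file only becomes the binding limit if a kernel uses more uniform registers than regular ones, which would fit the idea that no separate occupancy rule is needed.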