How are the registers in a kernel mapped into hardware resources?

According to the programming guide, a thread can use at most 255 registers, and a block can have at most 1024 threads. But a block only has 64K registers available, not 255K, so the two upper bounds cannot both be reached at the same time. Is that correct?
If so, a 1024-thread block can only have 64 registers per thread, not 255. So there are different mappings under the two circumstances. But how is it done?
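Just to check my arithmetic (a quick host-side sketch; I am assuming the 64K figure is the size of the register file available to a block):

```
#include <cstdio>

int main() {
    const int regs_available      = 65536; // 64K registers (assumed limit)
    const int max_threads         = 1024;  // maximum threads per block
    const int max_regs_per_thread = 255;   // per-thread limit from the guide

    // A full 1024-thread block leaves each thread at most:
    printf("%d registers per thread\n", regs_available / max_threads); // 64
    // And at 255 registers per thread, at most this many threads fit:
    printf("%d threads\n", regs_available / max_regs_per_thread); // 257, i.e. 256 in whole warps
    return 0;
}
```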

You are correct.

The hardware limit is the number of registers per SM. An SM may run blocks, possibly from different kernels, at the same time, each needing a certain number of registers per thread.

What exactly do you mean by “how is it done”?

There are only a few discrete values for how many registers a thread can use. The Occupancy Calculator (included as a button in the top middle of Nsight Compute, or available as an Excel table) can show you the legal combinations of thread count and register count for different compute capabilities.
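You can also query occupancy programmatically. A minimal sketch using the CUDA runtime occupancy API (the kernel here is just a made-up example):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {
    out[threadIdx.x] = 2.0f * threadIdx.x;
}

int main() {
    int numBlocks = 0;
    // How many blocks of 256 threads can be resident per SM, given this
    // kernel's register and shared memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, dummyKernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("Resident blocks per SM at 256 threads/block: %d\n", numBlocks);
    return 0;
}
```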

You can limit the number of registers with --maxrregcount or with __launch_bounds__, and display the actual usage by compiling with -Xptxas --verbose.
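For example (a minimal sketch; the kernel and the chosen bounds are just illustrative):

```
// Promise the compiler at most 256 threads per block and ask for at
// least 2 resident blocks per SM, so it budgets registers accordingly.
__global__ void __launch_bounds__(256, 2) scaleKernel(float *data, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= s;
}
```

Compiling with e.g. nvcc -Xptxas --verbose scale.cu then prints the register and shared memory usage per kernel; --maxrregcount=<n> instead caps the register count for the whole compilation unit.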

For each compiled kernel, the resource usage (registers, shared memory) is stored and taken into account when the kernel is launched.
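You can also read back this stored usage at runtime. A small self-contained sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *data) { data[threadIdx.x] += 1.0f; }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, someKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```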

The hardware takes care of whether a given part of the register file is used by warps of the same block or of different blocks.

Say a kernel function uses 64 registers: r0, r1, …, r63. When the kernel is running, thread 0 will use physical registers phy_r0, phy_r1, …, phy_r63, and thread 1 will use phy_r64, phy_r65, …, phy_r127.
If the kernel uses 128 registers instead, and thread 0 uses r0, r1, …, r127, mapped to phy_r0 through phy_r127, then r0 of thread 1 must be mapped to phy_r128. Did I understand that correctly?
But in the SASS code, is there such a per-kernel mapping table? In the first kernel, r0 of thread 1 is phy_r64, and in the second kernel, r0 of thread 1 is phy_r128. How is that done? By the compiler embedding a special mapping table in the SASS, or by the hardware?
Or did I misunderstand it completely, and the mapping is always the same, with the contents of all registers saved on a stack whenever warps are scheduled in and out? I think that would be too expensive for a GPU, but I might be very wrong.

Hi XXD,

this mapping is done in the hardware.
And it is a mapping. It would be too slow to store and restore the registers in a memory stack.

(However, we do not know whether all registers are internally transferred to a (fast) buffer and back. Probably not. It depends on how the pipelines within the SM work, e.g. the timing of the register accesses, how many read ports the register file has, and whether it is ‘easier’ to transfer all registers to a buffer with more read ports. Switching between warps has no additional overhead, so such a transfer would have to be possible every cycle.)

Each SM is partitioned into 4 SM Partitions.
Each SM Partition has 16K registers.
Warps of 32 threads are executed as a unit; while a warp executes, its 32 threads are called lanes.
So there are 512 registers per lane per SM Partition.

Each SM Partition has several resident warps, which can be from the same or different kernels and from the same or different blocks.

So you actually need, for each resident warp, a mapping from the kernel's logical registers to the 512 physical registers available per lane.

This mapping only has to be done for the higher bits of the register number; the lower bits are simply kept the same between logical and physical register number. That is one of the reasons why only certain registers-per-thread combinations are available. The mapping is not complicated, just a few logic gates (presumably; I do not know Nvidia internals).
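A purely illustrative model of such a base-plus-offset mapping (the warp-slot count and the base table are my assumptions, not Nvidia's actual design):

```
#include <cassert>

const int REGS_PER_LANE  = 512; // physical registers per lane per SM Partition
const int NUM_WARP_SLOTS = 16;  // assumed number of resident warp slots

// Filled in once per warp slot when a block is scheduled onto the
// partition; each resident warp gets a contiguous, aligned slice.
int warp_reg_base[NUM_WARP_SLOTS];

// Per-access translation: the warp's base supplies the high bits,
// the logical register number passes through as the low bits.
int physicalReg(int warpSlot, int logicalReg) {
    int phys = warp_reg_base[warpSlot] + logicalReg;
    assert(phys < REGS_PER_LANE);
    return phys;
}
```

With suitably aligned bases the addition never carries into the low bits, so it reduces to wiring the base into the high bits of the physical register number, which matches the high-bits-only mapping described above.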

The SASS code contains register numbers between R0…R255 (R255 is not an actual register, but the zero register RZ, because 0 is often needed as a constant).
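You can see this yourself by disassembling any small kernel (file names are examples; the exact SASS varies by architecture and compiler version):

```
// zero.cu -- stores the constant 0, so the disassembly is likely to
// reference the zero register RZ as a source operand.
__global__ void writeZero(int *out) {
    out[threadIdx.x] = 0;
}
```

Compile with nvcc -c -arch=sm_70 zero.cu and run cuobjdump --dump-sass zero.o to see the R0…RZ register numbers in the generated SASS.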

Thanks a lot for your reply. I think I’ve got the gist.