What is the maximum number of registers per thread that does not cause a slowdown?

Hello everyone,
The A100 has a 256 KB register file per SM. We know that a warp can't be scheduled if the register file is full, which causes a slowdown. I would like to learn how to calculate the maximum number of registers per thread that does not cause a slowdown.

If we assume there are 4 active warps and calculate how many registers a thread can use, we get 256 KB / 128 threads / 4 bytes per register = 512. But 512 is too big to be true :)

If we instead take into account that there are 64 warp slots in the SM, we get 256 KB / 2048 threads / 4 bytes per register = 32. This figure makes more sense to me :)

Which one is right? Are there any registers other than the 32-bit ones that are not visible to users?

For the sake of simplicity, let’s say I don’t use shared memory.

Thanks

According to the documentation, an A100 SM has 64K = 65536 registers and supports up to 2048 threads, and the number of registers per thread is limited to 255.
So, to have 2048 threads per SM, each thread may use at most 32 registers. If you only have 256 threads per SM, each thread can use all 255 registers.
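
As a rough illustration of that arithmetic, here is a minimal Python sketch. The constants are the A100 figures quoted above; the function name `max_registers_per_thread` is made up for this example, and the sketch ignores register allocation granularity, so treat the results as upper bounds.

```python
# A100 per-SM figures quoted in this thread.
REGISTERS_PER_SM = 64 * 1024       # 65536 32-bit registers = 256 KB / 4 bytes
MAX_THREADS_PER_SM = 2048          # 64 warp slots * 32 threads
MAX_REGISTERS_PER_THREAD = 255     # per-thread limit


def max_registers_per_thread(resident_threads: int) -> int:
    """Upper bound on registers per thread for a given number of resident threads.

    Hypothetical helper for illustration; ignores allocation granularity.
    """
    if not 0 < resident_threads <= MAX_THREADS_PER_SM:
        raise ValueError("resident_threads must be in 1..2048")
    # Split the register file evenly, then apply the per-thread cap.
    return min(REGISTERS_PER_SM // resident_threads, MAX_REGISTERS_PER_THREAD)


if __name__ == "__main__":
    for threads in (2048, 1024, 512, 256, 128):
        print(threads, max_registers_per_thread(threads))
```

With 2048 resident threads this gives 32, and with 256 or fewer it is capped at 255, which is why the 512 from the 4-warp calculation can never be reached.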

The slowdown comes from registers spilling to local memory, especially once the spilled data no longer fits in the L1 cache (visible e.g. in Nsight Compute under local memory). You can pass the verbose option to ptxas through nvcc (-Xptxas -v) to get spill information for your kernels. Your calculation is correct, apart from the additional limit of 255 registers per thread and the fact that only certain register counts are possible (e.g. divisible by 4, except for 255); you can get the exact possible values from the occupancy calculator. There are also uniform registers (shared by one warp) and special registers (e.g. for threadIdx / blockIdx / clock / …), but for general-purpose calculations the 256 KB register file is what you get. Variables narrower or wider than 32 bits are also held in 32-bit registers.
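
Going the other way, once the compiler has reported a register count for a kernel, you can estimate how many warps fit on an SM. The Python sketch below is only a rough upper bound under the A100 figures from this thread: the helper name `max_resident_warps` is invented for illustration, and it deliberately ignores allocation granularity, shared memory, and block-size limits, which the occupancy calculator does take into account.

```python
REGISTERS_PER_SM = 64 * 1024   # 65536 32-bit registers per A100 SM
WARP_SIZE = 32                 # threads per warp
MAX_WARPS_PER_SM = 64          # warp slots per A100 SM


def max_resident_warps(registers_per_thread: int) -> int:
    """Rough upper bound on resident warps per SM, limited by registers only.

    Hypothetical helper; real allocation granularity may lower the result.
    """
    if not 1 <= registers_per_thread <= 255:
        raise ValueError("registers_per_thread must be in 1..255")
    registers_per_warp = registers_per_thread * WARP_SIZE
    return min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // registers_per_warp)


if __name__ == "__main__":
    for regs in (32, 40, 64, 128, 255):
        warps = max_resident_warps(regs)
        print(f"{regs} regs/thread -> at most {warps} warps "
              f"({warps / MAX_WARPS_PER_SM:.0%} occupancy)")
```

Lower occupancy from a high register count is not a slowdown by itself; the spilling described above is usually the bigger cost, so it is worth checking both numbers.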