For example, I have 16 warps per SM, and I have explicitly allocated a static array in every thread on that SM. At a certain point in the program I perform a reduction, so I only need 2 of the 16 warps. I expect performance would be higher if each of those 2 warps could use half of the registers available on the SM.
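To make the scenario concrete, here is a minimal sketch of the kind of kernel I mean. It is not my real code: the names `block_sum` and `run_block_sum`, the array size `ITEMS`, and the single-block launch are all assumptions made for illustration. Every thread fills a small static array, and the reduction phase then keeps only the first 2 of the 16 warps busy.

```cuda
#include <cuda_runtime.h>

constexpr int WARPS_PER_BLOCK = 16;                   // 16 warps, as in the question
constexpr int THREADS         = WARPS_PER_BLOCK * 32; // 512 threads per block
constexpr int ACTIVE          = 2 * 32;               // 2 warps take part in the reduction
constexpr int ITEMS           = 8;                    // hypothetical per-thread array size

__global__ void block_sum(const float* in, float* out)
{
    __shared__ float partial[THREADS];
    __shared__ float reduced[ACTIVE];

    // Per-thread static array. With constant indices after unrolling,
    // the compiler typically keeps a small array like this in registers.
    float local[ITEMS];
    #pragma unroll
    for (int i = 0; i < ITEMS; ++i)
        local[i] = in[(blockIdx.x * ITEMS + i) * THREADS + threadIdx.x];

    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < ITEMS; ++i)
        sum += local[i];

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Reduction phase: only the first 2 warps do any work.
    // The other 14 warps just wait at the next barrier.
    if (threadIdx.x < ACTIVE) {
        float acc = 0.0f;
        for (int i = threadIdx.x; i < THREADS; i += ACTIVE)
            acc += partial[i];
        reduced[threadIdx.x] = acc;
    }
    __syncthreads();

    // One thread combines the 64 partial results.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < ACTIVE; ++i)
            total += reduced[i];
        out[blockIdx.x] = total;
    }
}

// Hypothetical host helper: sums THREADS * ITEMS ones with a single block.
float run_block_sum()
{
    const int n = THREADS * ITEMS;
    float* in  = nullptr;
    float* out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = 1.0f;

    block_sum<<<1, THREADS>>>(in, out);
    cudaDeviceSynchronize();

    float result = *out;
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

Calling `run_block_sum()` from a host `main()` launches one block of 512 threads. During the reduction phase 14 of the 16 warps are idle, and my question is whether the registers they hold could be handed to the 2 warps that are still working.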
For the scenario above, is it possible to allocate registers disproportionately across warps, i.e. to give some warps more registers while other warps get fewer or none?
At a finer granularity, is it possible to allocate a different number of registers to each thread within the same warp?
What is the granularity at which registers are allocated?