Register allocation behaviour

For example, I have 16 warps per SM, and every thread explicitly allocates a static array. At a certain point in the program I perform a reduction, so only 2 of the 16 warps are needed. Performance would be higher if those 2 warps could each use half of the SM's available registers.

  1. For the scenario above, is it possible to allocate registers disproportionately per warp (i.e. some warps get more registers while other warps get fewer or none)?

  2. At a finer granularity, is it possible to allocate a different number of registers to each thread within the same warp?

  3. What is the granularity at which registers are allocated?

Answers:

  1. No
  2. No
  3. It depends on the GPU architecture.