Why would recycling registers increase register count?

I was given a kernel that nvvp said used 46 registers. So I looked at the mathematics of the kernel, saw that the equivalent of 7 floats were being wasted, and re-wrote the kernel recycling a float4 and not using 3 floats that were not necessary for the computation. Now nvvp says that the kernel uses 48 registers.

How can this happen?

The PTX code you get out of nvcc is essentially in static single assignment (SSA) form, meaning each new value is given a new virtual register, in order of appearance. These are not hardware registers yet. You can use nvcc’s -keep option to keep the PTX files around after compilation and look at them.
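As a rough illustration of why source-level recycling tends not to matter, here is a minimal sketch with two hypothetical kernels (the names and the arithmetic are made up, not taken from the kernel in the question). One uses a fresh variable for every intermediate, the other reuses a single float4. Since every assignment produces a new value in SSA form, both would be expected to lower to very similar PTX, and the hardware register count is only decided later.

```cuda
// Hypothetical kernels for illustration only; not the kernel from the question.
// To inspect the result (assuming the file is saved as recycle.cu):
//   nvcc -keep -c recycle.cu        keeps the intermediate .ptx file
//   nvcc -Xptxas -v -c recycle.cu   prints the per-kernel register count from ptxas

// Version A: a fresh variable for every intermediate value.
__global__ void separate_temps(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 a = in[i];
    float4 b = make_float4(a.x * 2.0f, a.y * 2.0f, a.z * 2.0f, a.w * 2.0f);
    float4 c = make_float4(b.x + 1.0f, b.y + 1.0f, b.z + 1.0f, b.w + 1.0f);
    out[i] = c;
}

// Version B: a single float4 "recycled" for every step.
__global__ void recycled_temp(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 v = in[i];
    v = make_float4(v.x * 2.0f, v.y * 2.0f, v.z * 2.0f, v.w * 2.0f);
    v = make_float4(v.x + 1.0f, v.y + 1.0f, v.z + 1.0f, v.w + 1.0f);
    out[i] = v;
}
```

Comparing the two .ptx outputs and the two ptxas register counts side by side is usually more informative than reasoning about the source.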

The PTXAS component of the toolchain (or the JIT compiler of the driver, depending on scenario) then runs its own optimizer for making best use and reuse of the available register file, given the kernel launch bounds and other parameters. So PTXAS really isn’t an assembler, but actually an optimizing compiler.

Any changes you made would have affected the PTX code (where the total number of virtual registers may indeed have gone down somewhat). But PTXAS then ran its own optimization on this code and came up with a slightly worse register allocation for this particular PTX input.

Depending on your specific launch configuration, 46 vs. 48 registers may not make a runtime speed difference at all (i.e. the number of threads and blocks that can be resident per multiprocessor with this kernel is likely unaffected).
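One way to check this rather than guess is to ask the runtime. The following is a minimal sketch, assuming a placeholder kernel called my_kernel and a block size of 256 threads (both are assumptions; substitute your own kernel and launch configuration). It reports the register count the kernel was actually compiled with and how many blocks can be resident per multiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); replace with the kernel under test.
__global__ void my_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int blockSize = 256;  // assumed launch configuration

    // Query the resources the compiled kernel actually uses.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Ask how many blocks of this size fit on one multiprocessor
    // (no dynamic shared memory assumed).
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_kernel, blockSize, 0);
    if (err != cudaSuccess) {
        printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:        %d\n", attr.numRegs);
    printf("local memory per thread:     %zu bytes\n", attr.localSizeBytes);
    printf("resident blocks per SM:      %d (block size %d)\n",
           blocksPerSM, blockSize);
    return 0;
}
```

If the resident block count is the same for the 46-register and the 48-register build, the two extra registers are unlikely to change occupancy.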

You could try the --maxrregcount option of the compiler to force a register limit on a per-module (.cu file) basis, or alternatively set launch bounds on the kernel (__launch_bounds__) to allow more blocks to run simultaneously. Launch bounds also have the effect of forcing the register count down, at the expense of registers being spilled to the stack in local memory, which means more local memory and L1 cache use.
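A minimal sketch of both approaches, assuming a hypothetical kernel named bounded_kernel and illustrative numbers (256 threads per block, at least 4 resident blocks per multiprocessor, a cap of 40 registers); none of these values come from the original question, so tune them for your device:

```cuda
// Per-module register cap, applied to every kernel in this .cu file
// (file name and limit are assumptions for illustration):
//   nvcc --maxrregcount=40 -Xptxas -v -c bounded.cu
//
// -Xptxas -v makes ptxas print the register count and any spill
// loads/stores, so you can see what the limit cost you.

// Per-kernel alternative: promise the compiler that the kernel is never
// launched with more than 256 threads per block, and request that at least
// 4 blocks fit on a multiprocessor. The compiler derives a register limit
// from these two numbers and may spill to local memory to meet it.
__global__ void __launch_bounds__(256, 4)
bounded_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}
```

Launch bounds are usually the gentler option because they apply to a single kernel, whereas --maxrregcount constrains every kernel in the module. Launching a kernel with more threads per block than its declared maximum makes the launch fail.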