In my experience, these are noise effects caused by the interaction of different heuristics in the optimization phases of the compiler. What you are observing is a consequence of implementation artifacts that are likely to change from one version of the compiler to the next. That said, when trying to squeeze a couple of percent of performance out of a particular compiler targeting a specific architecture, it is often worthwhile to “wiggle” the register count targets a bit to exploit these artifacts. It should be understood that taking advantage of such artifacts is not robust in a longer-term software engineering sense: the gains are not portable between architectures and/or compiler versions.
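As a rough illustration of what "wiggling" the register target can look like in practice, here is a minimal sketch that builds the same kernel with different `__launch_bounds__` settings and times each variant. The kernel `saxpy`, the helpers `grid_size` and `time_variant`, and the chosen values (256 threads, 1/2/4 minimum blocks per SM) are placeholders of mine, not anything from the original question; a trivial kernel like this one uses too few registers for the setting to matter, so substitute your own register-heavy kernel. Compiling with `nvcc -Xptxas -v` reports the registers and spill bytes per variant, and `-maxrregcount` is the coarser per-file alternative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements.
__host__ __device__ constexpr int grid_size(int n, int threads)
{
    return (n + threads - 1) / threads;
}

// Placeholder kernel: y[i] = a * x[i] + y[i].
// MIN_BLOCKS is the second __launch_bounds__ argument. Raising it asks the
// compiler to keep more blocks resident per SM, which lowers the number of
// registers it may use per thread.
template <int MAX_THREADS, int MIN_BLOCKS>
__global__ void __launch_bounds__(MAX_THREADS, MIN_BLOCKS)
saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Time one variant with CUDA events; returns elapsed milliseconds.
template <int MAX_THREADS, int MIN_BLOCKS>
float time_variant(int n, float a, const float *x, float *y)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int blocks = grid_size(n, MAX_THREADS);

    // Warm-up launch so one-time setup cost is not included in the timing.
    saxpy<MAX_THREADS, MIN_BLOCKS><<<blocks, MAX_THREADS>>>(n, a, x, y);

    cudaEventRecord(start);
    saxpy<MAX_THREADS, MIN_BLOCKS><<<blocks, MAX_THREADS>>>(n, a, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 24;
    float *x = nullptr, *y = nullptr;
    if (cudaMalloc(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&y, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // Same source, three different register targets.
    printf("min blocks/SM = 1: %.3f ms\n", time_variant<256, 1>(n, 2.0f, x, y));
    printf("min blocks/SM = 2: %.3f ms\n", time_variant<256, 2>(n, 2.0f, x, y));
    printf("min blocks/SM = 4: %.3f ms\n", time_variant<256, 4>(n, 2.0f, x, y));

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Whichever variant wins on one compiler version and GPU may well lose on the next, so any such sweep needs to be re-run after a toolchain or hardware change.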
Some of the compiler optimization stages have no knowledge of register pressure, some may have an approximate notion of it, and only a few have an exact count of the registers used. The phases typically run in a particular order, although occasionally a phase may run more than once. Different phase orderings can result in different code with different performance, but most compilers use a fixed phase order that is based on particular dependencies and/or compiler engineer experience (also a heuristic of sorts).
So for most of these stages the impact of specific transformations on register pressure is not known (exactly), and the heuristics used may be too pessimistic or too optimistic in specific situations. The two primary ways in which the compiler tries to adjust to anticipated higher register pressure are re-computation of common sub-expressions and register spilling. In the case of the CUDA compiler, it usually attempts the former first, and when that fails the latter kicks in. This is based on the assumption that re-computation is often cheaper than a spill/fill cycle. But less spilling does not necessarily mean faster code, since a higher dynamic instruction count caused by re-computation could be more detrimental to performance than minor spilling in specific instances.
The code optimization problem as a whole is NP-hard (register allocation and instruction scheduling are classic examples), which is why much of it is driven by heuristics that keep compilation times acceptable.