Odd compilation output when limiting max used registers

GTX 980, Visual Studio 2012, Win 7 64 bit, CUDA 6.5

When compiled for 5.2 and setting max used register to 48:

1>      8 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
1>  ptxas info    : Used 48 registers, 2020 bytes smem, 376 bytes cmem[0], 12 bytes cmem[2]

Changing the max used register to 52:

1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>  ptxas info    : Used 48 registers, 2020 bytes smem, 376 bytes cmem[0], 12 bytes cmem[2]
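For reference, output like the above can be produced with a command along these lines (the file name and exact invocation are assumptions on my part, since the original build goes through the Visual Studio project settings; `-maxrregcount` and `-Xptxas -v` are the relevant nvcc flags):

```shell
# Limit the kernel to 48 registers and have ptxas report
# register, smem, cmem, and spill statistics:
nvcc -arch=sm_52 -maxrregcount=48 -Xptxas -v -c kernel.cu
```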

In the end, both compilations use 48 registers for that kernel, but in the first case there are spills, while in the second there are none.

Not sure what to make of this, but the running time does vary slightly between the two (the version with the max set to 52 is about 2.5% faster than the 48 version), so I am wondering what is going on.

Is there any lesson to be taken from this, or should I ignore?

Based on my experience these are noise effects caused by the combination of different heuristics in the optimization phases of the compiler. What you are observing is a consequence of implementation artifacts that are likely to change from one version of the compiler to the next. That said, when trying to squeeze a couple of percent of performance out of a particular compiler targeting a specific architecture, it is often worthwhile to “wiggle” the register count targets a bit to exploit these artifacts. It should be understood that taking advantage of such artifacts is not robust in a longer-term software engineering sense (it is not portable between architectures and/or compiler versions).
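As a sketch of such “wiggling”, one could sweep the register target from the command line and compare the ptxas statistics at each setting (the file name and the range of values are hypothetical; `-maxrregcount` and `-Xptxas -v` are the actual nvcc flags):

```shell
# Sweep the register-count target and collect ptxas usage/spill reports;
# pick the setting with the best measured run time, not the fewest spills.
for r in 40 44 48 52 56 64; do
    echo "=== maxrregcount=$r ==="
    nvcc -arch=sm_52 -maxrregcount=$r -Xptxas -v -c kernel.cu -o kernel_$r.o
done
```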

Some of the compiler optimization stages have no knowledge of register pressure, some of them may have an approximate notion of register pressure, and only a few have an exact count of the registers used. The phases typically run in a particular order, although occasionally a phase may run more than once. Various phase orderings can result in different code with different performance, but most compilers use a fixed phase order that is based on particular dependencies and/or compiler engineer experience (also a heuristic of sorts).

So for most of these stages the impact of specific transformations on register pressure is not known (exactly), and the heuristics used may be too pessimistic or too optimistic in specific situations. The two primary ways in which the compiler tries to adjust to anticipated higher register pressure are re-computation of common sub-expressions and register spilling. In the case of the CUDA compiler, it usually attempts the former first, and when that fails the latter kicks in. This is based on the assumption that re-computation is often cheaper than a spill/fill cycle. But less spilling does not necessarily mean faster code, since a higher dynamic instruction count caused by re-computation could be more detrimental to performance than minor spilling in specific instances.

Much of the code optimization problem is NP-hard (optimal register allocation, for example, is equivalent to graph coloring), which is why it is driven by heuristics that are used to ensure acceptable compilation times.

Interesting, I had indeed been “wiggling” the count trying to squeeze out a bit more performance…