--ptxas-options=-v info inquiry

I used ptxas and got some results:

Version 1:
64 bytes stack frame, 76 bytes spill stores, 264 bytes spill loads

Version 2 (I use a larger smem and…):
64 bytes stack frame, 72 bytes spill stores, 252 bytes spill loads

Version 3 (rewrote the code somehow…):
48 bytes stack frame, 100 bytes spill stores, 124 bytes spill loads
ptxas info : Used 128 registers, 416 bytes cmem[0]

We can see that the spill stores first decrease and then increase. I also noticed a large cmem value. Is larger cmem worse?

Thank you!!!

Also, how can I reduce spilling? Any ideas? Thanks!!!

You can find many questions on these forums discussing spill loads/stores.


Spill stores and loads are bad: they go through (slow) local memory. Most probably you are indexing arrays in C/C++ local memory with runtime-variable indices, which prevents the compiler from keeping them in registers.
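A minimal sketch of that pattern (the kernel and names here are hypothetical, for illustration only):

```cuda
// Hypothetical kernel: a small per-thread array indexed with a
// runtime-variable subscript cannot be mapped onto registers, so the
// compiler places it in (slow) local memory, producing spill traffic.
__global__ void histogram_like(const int *in, int *out, int n)
{
    int bins[16];                      // per-thread local array
    for (int i = 0; i < 16; ++i)       // constant indices: register-friendly
        bins[i] = 0;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int b = in[tid] & 15;          // index known only at run time
        bins[b] += 1;                  // dynamic indexing -> local memory
        out[tid] = bins[b];
    }
}
```

If every index into `bins[]` were a compile-time constant (e.g. after full unrolling), the compiler could keep the array entirely in registers.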


How big the negative impact from register spilling (due to running out of registers) is depends very much on the specific context. From observation, and speaking generally, the CUDA compiler is smart about where to spill. For example, in a deeply nested loop nest it will try to spill in the outermost loops only, in which case the performance impact is likely minimal. Likewise, spilling a few bytes (e.g. 4 or 8) is of no concern most of the time.

100+ bytes worth of spill storage, however, is cause for concern. All GPU architectures supported by CUDA 12 (compute capability >= 5.0) make 255 registers available for programs to use. The fact that ptxas here reports 128 registers used plus spilling therefore suggests that the programmer is imposing a limit of 128 registers. If so: Don't do that.
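For illustration, two common ways such a 128-register cap gets imposed (hypothetical kernel, numbers chosen as an example; removing the cap lets the compiler use up to 255 registers):

```cuda
// 1. Per-kernel cap via __launch_bounds__: promising 512 threads per
//    block with at least 1 resident block can force ptxas to limit the
//    kernel to 65536 / 512 = 128 registers per thread on a GPU with
//    64K registers per SM.
__global__ void __launch_bounds__(512, 1) capped_kernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;   // stand-in for a register-hungry body
}

// 2. Per-file cap via a compile flag (not source code):
//      nvcc -maxrregcount=128 kernel.cu
//    This limits every kernel in the compilation unit to 128 registers.
```

Dropping the `__launch_bounds__` qualifier or the `-maxrregcount` flag is often the quickest fix when large spills appear alongside a suspiciously round register count.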

Without access to the code in question, it is not really possible to say where the high register pressure originates. nvdisasm can show the live ranges of registers, which can be used to find the "fat" point(s) in the code and work backward from there.
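Assuming a CUDA toolkit install, one way to get at those live ranges (file names here are placeholders):

```shell
# Compile the kernel to a standalone cubin for the target architecture,
# then dump the disassembly annotated with register life ranges.
nvcc -arch=sm_80 -cubin -o kernel.cubin kernel.cu
nvdisasm --print-life-ranges kernel.cubin
```

The points where the most registers are simultaneously live are the places to start restructuring.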

Among common optimizations, loop unrolling in particular can drive up register usage, for example by scalarization of small arrays and by providing more opportunities for loads to be scheduled early (a useful optimization, but one that frequently requires additional registers for temporary storage between the point of load and the point of use). #pragma unroll can be used for fine control of loop unrolling.
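A sketch of using #pragma unroll to trade a little performance for lower register pressure (hypothetical kernel, for illustration):

```cuda
// Each thread processes 8 elements. Full unrolling would keep more loads
// in flight (more live temporaries, more registers); a partial unroll
// factor limits how many iterations are overlapped.
__global__ void axpy_tiled(const float *x, float *y, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

    #pragma unroll 2                   // partial unroll: fewer live temps
    for (int i = 0; i < 8; ++i) {      // than full unroll (#pragma unroll)
        if (base + i < n)
            y[base + i] += 2.0f * x[base + i];
    }
    // #pragma unroll 1 disables unrolling entirely, minimizing the
    // register footprint of the loop at some cost in throughput.
}
```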

Likewise, function inlining (which the CUDA compiler uses extensively) can cause register pressure to increase. This can be controlled on a per-function basis with the __noinline__ attribute.
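For example (hypothetical helper, sketched under the assumption that the helper is the register-hungry part):

```cuda
// Marking a register-hungry helper __noinline__ keeps its register
// demands from being merged into every call site. The call itself has
// overhead, so this is a trade-off worth measuring.
__device__ __noinline__ float heavy_helper(float v)
{
    // ... stand-in for a long computation with many live temporaries ...
    return v * v + 1.0f;
}

__global__ void caller(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] = heavy_helper(data[tid]);
}
```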
