Is it possible to stop PTXAS from spilling registers?

I wrote a matrix multiplication kernel. At first it used 124 registers and ran nice and fast.

The nightmare began when I tried to initialize the result elements with bias values instead of zero. I did this before anything else, so it shouldn’t alter the main logic at all. But the compiled kernel used 128 registers with 10 registers spilled, and the spilling happened in the inner loop!

Can I tell PTXAS, in some way, not to spill any registers? Just put the instructions in the best order and don’t alter my logic.


I found that -O1 stops the register spilling. The -O1 binary is actually faster than the zero-initialized version compiled at the default optimization level, which doesn’t have the spilling problem.

What does -O1 do, and what does it lack compared to -O3?

If you have a well-defined (e.g. fairly short, but complete) example of such a scenario, the usual suggestion is to file a bug at developer.nvidia.com.

The differences in optimization level between -O1 and -O3 for PTXAS are not published AFAIK.

It partly doesn’t matter now. I found that the JIT compiler, which I’d use in the real project, compiles the kernel differently and doesn’t spill registers.

Well, I don’t think I can easily make a short example that uses 128 registers :( The source of the kernel has more than 2400 lines.

Actually, register spilling seems to be a very general problem once register pressure reaches some critical level. It was the major reason I gave up CUDA C and switched to PTX: I expected I could work around it by stating dependency chains more clearly. And it actually worked to some extent.

How did you know that 10 registers spilled?
Does Nsight Visual Studio Edition have more information about register spilling?

Enable verbose printout from ptxas and this info will be printed.

Actually, the output of the ‘-Xptxas -v’ or ‘--ptxas-options=-v’ option looks like this:
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
It doesn’t report the number of registers spilled, only the spill store and load traffic in bytes.

So if you spill 12 bytes per thread to lmem, then since each register is 4 bytes wide you can assume you have spilled 3 registers.
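To make that arithmetic concrete, here is a minimal Python sketch that pulls the byte counts out of the verbose ptxas line and divides them by 4. The function name summarize_spills and the sample string are made up for illustration, and the result is only a rough estimate: the byte totals need not map one-to-one onto distinct registers.

```python
import re

# Matches the detail line that ptxas prints in verbose mode, e.g.
# "0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads".
SPILL_LINE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)

REGISTER_BYTES = 4  # one 32-bit register


def summarize_spills(ptxas_output):
    """Return one dict per matching line with byte counts and 4-byte word equivalents."""
    results = []
    for match in SPILL_LINE.finditer(ptxas_output):
        stack, stores, loads = (int(g) for g in match.groups())
        results.append({
            "stack_bytes": stack,
            "spill_store_bytes": stores,
            "spill_load_bytes": loads,
            # Rough estimate only: bytes divided by the register width.
            "store_words": stores // REGISTER_BYTES,
            "load_words": loads // REGISTER_BYTES,
        })
    return results


# Hypothetical sample text, not output from the kernel discussed in this thread.
sample = "    12 bytes stack frame, 12 bytes spill stores, 24 bytes spill loads\n"
for entry in summarize_spills(sample):
    print(entry)
```

Feeding it the saved stderr of a verbose build would give one entry per compiled function; matching each entry to its function name is left out to keep the sketch short.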

This might be a good read:
http://developer.download.nvidia.com/CUDA/training/register_spilling.pdf

That number is more like how many times spilling happens.
OK, thanks.
Is register spilling common in programs? I compiled several benchmarks but didn’t see any registers spilled.

It’s quite common on large kernels and can sometimes be detrimental to performance.

On older architectures, before L1/L2/texture caches alleviated the cost, register spilling was usually game over (IMO).

Regarding “how many times”, you might be able to get a clear idea based on the SASS.
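As a rough way to do that, the sketch below counts local-memory instructions in a SASS listing saved as text (for example from cuobjdump -sass). LDL and STL are the SASS local load and store opcodes; the function name count_local_ops and the sample listing are hypothetical. The count is static, so it says nothing about how often an instruction inside a loop executes, and local arrays produce LDL/STL too, so not every hit is a spill.

```python
import re
from collections import Counter

# LDL / STL are the local-memory load / store opcodes in SASS.
# A modifier such as ".64" may follow the opcode, so match on word boundaries.
LOCAL_OP = re.compile(r"\b(LDL|STL)\b")


def count_local_ops(sass_text):
    """Count lines containing a local load (LDL) or local store (STL)."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = LOCAL_OP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical listing for illustration, not taken from the kernel in this thread.
sample_sass = """
        /*0040*/                   STL [R1+0x4], R8 ;
        /*0048*/                   FFMA R0, R2, R3, R0 ;
        /*0050*/                   LDL R8, [R1+0x4] ;
        /*0058*/                   LDL.64 R10, [R1+0x8] ;
"""
print(dict(count_local_ops(sample_sass)))
```

To see whether the spills sit in the inner loop, you would still need to look at where the LDL/STL lines fall relative to the loop’s branch instructions.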