I wrote a matrix multiplication kernel. At first it used 124 registers and ran nice and fast.
The nightmare began when I tried to initialize the result elements with some bias values instead of zero. I do this before anything else, so it shouldn't alter the main logic at all. But the compiled kernel now uses 128 registers with 10 registers spilled, and the spilling happens in the inner loop!
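To make the change concrete, here is a heavily simplified sketch of the kind of kernel described above. It is not the real kernel: `TILE`, `matmul_bias`, and all parameter names are placeholders, and the per-column bias layout is an assumption. The only point is that the accumulators start from a loaded value rather than a constant, while the main loop is untouched.

```cuda
// Simplified sketch, not the real kernel. TILE, the kernel name and the
// parameter names are placeholders chosen for illustration.
#define TILE 4

__global__ void matmul_bias(const float* A, const float* B,
                            const float* bias, float* C,
                            int M, int N, int K)
{
    // Each thread computes a TILE x TILE block of C held in registers.
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;

    float acc[TILE][TILE];

    // The only difference from the zero-initialized version:
    // start from a per-column bias (assumed layout) instead of 0.0f.
    #pragma unroll
    for (int i = 0; i < TILE; ++i) {
        #pragma unroll
        for (int j = 0; j < TILE; ++j) {
            acc[i][j] = (col0 + j < N) ? bias[col0 + j] : 0.0f;
        }
    }

    // Main loop, identical in both versions.
    for (int k = 0; k < K; ++k) {
        #pragma unroll
        for (int i = 0; i < TILE; ++i) {
            float a = (row0 + i < M) ? A[(row0 + i) * K + k] : 0.0f;
            #pragma unroll
            for (int j = 0; j < TILE; ++j) {
                float b = (col0 + j < N) ? B[k * N + col0 + j] : 0.0f;
                acc[i][j] += a * b;
            }
        }
    }

    // Write the accumulated tile back to global memory.
    #pragma unroll
    for (int i = 0; i < TILE; ++i) {
        #pragma unroll
        for (int j = 0; j < TILE; ++j) {
            if (row0 + i < M && col0 + j < N) {
                C[(row0 + i) * N + col0 + j] = acc[i][j];
            }
        }
    }
}
```

In a sketch this small the compiler will not spill; the problem only shows up in the full kernel, where register pressure is already close to the limit.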
Can I tell PTXAS, in some way, not to spill any registers? I just want it to put the instructions in the best order without altering my logic.
I found that -O1 stops the register spilling. The resulting binary is actually faster than the zero-initialized version compiled at the default optimization level, even though that version doesn't have the spilling problem.
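For anyone who wants to reproduce the comparison, here is a minimal sketch of how register and local-memory usage can be checked. It assumes the -O1 above is the PTXAS optimization level passed through nvcc with `-Xptxas`; `dummy_kernel` and the file name `check.cu` are placeholders, and the real kernel would go in their place. `cudaFuncGetAttributes` reports registers per thread and local memory per thread, and a nonzero local size can come from spills or from local arrays.

```cuda
// Build with the default PTXAS optimization level, printing resource usage:
//   nvcc -Xptxas -v check.cu -o check
// Build with PTXAS at -O1 (assumed to be the flag meant above):
//   nvcc -Xptxas -O1,-v check.cu -o check_o1
// With -v, ptxas prints the register count and the spill store/load bytes
// for each kernel at compile time.

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the real matrix multiplication kernel.
__global__ void dummy_kernel(float* out)
{
    out[threadIdx.x] = 1.0f;
}

int main()
{
    // Query the compiled kernel's resource usage at run time.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, dummy_kernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n",
                    cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Comparing the two builds this way shows whether the lower optimization level really removes the local-memory traffic or just moves it.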
What does -O1 do, and what does it lack compared to -O3?