What a pain in the ‘ptxas’…
Can’t catch a freaking break here. I’ve got a kernel that was working fine, and I figured out a way to rework some chunks of the code to increase ILP and speed it up. The new code does the exact same work as the old code; it just layers the variables differently to minimize execution dependencies. I can replace a few chunks of the old math with the new version and everything is still fine. But if I replace all the old chunks with the new ones, I suddenly get slammed with the following error when I try to build:
error : 'ptxas' died with status 0xC00000FD (STACK_OVERFLOW) | CUDACOMPILE
That sounds… uh… not good.
I put together as dirt-simple a test kernel as I could manage, in the hopes I could repro and isolate the actual issue (and then dance around it). I’ve got it boiled down to a small handful of working variables (all unsigned ints, to be specific) that just do the same 16 lines of math over and over. Once I paste enough of those chunks of math into the kernel, the compiler blows up, every time.
Specifically, it seems to happen if I have anything that resembles a for() or while() loop after the compute chunks. It doesn’t even matter whether I’m touching the working vars inside those loops or not; just the presence of a for() after all the math and the compiler fails. My actual kernel code also blows up from any kind of if() statement that checks any of the working variable values after the math. (The test kernel for some reason isn’t bothered by the if(), but there can’t be that many stack overflow bugs hiding in the compiler to run into, so I assume it’s the same issue.)
If I remove the loop: suddenly worky, worky. If I remove a few lines of the math: also worky, worky.
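For anyone who wants to poke at this, the shape of the test kernel described above could be sketched roughly as follows. This is only an illustration under assumptions: the math inside CHUNK is a made-up stand-in (the real 16 lines are not shown in this post), and the names repro, ROTL, CHUNK, and the repeat count of 64 are all hypothetical. Whether this particular stand-in trips the same ptxas stack overflow is untested; the idea is just to raise the repeat count and keep the trailing for() in place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 32-bit rotate-left helper (hypothetical; part of the stand-in math only).
#define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

// Hypothetical 16-statement math chunk on four unsigned int working vars.
// ASSUMPTION: the real chunk from the post is not shown; this is a placeholder.
#define CHUNK()                                       \
    a += b; d ^= a; d = ROTL(d, 16); c += d;          \
    b ^= c; b = ROTL(b, 12); a += b; d ^= a;          \
    d = ROTL(d, 8); c += d; b ^= c; b = ROTL(b, 7);   \
    a ^= c; b += d; c ^= a; d += b;

// Paste the chunk many times as straight-line code, like the hand-pasted copies.
#define CHUNK4()  CHUNK()   CHUNK()   CHUNK()   CHUNK()
#define CHUNK16() CHUNK4()  CHUNK4()  CHUNK4()  CHUNK4()
#define CHUNK64() CHUNK16() CHUNK16() CHUNK16() CHUNK16()

__global__ void repro(unsigned int *out, unsigned int seed, int n)
{
    // Small handful of working variables, all unsigned int.
    unsigned int a = seed + threadIdx.x;
    unsigned int b = seed ^ 0x9E3779B9u;
    unsigned int c = seed * 3u;
    unsigned int d = ~seed;

    // Repeat count is a guess; raise it until the build fails (if it does).
    CHUNK64()

    // A loop placed after all the math. It deliberately does not touch
    // the working variables a, b, c, d.
    unsigned int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += (unsigned int)i;
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = a ^ b ^ c ^ d ^ acc;
}

int main()
{
    const int threads = 64;
    unsigned int *d_out = nullptr;

    if (cudaMalloc(&d_out, threads * sizeof(unsigned int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    repro<<<1, threads>>>(d_out, 0x12345678u, 4);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    unsigned int h_out[threads];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = 0x%08x\n", h_out[0]);

    cudaFree(d_out);
    return 0;
}
```

Building the same file once per target (for example with nvcc's -arch=sm_30 versus -arch=sm_50) would show whether the failure depends on the compute capability, as it does for the real kernel.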
After spending hours playing around with the test kernel, I can’t seem to find a way around this, so…
Has anybody run across a stack overflow in the ptxas part of the compiler before?
Other Info:
I’m on Win7 SP1 (x64) with VS2013. Compiling for CC2.0, CC3.0, or CC3.5 blows up every time, but CC5.0 somehow makes it through. I was on CUDA 8.0.44 when I hit this, but I just upgraded to CUDA 8.0.61.2 and still no joy. (My main GFX card is Fermi, or I would have jumped to the 9.somethin’.)