A few days ago I installed CUDA 5 and noticed a considerable slowdown of my code (8x). When analyzing the code with Nsight I saw that the kernel is using 59 registers instead of the 12 registers it used when compiled with CUDA 4. Naturally, this causes a slowdown due to low occupancy on the GPU.
Any ideas on why this is happening? Does the CUDA 5 compiler no longer support regular recursion now that dynamic parallelism is supported? Is there any way to fix this?
Is this function recursive in the general sense or end-recursive (i.e. tail-recursive)? In the latter case the compiler may have turned it into an iteration in one case but not the other. I doubt that it has anything to do with the introduction of dynamic parallelism, which is restricted to sm_35, an architecture that wasn’t supported by CUDA 4.0.
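To make the distinction concrete, here is a minimal sketch of a tail-recursive function next to its hand-written iterative equivalent. The function names and the summation example are hypothetical and have nothing to do with the original poster's kernel; the HD macro is only there so the same source compiles as plain C++ and, under nvcc, as host/device code.

```cpp
// Expands to CUDA qualifiers under nvcc, to nothing in plain C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Tail-recursive form: the recursive call is the last action performed,
// so nothing from the current frame is needed once the call returns.
HD inline unsigned sum_to_recursive(unsigned n, unsigned acc = 0u)
{
    if (n == 0u) return acc;
    return sum_to_recursive(n - 1u, acc + n);   // tail call
}

// Hand-written loop computing the same result (sum of 1..n).
HD inline unsigned sum_to_iterative(unsigned n)
{
    unsigned acc = 0u;
    while (n != 0u) {
        acc += n;
        --n;
    }
    return acc;
}
```

Both functions return the same value for every n. Whether a given compiler version actually converts the first form into the second is an optimization decision, which is why the generated code can differ between toolchains.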
Is the above comparison data from a controlled experiment, where other than a change in toolchain no other changes whatsoever have occurred? In particular, no changes to the source code and no changes to compiler switches (e.g. optimization level, -use_fast_math, architecture target)? If so, this looks like a candidate for a compiler bug and I would recommend filing a bug report.
The comparison data was part of a controlled experiment.
It was an “end recursion”, and unrolling it into a loop by hand did “save” registers compared to letting the compiler do it.
There is definitely a difference between the way CUDA 4.0 and CUDA 5.0 optimize this code, whether it is written as a recursion or as an iterative loop. I’m no expert on PTX code, but looking at it, it seems to me that the CUDA 5.0 compiler does not optimize register use as it “should”.
Furthermore, it seems that switching the GPU debug info flag (-G) yields code with fewer registers (17 instead of 41). Weird…
All in all, it seems like very strange behavior of the new CUDA compiler…
One cannot determine anything about register usage from looking at PTX. The CUDA toolchain generates PTX in SSA form, where a new virtual register is assigned for every result written. A virtual register is just a typed variable name. Allocation of 32-bit physical registers happens during the translation from PTX to SASS (machine language), which is performed by the PTXAS component of the compiler.
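As a rough illustration of why counting virtual registers in SSA code says little about physical register use, here is a toy model. It is not how PTXAS works; the Instr type and both functions are hypothetical names made up for this sketch, and it only handles straight-line code. It compares the number of virtual registers defined with the peak number of values that are live at the same time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One SSA-style instruction: defines virtual register `def`
// and reads the virtual registers listed in `uses`.
struct Instr {
    int def;
    std::vector<int> uses;
};

// In SSA form every instruction writes a fresh virtual register,
// so the virtual register count equals the instruction count.
inline int count_virtual(const std::vector<Instr>& code)
{
    return static_cast<int>(code.size());
}

// Peak number of values live right after any instruction. A value is
// counted from its definition until its last use; a value that dies at
// an instruction may share a register with that instruction's result.
inline int peak_live(const std::vector<Instr>& code)
{
    const long n = static_cast<long>(code.size());
    int max_id = -1;
    for (const Instr& ins : code) {
        max_id = std::max(max_id, ins.def);
        for (int u : ins.uses) max_id = std::max(max_id, u);
    }
    std::vector<long> def_at(static_cast<std::size_t>(max_id + 1), -1);
    std::vector<long> last_use(static_cast<std::size_t>(max_id + 1), -1);
    for (long i = 0; i < n; ++i) {
        def_at[code[i].def] = i;
        for (int u : code[i].uses) last_use[u] = i;
    }
    int peak = 0;
    for (long i = 0; i < n; ++i) {
        int live = 0;
        for (int v = 0; v <= max_id; ++v) {
            const bool defined = def_at[v] >= 0 && def_at[v] <= i;
            const bool needed = last_use[v] > i || def_at[v] == i;
            if (defined && needed) ++live;
        }
        peak = std::max(peak, live);
    }
    return peak;
}
```

For a chain of four dependent instructions, where each one reads only the previous result, this model reports four virtual registers but a peak of one live value. The real allocator has to deal with control flow, register types and widths, and scheduling, so actual numbers have to come from the SASS or from the compiler's own resource usage report.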
Without access to compilable source code it is impossible for me to comment on what SASS register use would be “reasonable” for this code.