I have a kernel that uses a recursive __device__ template function
(i.e. template void __device__ InsertSorted(unsigned char* buffer, unsigned char number)).
A few days ago I installed CUDA 5 and noticed a considerable slowdown of my code (8x). When analyzing the code with Nsight, I saw that the kernel now uses 59 registers instead of the 12 it used when compiled with CUDA 4. Naturally, this causes a slowdown due to low occupancy on the GPU.
Any ideas on why this is happening? Does the CUDA 5 compiler no longer support regular recursion now that dynamic parallelism is supported? Is there any way to fix this?
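
To make the occupancy concern concrete, here is a rough host-side sketch of how the register count alone bounds the number of resident threads per multiprocessor. The per-SM limits (32768 registers, 1536 threads, warp size 32) are assumptions that roughly match a compute capability 2.x device, not values read from my GPU. The names SmLimits and threadsLimitedByRegisters are made up for illustration, and the model ignores register allocation granularity, shared memory, and block-size limits.

```cpp
#include <algorithm>

// Hypothetical per-SM limits (assumed values, not queried from real hardware).
struct SmLimits {
    int registers;    // 32-bit registers available per SM
    int max_threads;  // maximum resident threads per SM
    int warp_size;    // threads per warp
};

// Upper bound on resident threads per SM when only register pressure is
// considered. Registers are handed out to whole warps in this simplified model.
int threadsLimitedByRegisters(const SmLimits& sm, int regs_per_thread) {
    int warps = sm.registers / (regs_per_thread * sm.warp_size);  // whole warps that fit
    return std::min(warps * sm.warp_size, sm.max_threads);
}

// Fraction of the SM's thread capacity that can be resident at once.
double occupancy(const SmLimits& sm, int regs_per_thread) {
    return static_cast<double>(threadsLimitedByRegisters(sm, regs_per_thread)) /
           sm.max_threads;
}
```

With the assumed limits {32768, 1536, 32}, 12 registers per thread leaves the full 1536 threads resident, while 59 registers per thread allows only 17 warps (544 threads), about 35% of capacity. That is the kind of drop I suspect is behind the slowdown.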