Slow compilation with multiple calls of the same function

I have a problem with one of my device functions. I have to call this function again and again, each time on the output of the previous call, so splitting the kernel and moving data back and forth is not really an option. But with each additional call within the kernel, NVCC takes longer and longer to compile.

With a single call, the kernel compiles within a couple of minutes; with two calls, compilation already takes almost an hour. The runtime is below 0.1 milliseconds, the kernel produces the expected results, and cuda-memcheck doesn't report any errors. With my current settings it's far from exhausting any of the memory limits, and calling the function again should not increase the resource demands of the kernel.

I call nvcc like this:

nvcc -arch=sm_20 file

Does anyone have a clue what might be the cause of this exponential increase in compile time and how I might reduce it? Thanks in advance!

You did not specify the size of the kernel, but a compile time of one hour seems excessive (not to mention an obstacle to productivity). If this happens with the CUDA 4.0 toolchain, it would be helpful if you could file a bug against the compiler, attaching a self-contained repro case, so our compiler team can take a look. Please also state the compiler switches used by the build, as well as the platform you are building on. Thank you for your help, and sorry for the inconvenience.
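For reference, a self-contained repro case could have roughly the following shape. This is only a sketch: `step`, `chained_kernel`, and `run_repro` are hypothetical names, and the trivial body of `step` stands in for the real device function, which was not posted. The `__noinline__` qualifier is an experiment rather than a confirmed fix. On compute capability 2.x it hints to the compiler not to inline the function at every call site, and whether that shortens compile time in this case is an assumption that would need to be measured.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the expensive device function.
// __noinline__ is only a hint; remove it to compare compile times.
__device__ __noinline__ float step(float x)
{
    return 0.5f * x + 1.0f;
}

// Each call consumes the output of the previous call, as in the question.
__global__ void chained_kernel(float *data)
{
    float v = data[threadIdx.x];
    v = step(v);   // first call
    v = step(v);   // second call on the previous result
    data[threadIdx.x] = v;
}

// Host helper: runs the kernel on a single value and returns the result.
// Error checking is omitted to keep the sketch short.
float run_repro(float input)
{
    float *d = 0;
    float out = 0.0f;
    cudaMalloc((void **)&d, sizeof(float));
    cudaMemcpy(d, &input, sizeof(float), cudaMemcpyHostToDevice);
    chained_kernel<<<1, 1>>>(d);
    cudaMemcpy(&out, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}
```

Since only compilation is of interest, the file can be built without linking, for example with `nvcc -arch=sm_20 -c repro.cu`, once with and once without `__noinline__`, adding further `step` calls to see how the compile time scales.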