Long compile time after changing some floats to doubles

Hi all,

I changed some device code to use double precision arrays instead of single precision arrays, and this has resulted in compile times of about 10 min for the device code, where it used to take about 1 min. Reverting the code to single precision restored the 1 min compile times. Has anyone else experienced a similar issue? This is using the CUDA 4.0 Toolkit, a Fermi board, and Windows XP 64-bit. Has anyone found a parameter or compile option that makes compilation go faster, or an explanation of what the problem is?


Are you using functions like sin(), exp() etc. a lot? These get inlined, and the double-precision versions are usually longer in order to achieve the higher precision, so the compiler has to work harder with double precision.
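To illustrate the point (a minimal sketch, not code from the original posts; kernel and variable names are made up): each call to a double-precision math function is inlined as a much longer instruction sequence than its single-precision counterpart, so a kernel with many such calls grows substantially when switched to double.

```cuda
// Hypothetical kernel: on Fermi, sin()/exp() in double precision are
// inlined as long instruction sequences, while the single-precision
// sinf()/expf() variants expand to far less code.
__global__ void transform_d(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sin(in[i]) + exp(in[i]);   // double precision: large inlined body
}

__global__ void transform_f(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinf(in[i]) + expf(in[i]); // single precision: shorter sequences
}
```

If full double precision is only needed in a few places, keeping the remaining math in single precision (sinf(), expf()) avoids most of this code growth.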

Yes I do have a few sin(), exp() etc., maybe that is the issue. Thanks for your reply!


My kernel also takes about 10+ minutes to compile.

My kernel uses double precision arithmetic like __dmul_ru(), __ddiv_ru(), nanf(). Can this be causing such high compilation times?

Also, at the end of compilation it gives the error "Entry function 'FUNC_NAME' uses too much local data (0x7490 bytes, 0x4000 max)".

Can this be a cause for heavy compilation time?



Which compute capability is your device? This sounds more like a problem with excessive inlining (which the thread starter might have run into as well). The functions you name more or less directly map to machine instructions and shouldn't stress the compiler too hard.

If compiling for compute capability 2.x, try declaring a few strategic device functions as __noinline__.
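As a sketch of what that looks like (function and kernel names here are hypothetical): on compute capability 2.x, where the ABI supports true function calls, __noinline__ keeps a large device function's body out of line instead of duplicating it at every call site.

```cuda
// Hypothetical helper: __noinline__ tells the compiler to emit this as a
// real function call on sm_2x rather than inlining it, which can shrink
// the kernel considerably and reduce compile time.
__device__ __noinline__ double expensive_step(double x)
{
    return __dmul_ru(x, x) + sin(x);
}

__global__ void kernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = expensive_step(data[i]);  // compiled as a call, not inlined
}
```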

I think I partially figured out the reason for slow compilation.

My compile target was set to 1.3, which has a limit of 16 KB on local memory (hence the 0x4000 max in the error). I changed it to sm_20, which has a local memory limit of 512 KB. This completely removes the delay. I guess when the compiler sees an overshoot in local memory space it tries harder to fit everything in, thus taking more time?
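For reference (the source file name here is made up), the target architecture is selected with nvcc's -arch option:

```
# sm_13 target: double precision supported, but only 16 KB of local memory per thread
nvcc -arch=sm_13 -c kernel.cu

# sm_20 target: Fermi, 512 KB local memory limit per thread
nvcc -arch=sm_20 -c kernel.cu
```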

Anyway, it's better now.


If you are using CUDA 4.1, differences in build time may also be a function of two different frontends being used. For sm_1x, the Open64 frontend is used, while for sm_20 and higher the NVVM frontend is used. I am taking the fact that you are hitting the local memory limit as an indication that this is fairly hefty code (as tera explained, aggressive inlining can contribute significantly to code size). The long build times are likely simply a function of the amount of code and data that must be manipulated.

If you are seeing build times exceeding ten minutes per file on a reasonably fast modern system with CUDA 4.1, I would recommend filing a bug so the compiler team can investigate. Please attach a self-contained repro case that demonstrates the lengthy build time.