I am converting some C code to CUDA. When the source file is very large and has deep call chains, nvcc freezes in the last phase for a couple of minutes before it completes.
Is there a switch, configuration option, or coding style that can improve this?
nvcc inlines every function call in device code, so with a deep call chain it has to paste in the body of every function along that chain. The result is an extremely long stretch of code that has to go through the optimizer and be written out to the cubin.
There really isn’t any way around this. You could try the __noinline__ function qualifier (see the CUDA programming guide 2.0 beta), but it is only a hint and the compiler may ignore it.
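For reference, here is a minimal sketch of what the hint looks like in code. The helper scale_and_offset, the kernel apply, and the sizes are made up for illustration and are not from the original poster's code. The qualifier is only a request, so nvcc may still inline the function, and compile time may not change.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper. __noinline__ asks nvcc not to inline this function;
// it is a hint, so the compiler is free to inline it anyway.
__device__ __noinline__ float scale_and_offset(float x, float a, float b)
{
    return a * x + b;
}

// Hypothetical kernel that calls the helper once per element.
__global__ void apply(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale_and_offset(data[i], 2.0f, 1.0f);
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // Round the block count up so every element is covered.
    apply<<<(n + 127) / 128, 128>>>(dev, n);

    // This copy also waits for the kernel to finish.
    if (cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess) {
        fprintf(stderr, "kernel launch or copy back failed\n");
        cudaFree(dev);
        return 1;
    }
    cudaFree(dev);

    printf("host[3] = %f\n", host[3]);
    return 0;
}
```

If you try this, it probably makes sense to mark only the large helpers that are called from many places, since those are the ones whose bodies get duplicated most often when inlined.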
I found that compilation is really fast in device emulation mode, so I will stick with emulation during development and only compile for the device for final testing.