very slow compile

Recently, I converted some C code to CUDA, but it now takes a very long time to compile. Using the “top” command, I found that after the “be” process finished in several minutes, the “ptxas” process continued to run and used most of the memory. The compile still had not finished after one hour. Has anybody else encountered this problem?

My guesses are as follows:

  1. The host calls a global function, the global function calls a device function, and that device function calls further device functions. Nested calls like this may go 10 layers deep, and there are many calls to device functions.

  2. The C code used many global variables, so in the CUDA code I did not use cudaMalloc to allocate device memory. Instead, I declare “__device__ float *a;” at file scope and use “a = new float [n]” in a global function to allocate global device memory, so I can use the global device array “a” directly instead of passing it to many global/device functions (a minimal sketch follows this list). n is a large number and there are many arrays like “a”. During debugging, I kept only the code that allocates and frees the device memory, and found that the compile did not take long; therefore, memory allocation is probably not the key problem.

  3. The CUDA code used many macros. However, after I deleted all the macros, there was no fundamental change.
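For reference, here is a minimal sketch of the allocation pattern I describe in guess 2 (the names “a”, “alloc_kernel”, “free_kernel”, and “n” are just placeholders):

__device__ float *a;          // file-scope device pointer, visible to all kernels

__global__ void alloc_kernel(int n)
{
    // a single thread allocates from the device malloc heap (requires sm_20 or newer)
    if (blockIdx.x == 0 && threadIdx.x == 0)
        a = new float[n];
}

__global__ void free_kernel()
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        delete[] a;
}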

From this debugging, the key problem probably comes from my first guess, or from something else.
Has anybody encountered problems like this?
My boss is very anxious to see the results, so please help!

If you are compiling for a compute capability 2.x device, your device functions really are that deeply nested, and the compiler apparently inlines too much, it may be beneficial to declare a few device functions as __noinline__ to break up the hierarchy. That would not only benefit the compile time, but also relieve pressure on the instruction cache at runtime.
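A rough sketch of what that looks like (the function names are made up, of course):

__device__ __noinline__ float deep_helper(float x)
{
    // emitted as a real function call instead of being inlined at every call site
    return x * x + 1.0f;
}

__device__ float middle_helper(float x)
{
    return deep_helper(x) + deep_helper(x + 2.0f);
}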

You might also want to check how many different device architectures you are compiling for (the --generate-code or -gencode options to nvcc).

As tera mentions, the most likely cause of excessive compile times is massive code size due to inlining. This can be inhibited by the programmer through the use of appropriate attributes as mentioned.

If this issue occurs with CUDA 4.1, I think it would make sense for our compiler team to investigate given that the compile time clearly exceeds 10 minutes, and I would suggest you file a bug attaching a self-contained repro case. Thanks!

Thanks, tera and njuffa! “__noinline__” indeed helped me decrease the compile time, but it still takes 4 minutes to compile, which I cannot afford. Is there any way to make it shorter?

Besides, as I mentioned, I use a different way to allocate device memory.

I declare “__device__ float *a;” outside of any function as a global array variable and use “a = new float [n]” in a global function to allocate global device memory. This way I can use the global device array “a” directly instead of passing it to many global/device functions. However, n is a large number and there are many arrays like “a”, so I may hit a memory limit when I use this “new” method to allocate memory.

How can I remove this limit? Must I use cudaMalloc to allocate large global memory and pass it to the kernel? Because there are so many arrays like “a”, the parameter list of the kernel would become much longer.

Where does the memory space for “a” come from? Global memory, or somewhere like the “heap” in C++?

Thanks!

You can control the amount of memory available for in-kernel malloc() and new via cudaDeviceSetLimit(cudaLimitMallocHeapSize, …).
For large arrays, cudaMalloc() seems the more natural solution, particularly if these aren’t per-thread allocations. If the parameter list for your kernel becomes too long, you can store all the pointers in a structure that you pass to the kernel instead of the individual pointers.
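An untested sketch of both ideas (the struct and kernel names, the heap size, and the array size are just examples):

#include <cuda_runtime.h>

struct DeviceArrays {
    float *a;
    float *b;
    float *c;
};

__global__ void my_kernel(DeviceArrays arr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        arr.c[i] = arr.a[i] + arr.b[i];
}

int main()
{
    // option 1: enlarge the heap backing in-kernel malloc()/new, here to 256 MB
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024);

    // option 2: allocate with cudaMalloc() and hand all pointers over in one struct
    const int n = 1 << 20;
    DeviceArrays arr;
    cudaMalloc(&arr.a, n * sizeof(float));
    cudaMalloc(&arr.b, n * sizeof(float));
    cudaMalloc(&arr.c, n * sizeof(float));

    my_kernel<<<(n + 255) / 256, 256>>>(arr, n);
    cudaDeviceSynchronize();

    cudaFree(arr.a);
    cudaFree(arr.b);
    cudaFree(arr.c);
    return 0;
}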

I still think it would make sense to file a bug here to have the compiler team look into the lengthy compile times. There may be a need to tweak heuristics or to further optimize certain compiler components, and this appears to be a valuable real-life test case. The compiler team may also be able to suggest additional workarounds to reduce compilation time (although compile time of 4 minutes doesn’t strike me as unusual for code that is fairly lengthy and involved, per the initial description).

There is a link to the bug form on the registered developer website. Please attach your original code (which took an hour to compile) and optionally, your modified version (which takes four minutes to compile). Thank you for your help.

Thanks very much, tera and njuffa!

cudaDeviceSetLimit works!

To njuffa: I am so sorry that I cannot attach my code, at least not for now.

Thanks!

I should have phrased that more clearly. I did not suggest posting the code in this public forum. I meant that when you file a bug (which I still recommend), please attach self-contained repro code to the bug report. Visibility of the contents and attachments of bug reports is restricted to the filer and NVIDIA personnel. I am happy to hear that with the workarounds in place, you now get acceptable behavior.