99% of the code will always be the same. It is a “library of functions”.
1% of the code is generated at run time and really depends on the parameters (shading language).
So, I would like to know whether:
a) I can generate some CUDA code at run time and then compile it;
b) I can avoid including the “library of functions” and recompiling it each time, and simply link against it instead! For example, I compile only the library, save it as PTX (or any other format), and later, when I have to compile the remaining 1%, I tell CUDA to use that PTX.
Thanks… but it does not really help me solve my problem (the library currently takes 12 minutes to compile). So, I need a way to link my CUDA C code against the library (without including the library source).
I think the problem here is that CUDA has no linker on the device side, so code that has been broken into chunks cannot be linked back together.
You should be able to work around this by compiling to PTX, then prepending your previously compiled PTX “library” and using inline PTX assembly to call the library routines.
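Roughly, the run-time side could look like this (a minimal sketch using the driver API; the file names and the kernel name are placeholders, and it assumes the two PTX fragments together form one valid PTX module, i.e. the generated part is emitted without its own .version/.target header; error checking omitted):

```cpp
#include <cuda.h>
#include <fstream>
#include <sstream>
#include <string>

// Read a PTX file into a string (paths are placeholders).
static std::string readFile(const char* path) {
    std::ifstream in(path);
    std::stringstream ss;
    ss << in.rdbuf();
    return ss.str();
}

int main() {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Prepend the precompiled "library" PTX to the freshly generated PTX.
    std::string ptx = readFile("library.ptx") + readFile("generated_shader.ptx");

    // JIT-compile the combined PTX and fetch the kernel by name.
    CUmodule mod;
    cuModuleLoadData(&mod, ptx.c_str());
    CUfunction kernel;
    cuModuleGetFunction(&kernel, mod, "myShaderKernel");

    // ... set up arguments and launch with cuLaunchKernel ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```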
nvcc inlines all device code when it compiles (hence there are no functions that could be called from kernel code compiled separately). I do not know whether NVIDIA is planning to change this. Would passing function pointers allow device functions to be called externally?
In my case each separately compiled chunk of code is a different kernel.
In the beginning, Jacket compiled CUDA on-the-fly. It was cumbersome, but good enough to provide speedups. Several years ago, we switched to emitting PTX directly in both Jacket and LibJacket, and the compile time is such a small fraction of the total wall time for most problems that the result is as good as well-written, statically compiled code.
I’m also a developer at AccelerEyes working with Jacket and LibJacket. Since most programs run roughly the same code and data types repeatedly, the JIT typically uses cached versions of the instruction sequences. Only in the first few iterations does it recompile anything; after that it settles into a steady state pulling from the cache. So the compile time effectively disappears in Jacket and LibJacket.
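The general idea (my own simplified sketch, not Jacket’s actual code) is that the host only pays the JIT cost on a cache miss:

```cpp
#include <cuda.h>
#include <string>
#include <unordered_map>

// Map from generated PTX text to its JIT-compiled module.
static std::unordered_map<std::string, CUmodule> moduleCache;

// Return a module for the given PTX, compiling it only on a cache miss.
CUmodule getModule(const std::string& ptx) {
    auto it = moduleCache.find(ptx);
    if (it != moduleCache.end())
        return it->second;                 // steady state: no compilation at all

    CUmodule mod;
    cuModuleLoadData(&mod, ptx.c_str());   // JIT happens only here
    moduleCache[ptx] = mod;
    return mod;
}
```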
Imagine that I have a shading language, so the user can ‘implement’ his own shaders… This gives me the following workflow:
Shading language => CUDA => PTX => …
But the shading language has a lot of built-in functions. The problem is that today I have to redo the whole workflow each time the user changes a single line in his shader!
It takes 12 minutes to compile with only a few simple shaders!
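For illustration, the CUDA => PTX step could be driven at run time roughly like this (a hedged sketch; the file names, the -arch flag, and invoking nvcc as an external process are assumptions, and error handling is minimal):

```cpp
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Write the CUDA C generated from the shader to disk and compile it to PTX
// by invoking nvcc as an external process.
std::string shaderToPtx(const std::string& generatedCuda) {
    std::ofstream cu("shader_generated.cu");
    cu << generatedCuda;
    cu.close();

    // "-ptx" stops after PTX generation; "-arch" should match the target GPU.
    int rc = std::system("nvcc -ptx -arch=sm_20 shader_generated.cu -o shader_generated.ptx");
    if (rc != 0) {
        std::fprintf(stderr, "nvcc failed\n");
        std::exit(1);
    }
    return "shader_generated.ptx";   // loaded later with cuModuleLoad
}
```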
What I would like is a kind of CUDA DLL :-P That way I could put all the functions in the CUDA_DLL and use it from my kernel.
BTW, I know about the CUDA cache and it works very well… but it is not really a solution, because I still have to rebuild everything in a lot of cases!
For now I’m working with OpenCL, so rewriting everything in CUDA requires a lot of work, and generating PTX directly is even more work than generating CUDA C!
I’m not sure I have the time and budget for this!
Read tera’s comment again and follow his advice. Once you have learned how to do it you’ll realize it’s really fast and efficient.
If the GPU’s page tables remain unmodified across different kernels in the same context (i.e., memory permissions stay the same), doing dynamic linking on the GPU should be as straightforward as on the CPU. But I would never recommend going that way unless even PTX compilation takes unacceptably long, because function calling on the GPU is far more complicated than on x86 CPUs, due to the large number of registers that must be taken care of when control is transferred from one piece of code to another.
NVIDIA claims ptxas can compile 10,000 lines of code per second (on a specified platform, of course), which should be enough unless you are writing a GPU operating system.
Funny, that reminds me of the idea of emulating a full 3D GPU, including shaders, using CUDA kernels, in order to offer a full 3D GPU to VMs in an OS-independent way (say, a Windows VM using DirectX, hosted on a Linux system, transparently, with a “CUDA-emulated” 3D driver in the guest OS that actually executes on the host, without 3D API translation/emulation).
One of the interesting aspects is being able to share a single 3D card between many VMs running 3D applications, using the concurrent kernel execution in CUDA 4.0.