Generate CUDA at run-time?


My application has:

  1. 99% of the code that is always the same. It is a “library of functions”.
  2. 1% of the code that is generated at run time and really depends on the parameters (a shading language).

So, I would like to know if:
a) I can generate some CUDA code at run time and then compile it.
b) I can avoid including the “library of functions” and recompiling it each time, and instead simply link against it! For example, I compile only the library, save it as PTX (or some other format), and later, when I have to compile the remaining 1%, I tell CUDA to use that PTX.
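
As a point of reference, workflow (a) can be sketched with the CUDA driver API: write the generated kernel to disk, compile it to PTX with an out-of-process nvcc call, then load the PTX into the running application. All paths and the kernel name below are illustrative assumptions, not something from this thread, and error checking is omitted:

```cuda
// Sketch only: generate CUDA C at run time, compile it to PTX with an
// external nvcc invocation, and load the result via the driver API.
#include <cuda.h>    // CUDA driver API
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1. Write the runtime-generated kernel source (the 1% part). */
    FILE *f = fopen("/tmp/shader.cu", "w");
    fprintf(f, "extern \"C\" __global__ void shader(float *d)"
               "{ d[threadIdx.x] *= 2.0f; }\n");
    fclose(f);

    /* 2. Compile it to PTX out of process. */
    system("nvcc -ptx /tmp/shader.cu -o /tmp/shader.ptx");

    /* 3. Load the PTX and fetch the kernel handle. */
    cuInit(0);
    CUdevice  dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule  mod;  cuModuleLoad(&mod, "/tmp/shader.ptx");
    CUfunction k;   cuModuleGetFunction(&k, mod, "shader");
    /* k can now be launched with cuLaunchKernel(...). */
    return 0;
}
```

This covers (a); the open question in the thread is (b), i.e. how to make the preloaded module call into a separately compiled library.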


Tony E Lewis had a paper at this year’s CIGPU which showed that compiling PTX was rather faster than compiling CUDA C.

Thanks… but it does not really help me solve my problem (the current library takes 12 minutes to compile). So I need a way to link my CUDA C code with the library (without including the library’s code).

Hmm, I guess I am misunderstanding.

You said:

Yes. I have created C code at run time (lots of printf) and compiled the code using nvcc.

Tony has shown he can generate PTX at run time and compile that. (Compilation of PTX is faster than compilation of C.)

nvcc seems to have a preferred code size. It may be that you can speed up the compilation by breaking it into chunks and giving each chunk to nvcc one at a time. Chunks of about 2000 lines each seemed like a good compromise. I am using gcc and Linux. gcc (i.e. the unix linker) took very little time to link ten files (plus my host code) into one executable.

[Ages ago nvcc was compiling 329 C lines per second, but this will be highly …]


I think the problem here is that CUDA has no linker on the device side, so the code broken into chunks cannot be linked back together.

You should be able to work around this by compiling to PTX, then prepending your previously compiled PTX “library” and using inline PTX assembly to call the library routines.
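
The host side of that workaround might look like the following sketch: merge the previously compiled library PTX with the freshly generated shader PTX into one text, and let the driver JIT-compile the combined module, so ptxas can resolve the calls between them. The function names and the naive text merge are illustrative assumptions; real code must strip the duplicate `.version`/`.target` header from the second PTX text, and error checking is omitted:

```cuda
// Sketch only: prepend the precompiled "library" PTX to the
// runtime-generated shader PTX and JIT the merged text.
#include <cuda.h>    // CUDA driver API
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

CUmodule loadMergedPtx(const char *libraryPtx, const char *shaderPtx)
{
    /* Naive merge: library text first, shader text appended.
       (A real implementation must remove the shader's duplicate
       .version/.target header lines before concatenating.) */
    size_t n = strlen(libraryPtx) + strlen(shaderPtx) + 2;
    char *merged = (char *)malloc(n);
    snprintf(merged, n, "%s\n%s", libraryPtx, shaderPtx);

    /* The driver JIT-compiles the merged PTX. This is the cheap step,
       compared with recompiling the 99% library from CUDA C. */
    CUmodule mod;
    cuModuleLoadDataEx(&mod, merged, 0, NULL, NULL);
    free(merged);
    return mod;
}
```

The appeal of this route is that the 12-minute library compilation happens once, offline; only the small shader and the final PTX-to-SASS step are paid per change.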

The unix linker used to be quite happy to link together multiple .o files created by nvcc. It seems to have no problem with make rules like:

/tmp/gp.exe: test.o testgpu.o gpu.o \
        /tmp/pop_0.o /tmp/pop_1.o /tmp/pop_2.o /tmp/pop_3.o /tmp/pop_4.o \
        /tmp/pop_5.o /tmp/pop_6.o /tmp/pop_7.o /tmp/pop_8.o /tmp/pop_9.o
	$(CC) $(CFLAGS) -o /tmp/gp.exe test.o testgpu.o gpu.o \
        /tmp/pop_0.o /tmp/pop_1.o /tmp/pop_2.o /tmp/pop_3.o /tmp/pop_4.o \
        /tmp/pop_5.o /tmp/pop_6.o /tmp/pop_7.o /tmp/pop_8.o /tmp/pop_9.o \
        $(NVFLAGS) -lm

where /tmp/pop_0.o etc. are created by nvcc.

Yes, but that links the host code, not the device code. Or have you found a way to call a device function from a kernel in a different file?

Ok. I think I understand your point.

nvcc inlines all the device code when it compiles it (hence there are no functions that could be called by kernel code compiled separately). I do not know if nVidia is planning to change this. Would passing function pointers allow device functions to be called externally?
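
Within a single module, a device function pointer is straightforward on Fermi-class hardware (sm_20 and later); whether a pointer obtained in one separately compiled module can be called from a kernel in another is exactly the open question. A minimal within-module sketch, with illustrative names:

```cuda
// Sketch only: an indirect call through a device function pointer
// (requires compute capability 2.0+; compile with e.g. -arch=sm_20).
typedef float (*op_t)(float);

__device__ float twice(float x) { return 2.0f * x; }

// Pointer set up at module scope; a kernel could also rewrite it.
__device__ op_t d_op = twice;

__global__ void apply(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = d_op(v[i]);   // indirect call, not inlined by nvcc
}
```

The indirect call defeats nvcc's forced inlining, but the target still has to live in the same compiled module, so this alone does not give cross-module linking.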

In my case each separately compiled chunk of code is a different kernel.

Thanks for all your answers,

Yes, the problem is that I need a “DEVICE LINKER” to link different PTX code for the device (not the host).

Is there a solution ?

In the beginning, Jacket compiled CUDA on-the-fly. It was cumbersome, but was good enough to provide speedups. Several years ago, we switched to emitting PTX in both Jacket and LibJacket and the compile time is such a small fraction of the total wall time for most problems that it is as good as well-written statically compiled code.

I’m also a developer at AccelerEyes working with Jacket and LibJacket. Since most programs are running roughly the same code and data types, we typically see JIT using cached versions of instruction sequences. Only in the first few iterations would it recompile stuff, and then it just settles into a steady state pulling from cache. So the compile time disappears completely in Jacket and Libjacket.

Thanks for the clarification,

Imagine that I have a shading language, so the user can ‘implement’ his own shader… so I have the following workflow:

Shading language => CUDA => PTX => …

But the shading language has a lot of built-in functions. The problem is that today I have to redo the whole workflow each time the user changes a single line in his shader!
It takes 12 minutes to compile with only a few simple shaders!

What I would like is a kind of CUDA dll :-P This way I put all the functions in the CUDA_DLL and use it in my kernel.

BTW, I know about the CUDA cache, and it works very well… but it is not really a solution because I have to rebuild everything in a lot of cases!
For now I’m working with OpenCL, so rewriting everything in CUDA requires a lot of work, and it is even more work to generate PTX than CUDA!
I’m not sure I have the time and budget for this!

Read tera’s comment again and follow his advice. Once you have learned how to do it you’ll realize it’s really fast and efficient.

If the GPU’s page table remains unmodified across different kernels in the same context (i.e. the memory permissions stay the same), doing dynamic linking on the GPU should be as straightforward as on the CPU. But I would never recommend anybody go that way unless even PTX compilation takes unacceptably long, because function calling on the GPU is far more complicated than on x86 CPUs, due to the large number of registers that must be taken care of when transferring control from one piece of code to another.

NVIDIA claims ptxas can compile 10,000 lines of code per second (on a specified platform, of course), which should be enough unless you are writing a GPU operating system :pinch:

Funny, it reminds me of the idea of emulating a full 3D GPU, including shaders, using CUDA kernels, so as to offer a full 3D GPU to a VM in an OS-independent way (say, a Windows VM using DirectX, supported on a Linux host, transparently, with a “CUDA-emulated” 3D driver in the VM’s OS that is executed on the host, without 3D-API translation/emulation).
One point of interest is being able to share a 3D card between many VMs running 3D applications, using concurrent kernel execution in CUDA 4.0.