Generate CUDA at run-time ?

Polar_01 · September 14, 2011, 9:22am

Hi,

My application has two parts:

99% of the code is always the same. It is a “library of functions”.
1% of the code is generated at run time and really depends on the parameters (shading language).

So, I would like to know whether:
a) I can generate some CUDA code at run time, then compile it.
b) I can avoid including the “library of functions” and recompiling it each time, and simply link against it instead. For example, I would compile the library only once, save it as PTX or some other format, and later, when I have to compile the remaining 1%, tell CUDA to use this PTX.

Thanks

wlangdon · September 14, 2011, 1:21pm

Tony E Lewis had a paper at this year’s CIGPU which showed compiling PTX was rather faster than compiling CUDA C.
Bill

Polar_01 · September 14, 2011, 2:06pm

Thanks… but it does not really help me solve my problem (the library currently needs 12 minutes to compile). So I need a way to link my CUDA C code with the library (without including the library code).

wlangdon · September 14, 2011, 3:12pm

Hmm I guess I am misunderstanding.

You asked whether CUDA code can be generated at run time and then compiled:

Yes. I have created C code at run time (lots of printf) and compiled the code using nvcc.

Tony has shown he can generate PTX at run time and compile that. (Compilation of PTX is faster than compilation of C.)

nvcc seems to have a preferred code size. It may be you can speed up the compilation by breaking it into chunks and giving each chunk to nvcc one at a time. Chunks of about 2000 lines each seemed like a good compromise. I am using gcc and Linux. gcc (i.e. the Unix linker) took very little time to link ten files (plus my host code) into one executable.

[Ages ago nvcc was compiling 329 C lines per second but this will be highly variable.]
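
The "lots of printf" approach could look roughly like the sketch below. It is only an illustration: the helper name generate_kernel_source, its parameters, and the kernel signature it emits are assumptions made for this example, not code from the post.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): writes the CUDA C source of one
 * kernel into buf. kernel_name and body_expr are illustrative parameters;
 * body_expr is an expression in terms of the float variable x.
 * Returns the number of characters written, or -1 if buf is too small. */
int generate_kernel_source(char *buf, size_t cap,
                           const char *kernel_name, const char *body_expr)
{
    int n = snprintf(buf, cap,
        "extern \"C\" __global__ void %s(const float *in, float *out, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) {\n"
        "        float x = in[i];\n"
        "        out[i] = %s;\n"
        "    }\n"
        "}\n",
        kernel_name, body_expr);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

The resulting text would then be written to a .cu file and handed to nvcc, for example with nvcc -c to get an object file or nvcc -ptx to get PTX.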

tera · September 14, 2011, 3:23pm

I think the problem here is that CUDA has no linker on the device side, so the broken-into-chunks code cannot be linked again.

You should be able to work around this by compiling to PTX, then prepending your previously compiled PTX “library” and using inline PTX assembly to call the library routines.
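
A minimal host-side sketch of the "prepend the library PTX" step might look like this. It assumes that both PTX texts start with their own module header directives (.version, .target, .address_size), that those headers are compatible, and that the library text ends with a newline. The helper name combine_ptx is hypothetical, and the inline PTX call into the library is not shown.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the text at p begins with the given prefix. */
static int starts_with(const char *p, const char *prefix)
{
    return strncmp(p, prefix, strlen(prefix)) == 0;
}

/* Hypothetical helper (an assumption, not an NVIDIA API): writes lib_ptx
 * followed by gen_ptx into out, dropping the module header directives of
 * gen_ptx so that they appear only once in the combined module.
 * Returns the combined length, or -1 if out is too small. */
long combine_ptx(const char *lib_ptx, const char *gen_ptx,
                 char *out, size_t cap)
{
    size_t len = strlen(lib_ptx);
    if (len + 1 > cap)
        return -1;
    memcpy(out, lib_ptx, len);

    const char *p = gen_ptx;
    while (*p) {
        /* Find the end of the current line (including its newline). */
        const char *eol = strchr(p, '\n');
        size_t n = eol ? (size_t)(eol - p) + 1 : strlen(p);

        int header = starts_with(p, ".version") ||
                     starts_with(p, ".target") ||
                     starts_with(p, ".address_size");
        if (!header) {
            if (len + n + 1 > cap)
                return -1;
            memcpy(out + len, p, n);
            len += n;
        }
        p += n;
    }
    out[len] = '\0';
    return (long)len;
}
```

Only the generated part has to be produced again when the user code changes; the library PTX text is reused as-is, although the combined module still has to go through PTX compilation.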

wlangdon · September 14, 2011, 4:15pm

The unix linker used to be quite happy to link together multiple .o files created by nvcc. It seems to have no problem with make lines like:

/tmp/gp.exe: test.o testgpu.o gpu.o \
        /tmp/pop_0.o /tmp/pop_1.o /tmp/pop_2.o /tmp/pop_3.o /tmp/pop_4.o \
        /tmp/pop_5.o /tmp/pop_6.o /tmp/pop_7.o /tmp/pop_8.o /tmp/pop_9.o
        $(CC) $(CFLAGS) -o /tmp/gp.exe test.o testgpu.o gpu.o \
        /tmp/pop_0.o /tmp/pop_1.o /tmp/pop_2.o /tmp/pop_3.o /tmp/pop_4.o \
        /tmp/pop_5.o /tmp/pop_6.o /tmp/pop_7.o /tmp/pop_8.o /tmp/pop_9.o \
        $(NVFLAGS) -lm

where /tmp/pop_0.o etc are created by nvcc

tera · September 14, 2011, 4:55pm

Yes, but that links the host code, not the device code. Or have you found a way to call a device function from a kernel in a different file?

wlangdon · September 14, 2011, 6:08pm

Ok. I think I understand your point.

nvcc inlines all the device code when it compiles it (hence there are no functions which might have been called by kernel code compiled separately). I do not know whether NVIDIA is planning to change this. Would passing function pointers allow device functions to be called externally?

In my case each separately compiled chunk of code is a different kernel.

Polar_01 · September 15, 2011, 6:29am

Thanks for all your answers,

Yes, the problem is that I need a “DEVICE LINKER” to link different PTX code for the device (not the host).

Is there a solution ?

melonakos · September 23, 2011, 2:37am

In the beginning, Jacket compiled CUDA on-the-fly. It was cumbersome, but was good enough to provide speedups. Several years ago, we switched to emitting PTX in both Jacket and LibJacket and the compile time is such a small fraction of the total wall time for most problems that it is as good as well-written statically compiled code.

James_Malcolm1 · September 23, 2011, 3:02pm

I’m also a developer at AccelerEyes working with Jacket and LibJacket. Since most programs are running roughly the same code and data types, we typically see the JIT using cached versions of instruction sequences. Only in the first few iterations does it recompile anything, and then it settles into a steady state, pulling from cache. So the compile time disappears completely in Jacket and LibJacket.
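
The caching idea can be sketched in a few lines of host code. This is not Jacket's implementation, which is not described here. It is a generic compile-once cache keyed by the generated source text, and every name in it (jit_cache, jit_cache_get, stub_compile) is made up for illustration. stub_compile stands in for whatever really compiles the PTX or CUDA source.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 64

typedef void *(*compile_fn)(const char *source);

typedef struct {
    uint64_t hash;    /* FNV-1a hash of the source text */
    char    *source;  /* private copy, used to rule out hash collisions */
    void    *module;  /* handle returned by the compiler */
    int      used;
} cache_entry;

/* Must be zero-initialized before first use, e.g. jit_cache c = {0}; */
typedef struct { cache_entry slots[CACHE_SLOTS]; } jit_cache;

/* 64-bit FNV-1a hash of a NUL-terminated string. */
static uint64_t fnv1a(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

/* Returns the cached module for source, compiling it only on a miss. */
void *jit_cache_get(jit_cache *c, const char *source, compile_fn compile)
{
    uint64_t h = fnv1a(source);
    for (size_t probe = 0; probe < CACHE_SLOTS; ++probe) {
        cache_entry *e = &c->slots[(h + probe) % CACHE_SLOTS];
        if (!e->used) {                       /* miss: compile and remember */
            char *copy = malloc(strlen(source) + 1);
            if (!copy) return compile(source);
            strcpy(copy, source);
            e->hash = h; e->source = copy;
            e->module = compile(source); e->used = 1;
            return e->module;
        }
        if (e->hash == h && strcmp(e->source, source) == 0)
            return e->module;                 /* hit: no compilation */
    }
    return compile(source);                   /* cache full: do not cache */
}

/* Stand-in for the real compiler; it only counts how often it is called. */
static int compile_calls = 0;
static void *stub_compile(const char *source)
{
    (void)source;
    ++compile_calls;
    return &compile_calls;                    /* any non-NULL handle */
}
```

Requesting the same source twice through jit_cache_get calls the compiler only once, which is the steady state described above. A cache like this does not help when the source changes on every request.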

Polar_01 · September 26, 2011, 1:03pm

Thanks for the clarification.

Imagine that I have a shading language, so users can ‘implement’ their own shaders. I have the following workflow:

Shading language => CUDA => PTX => …

But the shading language has a lot of built-in functions. The problem is that today I have to redo the whole workflow each time the user changes a single line in a shader!
It takes 12 minutes to compile with only a few simple shaders!

What I would like is a kind of CUDA DLL :-P That way I could put all the functions in the CUDA_DLL and use it in my kernel.

BTW, I know about the CUDA cache, and it works very well… but it is not really a solution because I have to rebuild everything in a lot of cases!
For now I’m working with OpenCL, so rewriting everything in CUDA would require a lot of work, and generating PTX is more work than generating CUDA!
I’m not sure I have the time and budget for this!

hyqneuron · September 27, 2011, 1:08pm

Read tera’s comment again and follow his advice. Once you have learned how to do it you’ll realize it’s really fast and efficient.

If the GPU’s page table remains unmodified for different kernels in the same context (memory permissions remain the same), doing dynamic linking on the GPU should be as straightforward as on the CPU. But I would never recommend that anybody go that way unless even PTX compilation takes unacceptably long, because function calling on the GPU is far more complicated than on x86 CPUs, due to the large number of registers that need to be taken care of when you transfer control from one piece of code to another.

NVIDIA claims ptxas can compile 10,000 lines of code per second (on a specified platform, of course), which should be enough unless you are writing a GPU operating system.

parallelis · September 28, 2011, 7:34pm

Funny, this reminds me of the idea of emulating a full 3D GPU, including shaders, using CUDA kernels, in order to offer a full 3D GPU to VMs in an OS-independent way (say, a Windows VM using DirectX hosted on a Linux system, transparently, with a “cuda-emulated” 3D driver in the VM OS that is executed on the host, without 3D API translation/emulation).
One of the benefits would be the ability to share a 3D card between many VMs running 3D applications, using the concurrent execution available in CUDA 4.0.
