JIT .cu

I’m investigating improvements to our GPU computing prototyping framework: adding CUDA and OpenCL support to it (currently it only supports GLSL).

A nice feature of GLSL shaders, and of OpenCL as well, is that the runtime can read the source of .glsl and .cl files directly and compile it at execution time. That makes it very easy to make a change, restart the app, try again, and so on, without recompiling the whole application every time. The problem is CUDA kernels.

In order to get anything resembling JIT run-time kernel loading, I need to use the CUDA driver API. That’s not really a problem, since it maps reasonably well to OpenCL host code. But as far as I’m aware, it is not possible to JIT a .cu file; the SDK JIT sample only demonstrates it for .ptx files. Am I missing something?
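For concreteness, here’s roughly what I mean by the driver API route (a minimal sketch with error checking omitted; “boxfilter.ptx” and the kernel name are just placeholders):

```cpp
#include <cuda.h>

// Minimal sketch: load a pre-built PTX module with the driver API.
// PTX is JIT-compiled for the current device when the module loads.
int main() {
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;
    cuModuleLoad(&mod, "boxfilter.ptx");  // accepts .ptx or .cubin paths

    CUfunction kernel;
    cuModuleGetFunction(&kernel, mod, "boxfilter");

    // ... allocate device memory, set parameters, launch, copy back ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```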

The alternative would be to have the framework call nvcc at runtime, compile to a temporary directory, and then load the generated .ptx or .cubin.
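Something like this, presumably (a rough sketch; it assumes nvcc is on the PATH, which is exactly the kind of configuration burden I’d rather avoid):

```cpp
#include <cstdlib>
#include <string>

// Sketch of the "call nvcc at runtime" fallback: compile a .cu file
// to PTX in a temporary location, then load the PTX via the driver API.
bool compileCuToPtx(const std::string& cuPath, const std::string& ptxPath) {
    std::string cmd = "nvcc -ptx \"" + cuPath + "\" -o \"" + ptxPath + "\"";
    return std::system(cmd.c_str()) == 0;  // nonzero exit means compilation failed
}
```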

Related: how effective is the JIT at optimizing CUDA kernels? Does it take shortcuts due to real-time demands, or is it equivalent to nvcc? Since drivers get released more often than CUDA Toolkits, could it even be better at optimizing? In short, when there is a choice, is it preferred to load a cubin or to JIT a .ptx?

Consider invoking “nvcc -ptx” at compile time to convert the CUDA source to PTX, then load it using the CUDA driver API.

OpenCL probably does a similar thing. There’s nothing NVIDIA’s OpenCL implementation can do that CUDA cannot.

Ah, perhaps I wasn’t completely clear. This is a compile-once framework used for prototyping. Usually we develop an algorithm in Matlab, then port it to GLSL in a prototyping framework, then write the actual application. I’d like the framework to support CUDA and OpenCL as well.

The program you actually compile is basically script-like: read in image x, copy it to the GPU, run a box filter, copy it back, display it or save it to a file. It’s written in C++, but with most of the complex platform code hidden behind classes, making it easy to use for students who have never touched OpenGL or GLSL before. After that, most of the work is on the kernels.

You have several .cu/.cl/.glsl files in a subdirectory somewhere. The user tweaks them, runs the .exe, sees the results, tweaks them again, and so on. There’s no need to recompile the whole app every time you make a tweak; you could even develop in Notepad if needed.

Now the problem is this: I can give GLSL a path to a shader source and it will JIT it for me without recompiling. I can give OpenCL a path to a kernel source and it will JIT it at run time without recompiling. I cannot, however, pass CUDA a .cu file, only a .ptx file that first has to be compiled with nvcc.

Now, I can hide the .cu-to-.ptx compilation at runtime. The app would need to know the location of nvcc.exe, which adds complexity, but it’s not impossible. What I’m asking is whether that is necessary: is there no way to JIT from a .cu, or does it have to be from a .ptx?

I don’t think you can JIT a .cu file (not that I’ve ever heard of, anyway). Look at it this way: a .cu file may or may not contain all the necessary code, which means the JIT compiler would have to parse it, locate any dependencies, etc. (just like nvcc); on the other hand, a .ptx file contains only device code and doesn’t allow any other files to be referenced, so it is ‘self-sufficient’ (if you can call it that).

Also, unless you need to get the absolute maximum performance from your hardware (i.e., 99% isn’t good enough), you should probably just compile to PTX and save yourself the trouble of writing a full-on compiler and dealing with the lowest-level architectural details. In the CUDA driver API, you can call cuModuleLoadDataEx(), which lets you set various options, one of which is the ‘strength’ of the optimizations used by the JIT compiler. Note that the default level is ‘maximum’; I’ve never tried anything lower, and I don’t know that it would save you much compilation time.
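Roughly like this (a sketch in the spirit of the SDK’s PTX JIT sample; it assumes ptxSource holds the contents of a .ptx file and skips most error handling):

```cpp
#include <cuda.h>

// Sketch: JIT a PTX image with an explicit optimization level and an
// error log buffer so compiler messages can be shown on failure.
CUmodule jitPtx(const char* ptxSource) {
    static char errorLog[8192];
    int optLevel = 4;  // 0-4; 4 (maximum) is the default

    CUjit_option options[] = {
        CU_JIT_OPTIMIZATION_LEVEL,
        CU_JIT_ERROR_LOG_BUFFER,
        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
    };
    void* values[] = {
        (void*)(size_t)optLevel,
        (void*)errorLog,
        (void*)(size_t)sizeof(errorLog),
    };

    CUmodule mod = nullptr;
    if (cuModuleLoadDataEx(&mod, ptxSource, 3, options, values) != CUDA_SUCCESS) {
        // errorLog now holds the JIT compiler's diagnostics
    }
    return mod;
}
```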

@Antagonistic,

When NVIDIA’s OpenCL code is JITed, it merely gets converted to PTX and then loaded on the device.
There’s no other magic out there…
A similar JIT for CUDA code would be to convert it to PTX and then load it on the device.

That’s all…

The Kappa framework does JIT a .cu file, but under the hood it is just invoking nvcc followed by a driver API JIT compile of the PTX. It compares the timestamps of the .cu and .ptx files and only invokes nvcc when it needs to.
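The timestamp logic amounts to something like this (just a sketch of the idea, not Kappa’s actual code):

```cpp
#include <filesystem>
namespace fs = std::filesystem;

// Re-invoke nvcc only when the .cu source is newer than its cached .ptx.
bool ptxIsStale(const fs::path& cuFile, const fs::path& ptxFile) {
    return !fs::exists(ptxFile) ||
           fs::last_write_time(cuFile) > fs::last_write_time(ptxFile);
}
```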

(A major reason that the Kappa framework provides a configuration value framework is to make invoking nvcc and equivalent functionality a sane proposition.)

Thanks guys, this is what I wanted to know.

I’ll go the ‘running nvcc under the hood’ route. I can see annoyances like presenting compiler output and errors to the user, handling paths, etc., but it shouldn’t be too bad.
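For surfacing compiler errors, capturing the command’s combined output looks workable (a sketch; POSIX popen shown, _popen/_pclose on Windows):

```cpp
#include <cstdio>
#include <string>

// Run a compile command and capture its stdout and stderr so the
// framework can show nvcc's diagnostics to the user on failure.
std::string runCompiler(const std::string& cmd) {
    std::string output;
    FILE* pipe = popen((cmd + " 2>&1").c_str(), "r");
    if (!pipe) return output;
    char buf[256];
    while (fgets(buf, sizeof(buf), pipe))
        output += buf;
    pclose(pipe);
    return output;
}
```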

You might want to take a look at PyCUDA; it already has a working cache/JIT framework for CUDA code and hides all of the driver API tap-dancing behind a straightforward, “Pythonic” interface.

Or, if you are familiar with .NET, we’re releasing the beta of GPU.NET in a couple of weeks (see the links in my sig). It does JIT-compilation of .NET to device code and also handles all of your memory transfers, etc.
