I’m investigating improving our GPU computing prototyping framework by adding CUDA and OpenCL support (currently it only supports GLSL).
A nice feature of GLSL shaders, and of OpenCL too, is that the source of the .glsl and .cl files can be read in directly and compiled at execution time. That makes it very easy to make a change, restart the app, try again, and so on, without recompiling the whole application every time. The problem is CUDA kernels.
To get anything resembling JIT run-time kernel loading, I need to use the CUDA driver API. Not really a problem, though, since it maps decently onto OpenCL host code. But as far as I am aware, it is not possible to JIT a .cu file; the SDK JIT sample only shows it for .ptx files. Am I missing anything?
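For reference, this is roughly what the driver-API path looks like once you already have PTX text in hand (a minimal sketch with error handling trimmed; the kernel name "boxfilter" is just a placeholder for whatever your kernel is called):

```cpp
#include <cuda.h>
#include <fstream>
#include <sstream>
#include <string>

// Minimal sketch: JIT-load a .ptx file through the CUDA driver API.
// "boxfilter" is a placeholder kernel name, not anything required.
CUfunction loadPtxKernel(const char* ptxPath)
{
    std::ifstream in(ptxPath);
    std::stringstream ss;
    ss << in.rdbuf();
    std::string ptx = ss.str();          // PTX is plain text

    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUmodule mod;
    cuModuleLoadDataEx(&mod, ptx.c_str(), 0, nullptr, nullptr); // JIT happens here

    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "boxfilter");
    return fn;
}
```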
The alternative would be to have the framework call nvcc at runtime, compile the kernel into a temporary directory, and then load the generated .ptx or .cubin.
Related: how effective is the JIT at optimizing CUDA kernels? Does it take shortcuts due to real-time demands, or is it equivalent to nvcc? Since drivers are released more often than CUDA Toolkits, could it even be better at optimizing? In short, when there is an option, is it preferable to load a cubin or to JIT a ptx?
Ah, perhaps I wasn’t completely clear. This is a compile-once framework used for prototyping. Usually we develop an algorithm in Matlab, then port it to GLSL in a prototyping framework, then write the actual app. It’s this framework I’d like to extend with CUDA and OpenCL support.
The program you actually compile is basically script-like: read in image x, copy it to the GPU, run a box filter, copy it back, display it or save it to file, roughly like the sketch below. It’s written in C++, but with most of the complex platform code hidden behind classes, making it easy to use for students who have never touched OpenGL or GLSL before. After that, most of the work happens in the kernels.
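Purely for illustration, such a script-style program might look something like this (every class and function name here is invented for the example, not the framework’s actual API):

```cpp
// Hypothetical prototyping script; all names below are invented
// for illustration and are not the framework's real API.
#include "ProtoFramework.h"

int main()
{
    proto::Image img = proto::loadImage("input.png");  // read in image x
    proto::GpuBuffer buf = proto::upload(img);         // copy to GPU
    proto::Kernel box("kernels/boxfilter.cl");         // could be .glsl/.cl/.cu
    box.run(buf);                                      // run boxfilter
    proto::Image out = proto::download(buf);           // copy back
    proto::display(out);                               // display or save to file
    return 0;
}
```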
You have several .cu/.cl/.glsl files in a subdirectory somewhere. The user tweaks them, runs the .exe, sees the results, tweaks them again, runs the .exe again, and so on. There’s no need to recompile the whole app every time you make a tweak; you could actually develop in Notepad if needed.
Now the problem: I can give GLSL a path to a shader source and it will JIT it for me without recompiling. I can give OpenCL a path to a kernel source and it will run-time JIT it for me without recompiling. I cannot, however, pass in a .cu file; only a .ptx file that has first been compiled with nvcc.
Now, I can hide the .cu-to-.ptx runtime compile. The app will need to know the location of nvcc.exe, which adds complexity, but it’s not impossible. What I’m asking is whether that is necessary: is there no way to JIT from a .cu, or does it have to be from a .ptx?
I don’t think you can JIT a .cu file (not that I’ve ever heard of, anyway). Look at it this way: a .cu file may or may not contain all the necessary code, which means the JIT compiler would have to parse it, locate and resolve its dependencies, etc. (just like nvcc). A .ptx file, on the other hand, contains only device code and doesn’t allow any other files to be referenced, so it is ‘self-sufficient’, if you can call it that.
Also, unless you need to get the absolute maximum performance from your hardware (i.e., 99% isn’t good enough), you should probably just compile to PTX and save yourself the trouble of writing a full-on compiler and dealing with the lowest-level architectural details. In the CUDA driver API you can call cuModuleLoadDataEx(), which lets you set various options; one of them is the ‘strength’ of the optimizations used by the JIT compiler. Note that the default level is the maximum; I’ve never tried anything lower, and I don’t know that it would save you much compilation time anyway.
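For example, the optimization level and an error-log buffer can be passed through the two option arrays (a sketch; scalar option values are cast into the void* slots, which is how the driver API expects them):

```cpp
#include <cuda.h>
#include <cstdio>

// Sketch: JIT a PTX image with an explicit optimization level and
// capture the JIT compiler's error log. ptxSource is assumed to
// hold the contents of a .ptx file.
CUmodule jitWithOptions(const char* ptxSource)
{
    char errorLog[8192] = {0};

    CUjit_option opts[] = {
        CU_JIT_OPTIMIZATION_LEVEL,           // 0..4; 4 is the default/maximum
        CU_JIT_ERROR_LOG_BUFFER,
        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
    };
    void* vals[] = {
        (void*)(size_t)4,                    // scalar options are cast into void*
        (void*)errorLog,
        (void*)(size_t)sizeof(errorLog),
    };

    CUmodule mod;
    CUresult rc = cuModuleLoadDataEx(&mod, ptxSource, 3, opts, vals);
    if (rc != CUDA_SUCCESS)
        std::fprintf(stderr, "JIT failed:\n%s\n", errorLog);
    return mod;
}
```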
NVIDIA’s OpenCL code, when JITed, merely gets converted to PTX and then loaded on the device. There’s no other magic out there. A similar JIT for CUDA code would be to convert it to PTX and then load that on the device.
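You can actually see this for yourself: on NVIDIA’s implementation, the “binary” that clGetProgramInfo() hands back for a built program is PTX text (a sketch; assumes the program was built for exactly one device):

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Sketch: dump the "binary" NVIDIA's OpenCL produces for a built
// program. On NVIDIA's implementation this comes back as PTX text.
// Assumes `program` was built for exactly one device.
void dumpPtx(cl_program program)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, nullptr);

    std::vector<unsigned char> binary(size);
    unsigned char* ptr = binary.data();   // array of one pointer, one device
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptr), &ptr, nullptr);

    std::fwrite(binary.data(), 1, size, stdout);  // readable PTX
}
```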
The Kappa framework does a JIT on a .cu file, but under the hood it is just invoking nvcc followed by a driver API JIT compile of the ptx. It compares the timestamps of the .cu and .ptx files and only invokes nvcc when it needs to.
(A major reason that the Kappa framework provides a configuration value framework is to make invoking nvcc and equivalent functionality a sane proposition.)
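The timestamp check itself is simple; roughly like this (a sketch using std::filesystem, not Kappa’s actual code, and assuming nvcc is on PATH):

```cpp
#include <cstdlib>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Sketch of the scheme: only re-run nvcc when the .cu is newer than
// its cached .ptx. Not Kappa's actual code. Assumes nvcc is on PATH;
// real code should locate it explicitly (configuration value, etc.).
bool ensurePtxUpToDate(const fs::path& cu, const fs::path& ptx)
{
    if (!fs::exists(ptx) ||
        fs::last_write_time(cu) > fs::last_write_time(ptx))
    {
        std::string cmd = "nvcc -ptx \"" + cu.string() +
                          "\" -o \"" + ptx.string() + "\"";
        return std::system(cmd.c_str()) == 0;  // recompile needed
    }
    return true;  // cached .ptx is current
}
```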
I’ll go the ‘running nvcc under the hood’ route. I can see annoyances like presenting compiler output and errors to the user, handling changing paths, etc., but it shouldn’t be too bad.
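Capturing nvcc’s diagnostics for the user could be as simple as redirecting its stderr to a file (a sketch; cuModuleLoad accepts a path to either a .ptx or a .cubin, and nvcc is again assumed to be on PATH):

```cpp
#include <cuda.h>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Sketch: compile a .cu to .ptx at runtime, capture nvcc's stderr so
// it can be shown to the user, then load the result via the driver API.
bool compileAndLoad(const std::string& cuFile, CUmodule* mod)
{
    const std::string ptxFile = cuFile + ".ptx";
    const std::string logFile = cuFile + ".log";

    // "2>" redirection works in both cmd.exe and POSIX shells.
    std::string cmd = "nvcc -ptx \"" + cuFile + "\" -o \"" + ptxFile +
                      "\" 2> \"" + logFile + "\"";
    if (std::system(cmd.c_str()) != 0) {
        std::ifstream log(logFile);
        std::cerr << log.rdbuf();        // show compiler errors to the user
        return false;
    }
    return cuModuleLoad(mod, ptxFile.c_str()) == CUDA_SUCCESS;
}
```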
You might want to take a look at PyCUDA - it has an already working cache/JIT framework for CUDA code and hides all of the driver API tapdancing behind a straightforward, “pythonic” interface.
Or, if you are familiar with .NET, we’re releasing the beta of GPU.NET in a couple of weeks (see the links in my sig). It does JIT-compilation of .NET to device code and also handles all of your memory transfers, etc.