OpenCL Offline Compilation

Hi everyone–quick question…

Because I am having an issue with what appears to be host overhead when executing an application, I am wondering if it is because I am using Just-In-Time (JIT) compilation of an intermediate representation, instead of offline (or back-end) compilation. Unfortunately, I’m not sure how offline compilation is implemented. I see several comments in NVIDIA literature about it, but no specifics on how to do it. Does clEnqueueNativeKernel() or clCreateProgramWithBinary() have anything to do with this? I understand that NVIDIA may not currently support this feature, but I would like to know (at least at a high level) how this is done, as I assume it will be implemented in the future.

Thanks in advance for any help with this.

Are you building your CL programs from source, or do you load precompiled binaries?

Right now I compile from source, but I could use pre-compiled binaries if that helps host overhead by compiling completely offline (rather than during runtime). I saw some documentation that said, in effect, that the binaries can be cached, and that this avoids having to compile the source code for every subsequent kernel launch. I am confused as to what this actually means, though. Whether I compile from source or from binary, it would seem that any additional compilation for a subsequent kernel call in the same application would not be necessary. Once the JIT compiler does its job the first time, would it still need to re-compile for a subsequent kernel call (in the same application)?

What I need is to compile the code completely, once (not JIT; a full compilation to a binary executable by the device), either online (at runtime) or offline, and not incur the time penalty of any further compilation for the life of the application. Can this be done?

Thanks

I wanted to add this as part of my question for clarification (hopefully)…

From the OpenCL Programming Guide, OpenCL kernels are compiled to PTX code. At runtime, these kernels would have to be further compiled to device-ready binary code. The clCreateProgramWithBinary() call seems to apply only to the PTX code (according to the Programming Guide). However, the OpenCL JumpStart Guide suggests otherwise.

The sentence I am trying to understand is “it is possible in principle for a developer to recreate the tools for a workflow like the CUDA one, where a separate compile (implemented based on the OpenCL library) is used to compile binaries which the application loads during runtime”. This would seem to indicate that an OpenCL kernel could be compiled to “device-ready binary code”, akin to a cubin file, offline (before runtime). What is not clear from the NVIDIA documentation, or anything I can find on the Internet, is how to accomplish this. My particular application requires real-time performance, so I am thinking I can increase performance by completely compiling a kernel offline (instead of the normal JIT method for OpenCL).

Not sure if this is applicable, but the time required for the first compilation is of no concern; I am just trying to avoid any compilation to “device-ready binary code” on subsequent kernel launches in the same application.

Thanks for any help with this…

Well, I’m not sure about this, but I believe the closest thing to a native binary you can get through OpenCL’s API is PTX. The true binary is always hidden (unlike in CUDA with cubin, though with the advent of Fermi even CUDA is shifting its focus to the more portable PTX). At least, that’s my impression.

The good news is that the lengthy part should be the compilation from OpenCL C to PTX. Generating native assembly from PTX should be really fast.

So you should do it like this (a rough code sketch follows the list):

  1. Create and build your program from C sources (clCreateProgramWithSource, then clBuildProgram).
  2. Get the binaries (clGetProgramInfo with CL_PROGRAM_BINARY_SIZES, then CL_PROGRAM_BINARIES); this should get you either PTX or a cubin. I suspect the former, but I don’t know for sure (“implementation defined”).
  3. Save those binaries to a file.
  4. In future calls, read this file, create and build the program from binaries (clCreateProgramWithBinary, then clBuildProgram).
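In code, something like this; just a sketch under my assumptions (a single device, most error checking omitted, plain stdio for the cache file, and the function names are mine, not part of any API):

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* First run: build from source, then pull the binary out and cache it.
 * With a single device, CL_PROGRAM_BINARIES expects an array of one
 * pointer, which is why &bin is passed with sizeof(bin). */
static cl_program build_and_cache(cl_context ctx, cl_device_id dev,
                                  const char *src, const char *cache_path)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);

    size_t size;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *bin = malloc(size);
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);

    FILE *f = fopen(cache_path, "wb");
    fwrite(bin, 1, size, f);
    fclose(f);
    free(bin);
    return prog;
}

/* Later runs: recreate the program from the cached binary (most likely
 * PTX on NVIDIA). Note that clBuildProgram is still required after
 * clCreateProgramWithBinary; that second build is where the (fast)
 * PTX-to-native step would happen. */
static cl_program load_from_cache(cl_context ctx, cl_device_id dev,
                                  const char *cache_path)
{
    FILE *f = fopen(cache_path, "rb");
    fseek(f, 0, SEEK_END);
    size_t size = (size_t)ftell(f);
    rewind(f);

    unsigned char *bin = malloc(size);
    fread(bin, 1, size, f);
    fclose(f);

    cl_int status, err;
    cl_program prog = clCreateProgramWithBinary(
        ctx, 1, &dev, &size, (const unsigned char **)&bin, &status, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    free(bin);
    return prog;
}
```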

Also, for real-time operation, this is not an issue at all. Once the program is built and the kernels have been created/extracted from it, you’re as ready as you’ll ever be. You will not rebuild (or even JIT the PTX to a cubin) at each kernel launch. JIT happens only during init; after that you have true device code, even if the OpenCL API won’t hand it to you. The only possible overhead you may be seeing is copying the raw device instructions to the GPU, which is unavoidable and not related to JIT (you get the same thing in CUDA).
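To make that concrete, here is roughly what the init-once / launch-many structure looks like. This is only a sketch; the kernel name “my_kernel”, its single buffer argument, the work size, and the frame count are all made-up placeholders:

```c
#include <CL/cl.h>

/* Sketch: all compilation cost is paid once, up front; the per-frame
 * loop only enqueues the already-built kernel. */
static void run(cl_context ctx, cl_command_queue queue, cl_device_id dev,
                const char *src, cl_mem buf, int num_frames)
{
    cl_int err;

    /* Init, done once: the JIT (source -> PTX -> device code) happens here. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "my_kernel", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);

    /* Steady state: no recompilation here, just launches. */
    size_t global = 1024;
    for (int frame = 0; frame < num_frames; ++frame)
        clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    clFinish(queue);

    clReleaseKernel(k);
    clReleaseProgram(prog);
}
```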

@Big_Mac

Thanks very much for the reply; a lot of good information there. If I could ask you one more question about JIT in general: you said that JIT was not a problem for real-time operations, since the JIT only happens once. I am wondering where the resulting binary is stored by the OpenCL driver. I assume it is in host memory, but if it is stored in pageable memory, the binaries may get moved to disk (especially if you have a bunch of unique kernels in your application and not much host memory). If they do get moved to disk, wouldn’t it be faster for the driver to JIT the PTX-type code again vs. the disk access time? Thanks in advance for your help on this; I really want to understand how this works!!

I presume the driver might store compiled kernels in the host’s pageable memory, but the risk of your system swapping them out is probably very small:

  • The code, even for dozens of kernels, will likely be very small compared to your buffers (unless you’re using really tiny buffers, which is obviously inefficient, and have just megabytes of host memory).
  • If the driver uses those binaries often (e.g. copying them to the device before launching each kernel), the OS will likely strive to keep them in RAM. If the OS figures they are so seldom used that it’s safe to move them to swap, you shouldn’t worry either.

I’m pretty sure that OS swapping your kernels’ binaries is the last performance issue you should be worried about right now. I’ve never heard about anyone having problems with this.

Plus, it’s quite possible that NVIDIA drivers use non-pageable, mapped memory to store code. While I’m not really qualified to say what modern OSes do with mapped memory (I hear Windows Vista/7’s WDDM can even swap GPU memory!), I’m pretty sure you shouldn’t lose sleep over it.

@Big_Mac

Once again, thanks for all the great info/insight!!!