Using driver API to launch kernels: A way to avoid external cubins?

The runtime API makes it really easy to launch kernels with that <<<…>>> construct when those kernels are included in your .cu file. But I’m forced to use CUstream objects from the driver API, and these can’t be used with the <<<…>>> construct.

Is there a way to use the driver API to launch a kernel that’s right there in my .cu file without going through the hassle of loading it via an external cubin? Or does anyone know a way to re-package those CUstream objects as cudaStream_t (which driver_types.h says are ‘int’)?

If the runtime is simply masking this whole process, then how does nvcc nicely hide the .cubin code in the .o files? I’d rather not have to mess with external cubins.

We have one portion of our code that uses cubins created on the fly, so it requires the driver API to load cubins. We also have regular __global__ kernels throughout that we launch with the runtime.

In the latest iteration of our project, we have introduced multiple worker threads each with its own stream so workers can block waiting for results from each other.

Problem: the driver API uses CUstream while the runtime API uses cudaStream_t.

What do you suggest? It’s advised to stick to one API, so is there any way within the runtime to load a cubin created on the fly? Or is there any way from the driver API to run a __global__ kernel already compiled into the main program?

I’m using cuModuleLoadData() to load the contents of a .cubin file into the GPU context. There’s also cuModuleLoad(), which I believe accepts the name of a cubin file and loads it.

When using the runtime API, cubin files are embedded in the .obj files, so I guess you can find them there and use them with the driver API.
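For anyone looking for a starting point, here is a minimal sketch in plain C of the host-side half of that: reading a whole file into memory so the pointer can be handed to cuModuleLoadData(). The helper name read_image() is made up for illustration, and the buffer is NUL-terminated on the assumption that the image may be text (text cubins or PTX). The CUDA calls themselves are left out so the snippet builds without the toolkit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a heap buffer and append a
 * terminating NUL. Returns NULL on failure; the caller frees the result.
 * If out_size is non-NULL it receives the byte count (without the NUL). */
static char *read_image(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    /* Find the file length by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)len + 1);
    if (!buf) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) { free(buf); return NULL; }

    buf[len] = '\0';               /* harmless for binary images */
    if (out_size) *out_size = (size_t)len;
    return buf;
}
```

With cuda.h included and a context current on the calling thread, the returned pointer would go straight into cuModuleLoadData(&module, image), followed by cuModuleGetFunction(&func, module, "my_kernel"), where "my_kernel" stands in for whatever your kernel is called. The buffer can be freed once the module is loaded.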

Right now we are using cuModuleLoad() to load external .cubin files, and it works quite well. Has anyone had any success using cuModuleLoadData() to load from within the executable’s resources? Are there any examples out there of using cuModuleLoadData() at all?

And what’s the problem?

You need to use the FindResource() / LoadResource() / LockResource() functions and pass the pointer returned by LockResource() to cuModuleLoadData().

Thanks for the tips. Any idea where to start looking for similar functionality on Mac and Linux?

I use the following approach on Win32:
The .cubin file is compressed and converted to a C header file (.h) at compile time. At run time the program decompresses it and passes the resulting string to cuModuleLoadData().
Results:
a. Smaller executable size
b. The cubin is not present in the executable module as plaintext

The same approach can easily be applied on Linux or Mac, I think.
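To make the header idea concrete, here is a rough sketch in portable C of the "convert to a C header" step. It is not the poster's actual tool: the function name write_c_header() and the array name passed to it are made up, and the compression step is left out entirely, so with this version the cubin would still be visible in the executable.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: emit `data` as a C array definition named `name`,
 * followed by a length constant. Returns 0 on success, -1 on write error. */
static int write_c_header(FILE *out, const char *name,
                          const unsigned char *data, size_t len)
{
    fprintf(out, "static const unsigned char %s[] = {\n", name);
    for (size_t i = 0; i < len; ++i) {
        /* Twelve bytes per line, indented by four spaces. */
        fprintf(out, "%s0x%02x,", (i % 12 == 0) ? "    " : " ", data[i]);
        if (i % 12 == 11 || i + 1 == len) fputc('\n', out);
    }
    /* Extra NUL so the array can also be passed around as a C string. */
    fprintf(out, "    0x00\n};\n");
    fprintf(out, "static const size_t %s_len = %lu;\n",
            name, (unsigned long)len);
    return ferror(out) ? -1 : 0;
}
```

A small build-time tool would read the .cubin, call write_c_header(out, "kernel_cubin", bytes, size), and the application would #include the generated file and pass kernel_cubin to cuModuleLoadData(). Nothing here is Windows-specific, so the same generated header should work on Linux and Mac. A compress-before, decompress-at-startup step would slot in on either side of this.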

Brilliantly simple! I feel guilty that I didn’t think of that.