Calling external kernel from CUDA

Imagine I make a CUDA-based renderer. I pre-compile all.
Now imagine I let the user to make his own “shaders”. Can I compile and call dynamically the user “shader” from my pre-compiled code?

Yes cubins could be dynamically loaded. Check out cuModuleLoad() and the related driver-api functions.

Thx, hyq.

just a doubt… once I call cuModuleLoad to load the user-defined function and code… how I must define the “extern” function from my kernel code, pls?

For example:

The user functions I want to call are

extern int userDefFunc1 ( const int a, const int b ); //should I use "extern", shouldn't I?

class IInterface



      virtual int userDefFunc2 ( const int a, const int b ) = 0;


and I’ll call the user-defined function from my pre-compiled code as:

__device__ __kernel void MyKernel ( IInterface *g_pUserDef ) //can I pass as kernel param the user-defined C++ interface pointer using a CUfunction handle?


   int i, val=0;

for ( i=0; i<100; ++i )


      val += userDefFunc1(i,i+1);

      val += l_pUserDef(i*2,5);




I’ll show you some code I often use

muRC is an error checking function

CUdeviceptr gpu_output;

	CUdeviceptr sharedaddress;

	CUcontext context;

	CUdevice device;

	muRC(0, cuInit(0));

	muRC(1, cuDeviceGet(&device,0));

	muRC(2, cuCtxCreate(&context, CU_CTX_SCHED_SPIN, device));

	muRC(3, cuMemAlloc(&gpu_output, 32*4));

	muRC(4, cuMemAlloc(&sharedaddress, 4));

	CUmodule module;

	CUfunction kernel1, kernel2;

	muRC(10, cuModuleLoad(&module, "main.sm_21.cubin"));

	muRC(11, cuModuleGetFunction(&kernel1, module, "kernel1"));

	muRC(12, cuModuleGetFunction(&kernel2, module, "kernel2"));

	muRC(20, cuParamSetSize(kernel1, 4));

	muRC(21, cuParamSetv(kernel1, 0, &sharedaddress, 4)); //this way of setting kernel parameter is being deprecated, though it's still what I prefer

	muRC(22, cuFuncSetBlockShape(kernel1, 1,1,1));

	muRC(33, cuLaunch(kernel1));

	muRC(30, cuParamSetSize(kernel2, 8));

	muRC(31, cuParamSetv(kernel2, 0, &sharedaddress, 4));

	muRC(32, cuParamSetv(kernel2, 4, &gpu_output, 4));

	muRC(33, cuFuncSetBlockShape(kernel2, 1,1,1));

	muRC(34, cuLaunch(kernel2));

	muRC(50, cuMemFree(gpu_output));

	muRC(51, cuMemFree(sharedaddress));

	muRC(52, cuModuleUnload(module));

	muRC(53, cuCtxDestroy(context));

So that extern thing is completely unnecessary. You just need a CUfunction.

Thanks for the code example, hyq.
Yep, that’s a good way to call an externally-defined KERNEL. The problem is to call an external function from my pre-compiled kernel, which is sightly different. That’s the sense of “extern” and the virtual pure interface.
I’m not sure if this is possible in CUDA. How to define a C extern function and to pass a virtual interface pointer to the kernel args so I can an externally-used-defined function/interface, pls?

B: the user-defined kernel
A: the main kernel written by you that calls B
Put the source code of A and B into the same source file and compile into cubin. Then load the cubin and launch A.A must be declared as global, while B could be declared as either global or device, depending on your need.

Nope, because my kernel is and must be precompiled for IP reasons :pirate: ( and also to save compiling time). , If not I would use OpenCL, that’s one of the reasons why we use CUDA :tongue:

I don’t know what the proper way to call an external kernel inside another kernel is… (maybe someone else could help?) I haven’t checked exactly how device functions in CC2.x are handled at opcode level. I guess it still should involve a bit of hacking, though it certainly should be easier than the following method:

You could to manually merge 2 cubins.

CubinA: contains your pre-compiled kernel A and an empty kernel. This cubin is “pre-compiled”. Kernel A always calls the empty kernel. The empty kernel should be better placed at the end of the object file. But of course, an empty kernel is not easily generated since ptxas tries very hard to outsmart us. You could just create any redundant kernel here.

CubinB: contains user-defined kernel. This cubin is compiled by ptxas(or nvcc) upon the user’s request.

You’ll replace the redundant kernel in CubinA with the kernel in CubinB after the compilation of CubinB. You’ll need good knowledge of the cubin format (it’s just elf) to do that. Based on my experience, this combination should not be hard.

Then you can just load the modified CubinA and launch kernel A.

However, a still easier way is just to keep the PTX code of your source code. The cubins, once disassembled using cuobjdump, becomes no more complicated than the optimized version of the source ptx code.