Passing function pointers between kernels

Hello guys!

I’m trying to put together a program that relies on GPU computing, and we need to be able to extend it with external DLLs / PTX (cubin) files. Compiling everything into a single executable / kernel is a nightmare and makes development extremely slow (slow compile times, I can’t compile specific modules, etc.)…

So all I want to do is define some behavior in a C++ file, compile it to a PTX / cubin file, and load that code into my main kernel. (If possible, I don’t want to generate PTX code myself; I’m not familiar with code generation…)

So the basic structure looks like this

file1.ptx

  • someFunction

file2.ptx

  • someKernel

someKernel calls someFunction from file1.ptx. I have tried to extract the function pointer to someFunction (it is actually 128) and pass it to someKernel using the driver API, but that doesn’t work.

So does that mean my function pointers are only valid within the kernel execution that produced them?

If yes, is there some easy way to merge these two PTX files using the driver API functions, so that I could use someFunction in someKernel? (It shouldn’t take long, since instant or nearly instant code loading is important… nvcc works great, but with highly complex code the compilation is very slow.)

If not, what happens if I manually merge the two PTX files by copying the someKernel code into the other PTX file, and get the function pointers from that module? I am afraid that would screw up my registers, right? (I’m using a lot of recursive calls, so the functions have a lot of stack available; can they manage that automatically?)

Cheers,
Pal.

Anyone?

Up again.

I can’t believe that no one has wanted to use DLLs or external PTX files to extend their kernels.

A few people desire to do this, but apparently nobody figured out how to circumvent the memory protection that prevents this from working.

Can you please explain what that memory protection is? (Do you mean I can’t access the memory where the functions are defined?)

Just for clarification, is your someFunction in file1.ptx a regular cpp function, or is it a cuda kernel? i.e. are you trying to call a regular, host-based function in a ptx file from a kernel in a separate ptx file?

It is a device function, a simple function that does something with my data. It is like extending a closed-source application with your own C++ DLLs, but with the application being a complex CUDA kernel and the DLL a device function.

Or something like what OptiX does: you are able to extend the OptiX kernel with your own shaders, geometry objects, etc… But if possible, I want to avoid taking their path (recreating the whole kernel on the fly). I have been able to do something like this with simple function pointers, and it works great, but external linkage is not supported in kernels (though I can understand why), and it seems that function pointers are only valid for the actual kernel.

I haven’t been able to use a kernel that stores a function pointer in memory and then call that pointer from a different kernel. Of course I’m talking about PTX files and the driver API; it works with the runtime API, but then all of the kernels need to be in the same .cu file… (I have tried using multiple .cu files and passing different pointers between them, but it didn’t work.)

Device functions don’t currently get external symbols in the host object files emitted by the CUDA toolchain, so there is currently no way to do what you are asking for.

And what if I parse multiple PTX files together? (I mean doing a much more lightweight PTX generation than OptiX.)

PTX 2.x supports indirect calls via pointer, so it might be at least theoretically possible, but it would require extremely careful design. You would probably have to use inline PTX in your CUDA code to implement the function call, because I very much doubt the compiler could be persuaded to generate indirect calls via anonymous pointers. Even then I would be highly skeptical that it would work without completely turning inline compilation off in the compiler, which could have pretty major performance implications.

Please post your working proof of concept when you get it going so we can have a look at it…

Thank you for the tips! I don’t think using inline PTX would be that complicated, but turning off inline calls is pretty harsh… Can I do that with pragmas for only some parts of the code?

To the best of my knowledge, inline calls are all or nothing, and only controllable by the -Xopencc="-INLINE:=off" option to nvopencc. So you either have inline calls, or you don’t. But it might still be possible to have inline function expansion and use inline PTX for the calls where you will (somehow) provide the anonymous function pointer after the PTX has been generated. This is all really at the outer edge of what is documented and how things really work, so best of luck with it.

So here is an update on what I found out (sorry for the late reply, I have had a lot of other tasks recently).

At the moment I’m compiling the code into multiple PTX files, parsing them together with the main kernel (in the external PTX files I use a small kernel to pass the function pointers at program init), and passing the result to the JIT. It works, though it is not the best solution… It works for simple cases, but in more complex ones I do need to modify some parts of the code (renaming constants and so on…). This way I don’t need to write my own PTX generation, and I can rely mostly on nvcc.
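
For anyone who wants to try the same thing, here is a minimal sketch of the "merge the PTX text, then hand it to the JIT" step. It is not my exact code: the names mergePtx and isModuleHeader are made up for this post, and it assumes that every PTX file starts with its own .version / .target / .address_size header and that only the first header should survive in the merged module. It does not rename colliding constants or symbols, so that part still has to be done by hand. It also assumes the sources are listed so that a function appears before the kernel that calls it.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Returns true if `line` starts (after leading spaces/tabs) with one of the
// PTX module-header directives. Assumption: these are the only lines that
// must not be repeated when several PTX files are pasted together.
static bool isModuleHeader(const std::string& line) {
    static const char* const kHeaders[] = {".version", ".target", ".address_size"};
    const std::size_t start = line.find_first_not_of(" \t");
    if (start == std::string::npos) return false;  // blank line
    for (const char* h : kHeaders) {
        if (line.compare(start, std::char_traits<char>::length(h), h) == 0) {
            return true;
        }
    }
    return false;
}

// Concatenates several PTX sources into one module text. The header
// directives of the first source are kept; those of later sources are dropped.
// Symbol clashes between the sources are NOT resolved here.
std::string mergePtx(const std::vector<std::string>& sources) {
    std::string merged;
    for (std::size_t i = 0; i < sources.size(); ++i) {
        std::istringstream in(sources[i]);
        std::string line;
        while (std::getline(in, line)) {
            if (i > 0 && isModuleHeader(line)) continue;  // skip duplicate header
            merged += line;
            merged += '\n';
        }
    }
    return merged;
}
```

With the contents of the two files in (hypothetical) strings file1Ptx and file2Ptx, mergePtx({file1Ptx, file2Ptx}) gives a single PTX string that can then be passed to the driver API JIT, for example through cuModuleLoadDataEx. Since everything ends up in one module, the function pointers obtained at init stay inside that module.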

Pal.