I’m trying to put together a program that relies on GPU computing, but it should be extensible with external DLLs / PTX (cubin) files. Compiling everything into a single executable / kernel is a nightmare and makes development extremely slow (slow compile times, I can’t compile specific modules, etc.)…
So all I want to do is define some behavior in a C++ file, compile it to a PTX / cubin file, and load that code into my main kernel. (If possible, I don’t want to generate PTX code myself; I’m not familiar with code generation…)
So the basic structure looks like this:

file1.ptx
    someFunction
file2.ptx
    someKernel
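Just to make the setup concrete, here is roughly what the two source files look like before compiling them to PTX. This is only a sketch; the names (someFunction, setFnPtr, someKernel) and the nvcc flags are illustrative, not exactly what I use:

// file1.cu -- compiled separately, e.g.: nvcc -arch=sm_20 -ptx file1.cu -o file1.ptx
typedef float (*SomeFn)(float);

// the "plugin" behavior I want to load at runtime
__device__ float someFunction(float x) { return x + 1.0f; }

// tiny kernel that writes the address of someFunction into device memory
extern "C" __global__ void setFnPtr(unsigned long long* out) {
    *out = (unsigned long long)(SomeFn)someFunction;
}

// file2.cu -- compiled separately, e.g.: nvcc -arch=sm_20 -ptx file2.cu -o file2.ptx
typedef float (*SomeFn)(float);

extern "C" __global__ void someKernel(unsigned long long fptr, float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    SomeFn fn = (SomeFn)fptr;              // address obtained from the other module
    if (i < n) data[i] = fn(data[i]);      // this cross-module call is what I'm after
}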
And someKernel calls someFunction from file1.ptx. I have tried extracting the function pointer to someFunction (it is 128, actually) and passing it to someKernel using the driver API, but that doesn’t work.
So does that mean my function pointers are only valid for the duration of the kernel execution?
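The host side of the attempt looks roughly like this (driver API, error checking omitted; these are the standard entry points, the exact calls I use may differ slightly):

#include <cuda.h>
#include <stdio.h>

int main() {
    CUdevice dev; CUcontext ctx;
    CUmodule mod1, mod2;
    CUfunction setFnPtr, someKernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoad(&mod1, "file1.ptx");     // someFunction + setFnPtr
    cuModuleLoad(&mod2, "file2.ptx");     // someKernel

    // run the setter kernel from module 1 to get the address of someFunction
    CUdeviceptr dPtr;
    cuMemAlloc(&dPtr, sizeof(unsigned long long));
    cuModuleGetFunction(&setFnPtr, mod1, "setFnPtr");
    void* setArgs[] = { &dPtr };
    cuLaunchKernel(setFnPtr, 1, 1, 1, 1, 1, 1, 0, 0, setArgs, 0);

    unsigned long long fptr = 0;
    cuMemcpyDtoH(&fptr, dPtr, sizeof(fptr));
    printf("someFunction address: %llu\n", fptr);   // this is where I see 128

    // pass the raw address to someKernel in the other module -- this is the step that fails
    CUdeviceptr dData = 0; int n = 0;               // data setup omitted
    cuModuleGetFunction(&someKernel, mod2, "someKernel");
    void* kArgs[] = { &fptr, &dData, &n };
    cuLaunchKernel(someKernel, 1, 1, 1, 1, 1, 1, 0, 0, kArgs, 0);
    cuCtxSynchronize();
    return 0;
}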
If yes, is there some easy way to merge these two PTX files using the driver API functions, so I could use someFunction in someKernel? (And it shouldn’t take long, since instant, or nearly instant, code loading is important… nvcc works great, but with highly complex code the compilation is very slow.)
If not, what happens if I manually merge the two PTX files, copying the someKernel code into the other PTX file, and get the function pointers from that module? I’m afraid that would screw up my registers, right? (I’m using a lot of recursive calls, so I have a lot of stack available to the functions; can the tools manage that automatically?)
Just for clarification: is your someFunction in file1.ptx a regular C++ function, or is it a CUDA kernel? I.e., are you trying to call a regular, host-based function in a PTX file from a kernel in a separate PTX file?
It is a device function, a simple function that does something with my data. It is like extending a closed-source application with your own C++ DLLs, but with the application being a complex CUDA kernel and the DLL a device function.
Or something like what OptiX does: you are able to extend the OptiX kernel with your own shaders, geometry objects, etc. But if possible, I want to avoid taking their path (recreating the whole kernel on the fly). I have been able to do something like this with simple function pointers, and it works great, but external linkage is not supported in kernels (though I can understand why), and it seems that function pointers are only valid for the actual kernel.
I haven’t been able to have one kernel set a function pointer in memory and then call through it from a different kernel. Of course, I’m talking about PTX files and the driver API; it works with the runtime API, but then all of the kernels need to be in the same .cu file… (I have tried using multiple .cu files and passing pointers between them, but it didn’t work.)
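For reference, the single-.cu-file version that does work for me looks roughly like this (runtime API, sm_20; the names are illustrative):

typedef float (*OpFn)(float);

__device__ float scaleOp(float x) { return 2.0f * x; }

__device__ OpFn g_op = 0;                       // device-side function pointer slot

__global__ void initOp() { g_op = scaleOp; }    // "init" kernel stores the address

__global__ void applyOp(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && g_op) data[i] = g_op(data[i]); // call through the pointer
}

// host side: initOp<<<1,1>>>();  applyOp<<<blocks,threads>>>(d_data, n);

As long as both kernels come from the same .cu file (same module), the pointer stored in g_op stays valid across launches. The moment the kernels live in different modules, this breaks down, which is exactly my problem.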
Device functions don’t currently get external symbols in the host object files emitted by the CUDA toolchain, so there is currently no way to do what you are asking for.
PTX 2.x supports indirect calls via pointer, so it might at least be theoretically possible, but it would require extremely careful design. You would probably have to use inline PTX in your CUDA code to implement the function call, because I very much doubt the compiler could be persuaded to generate indirect calls via anonymous pointers. Even then, I would be highly skeptical it would work without completely turning off inlining in the compiler, which could otherwise have pretty major performance implications.
Please post your working proof of concept when you get it going so we can have a look at it…
Thank you for the tips! I don’t think using inline PTX would be that complicated, but turning off inlining is pretty harsh… Can I do that using pragmas in only some parts of the code?
To the best of my knowledge, inlining is all or nothing, and only controllable via the -Xopencc="-INLINE:=off" option to nvopencc. So you either have inlined calls, or you don’t. But it might still be possible to have inline function expansion and use inline PTX for the calls, where you will (somehow) provide the anonymous function pointer after the PTX has been generated. This is all really at the outer edge of what is documented and of how things really work, so best of luck with it.
So here is my reply about what I found out (sorry for the late response, I had a lot of other tasks recently).
At the moment I’m compiling the whole code into multiple PTX files, merging them together with the main kernel (in the external PTX files I use a small kernel to pass the function pointers back at program init), and passing the result to the JIT. It works, though it’s not the best solution… It handles simple cases, but in more complex ones I do need to modify some parts of the code (renaming constants and so on…). This way I don’t need to write my own PTX generation, and I can rely mostly on nvcc.
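In case anyone wants to try the same, the loading step is basically just handing the merged PTX text to the driver JIT. Something like this (the merge / renaming step itself is plain string processing and is omitted; the helper name is mine):

#include <cuda.h>

// 'mergedPtx' is the concatenated / patched PTX text (null-terminated)
CUmodule loadMergedPtx(const char* mergedPtx) {
    CUmodule mod;
    char errorLog[8192];
    CUjit_option opts[] = { CU_JIT_ERROR_LOG_BUFFER, CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES };
    void* vals[]        = { errorLog, (void*)(size_t)sizeof(errorLog) };

    // JIT-compile the merged PTX into a module; errors from the patched
    // PTX (e.g. clashing constant names) show up in errorLog
    if (cuModuleLoadDataEx(&mod, mergedPtx, 2, opts, vals) != CUDA_SUCCESS) {
        // inspect errorLog here to see what the JIT complained about
        return 0;
    }
    return mod;
}

The JIT step is fast enough for my purposes, which is what I was after with the "nearly instant code loading" requirement.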