NVRTC and __device__ functions

Hi All,

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I have identified a specific device function whose performance could be improved substantially by removing all global memory accesses.

Does CUDA allow the dynamic compilation and linking of a single device function (`__device__`, not `__global__`), in order to “override” an existing function?

Thank you very much indeed

Do you actually need a fairly complex setup using run-time compilation? Often, sufficient flexibility can already be achieved with a templated function: appropriate instances of it are generated by offline compilation, one of them is selected at run time, and the chosen instance is invoked via a function pointer.
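To illustrate the template approach, here is a minimal host-only C++ sketch. The names `Model`, `rate` and `select_rate` are made up for this example, and the two formulas are placeholders. In real CUDA code the instances would typically be templated `__global__` kernels dispatched from the host, since the address of a `__device__` function taken in host code cannot be used on the device; that mapping is an assumption on my part and is not shown here.

```cpp
#include <cstddef>

// Hypothetical model variants; names and formulas are illustrative only.
enum class Model : std::size_t { Linear = 0, Quadratic = 1 };

// One templated function; every instance below is compiled offline.
template <Model M>
double rate(double x, double k);

template <>
double rate<Model::Linear>(double x, double k) { return k * x; }

template <>
double rate<Model::Quadratic>(double x, double k) { return k * x * x; }

using RateFn = double (*)(double, double);

// Run-time selection: look up the precompiled instance by model index
// and return it as a plain function pointer.
inline RateFn select_rate(Model m) {
    static const RateFn table[] = {
        &rate<Model::Linear>,
        &rate<Model::Quadratic>,
    };
    return table[static_cast<std::size_t>(m)];
}
```

The caller picks the instance once, e.g. `RateFn f = select_rate(Model::Quadratic);`, and then calls `f(x, k)` in the hot path. This only works when the set of variants is known at build time.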

That’s what I usually do, but it is not sufficient in this case.

The equations that I need to calculate change according to the model that I simulate. To handle this variability, I have always relied on a serialized encoding of the equations that is parsed GPU-side at run time. However, this approach entails many global memory accesses and significant overhead.

It would be completely different if the equations were hardcoded into the simulator using run-time compilation. I do not need to recompile the whole simulator; I just need to dynamically create a single device function. Is it possible to do that?
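To make the idea concrete, here is a rough host-side sketch of the first step only: turning an equation into the source text of one `__device__` function. NVRTC takes its input as a source string, so this text is what would be passed to `nvrtcCreateProgram`. The helper name `make_device_function_source`, the fixed `(double x, double k)` signature and the use of `extern "C"` (to get an unmangled name) are assumptions for illustration. Compiling and linking the result into the rest of the simulator is not shown.

```cpp
#include <string>

// Builds CUDA C++ source text for a single __device__ function.
// `fn_name` and `expr` are hypothetical inputs: the function name and the
// right-hand side of the equation, already written as a C++ expression
// in terms of x and k.
std::string make_device_function_source(const std::string& fn_name,
                                        const std::string& expr) {
    std::string src;
    src += "extern \"C\" __device__ double ";
    src += fn_name;
    src += "(double x, double k)\n{\n    return ";
    src += expr;   // equation is hardcoded into the generated source
    src += ";\n}\n";
    return src;
}
```

For example, `make_device_function_source("rate", "k * x * x")` yields a function whose body is just `return k * x * x;`, with no equation data left to fetch from global memory.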