I’d like to compile many PTX programs at runtime on a multi-core system. I know there’s the driver API for runtime linking, but that requires a CUdevice, and from what I remember the process is heavily serialized. Is there a facility to create fatbins per thread in parallel?
It seems like this should be possible, since nvcc doesn’t require a device to be present and can generate fatbins, so it ought to be able to compile per thread.
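As a sketch of what I mean (assuming the CUDA toolchain is on PATH; the directory and file names here are hypothetical, and sm_80 is just an example target), the PTX-to-SASS step parallelizes trivially at the process level, one ptxas per core, with no GPU needed:

```shell
# One ptxas process per PTX module: PTX -> SASS (cubin), no device required.
# -P "$(nproc)" runs as many compilations concurrently as there are cores.
ls build/*.ptx | xargs -P "$(nproc)" -I {} \
    sh -c 'ptxas -arch=sm_80 -O3 "$1" -o "${1%.ptx}.cubin"' _ {}
```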
I guess I’m not sure why the PTX-to-optimized-SASS conversion has to be such a bottleneck. Any thoughts on options I missed?
The driver API generally JIT-compiles code for the currently selected device, so the process is not completely disconnected from the GPU. I acknowledge it doesn’t seem like that should require much interaction or serialization, and I personally don’t know the extent of the serialization, but I have seen reports of it. Beyond that I wouldn’t be able to explain the dependencies, the serialization, or the rationale for those observations in any detail.
Yes, you can create fatbins with nvcc. You can also have nvcc emit cubin output, which is directly consumable by the driver API without JIT compilation.
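For illustration, a sketch of the two output formats (assuming nvcc is on PATH; kernel.cu and the sm_80 target are placeholders for your own sources and architectures):

```shell
# Fatbin: can bundle SASS for one or more architectures plus PTX,
# leaving the driver a JIT fallback for future devices.
nvcc --fatbin -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 kernel.cu -o kernel.fatbin

# Cubin: SASS for a single architecture; loadable via cuModuleLoad
# with no JIT step at load time.
nvcc --cubin -arch=sm_80 kernel.cu -o kernel.cubin
```

The cubin route is what sidesteps the JIT serialization concern: the driver just loads the precompiled SASS.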
You can request changes or enhancements to CUDA behavior by filing a bug. In this case you might be asked for a demonstrator along with your observations; a short text description like the one in your post may not gain traction without one.
Thanks Robert! Glad to know I more or less had the full picture and wasn’t missing something.
I’ll write up some tests and file a request.