I should clarify: the #1 reason we don’t use the Runtime API (and anything that depends on it) is that previously (though this is less so the case right now) it lagged behind the Driver API - but primarily, it’s the extra work needed to delay load cudart…
I haven’t looked into the problem of delay loading cudart lately, but I’m guessing it still requires one to abstract all calls to it into a DLL of your own, and then delay load that DLL instead… Unless I missed something in the release notes of the past few revisions where you guys implemented delay loading into cudart? (but I’m fairly certain I wouldn’t have missed something as huge as that.)
Thrust (and I’m guessing cublas? I can’t recall - I’ve never used it personally) also directly depends on cudart, meaning any calls to Thrust/etc. we want to make would also have to be abstracted into their own shared libraries as well… (it’s viral, you see :P)
100% of our CUDA-accelerated apps don’t assume CUDA is installed, don’t assume any nVidia product of any kind is installed, and as such we can’t load any CUDA DLLs until the very last millisecond - the point where the user or our application has determined they ‘do’ indeed want to use CUDA, and that everything needed is installed to do so…
On the contrary, the Driver API is exceedingly simple to delay load… and despite losing ‘features’ like cuda-gdb (unstable, difficult to use, etc.) and device emulation (serial thread execution, which means anything that requires warp/block synchronization is instantly going to break, requiring one to maintain a second codebase for those kernels that do require synchronization to work… ugh!) - we had no choice but to stick with the Driver API because of that.
The Driver API also draws a much clearer line between host and device code, making it simpler for regular devs (with no CUDA experience) to not get confused, and simplifies the build process as well (which is important for people using custom build systems) - maintaining a nice happy development eco-system for everyone, regardless of experience :P
As for the solution (now I’ve had my rant ;)) - it’s been pretty simple in my eyes for quite some time (read: years…) now… The solution depends on how cudart is implemented though… I’ve always assumed it was built on top of the Driver API (no other magic nVidia driver calls, pure Driver API wrapper), so I’m going to maintain that assumption for now.
If this is the case, the ‘best’ solution to me would be to directly implement delay-loading either into the Driver API (preferable, if cudart relies exclusively on the Driver API) or the Runtime API (if the Runtime API has other trickery besides Driver API calls)…
As it stands, linking against the Driver API (and, by extension, cudart - and thus Thrust…?) will force-load nvcuda.dll/libcuda.so/libcuda.dylib (depending on OS) when your executable itself loads (i.e. before the executable’s entry point even runs? I haven’t tested this recently, but this is what I recall) - long before the first call into the Driver API (cuInit) ever happens - which is simply unacceptable for any production environment… instant crash on any machine without nVidia CUDA drivers installed.
Edit: Fixed misinformation (re-read this last night at home… noticed this error)
Windows/MSVC users have it nice and easy: they can use delayimp.lib to delay load the Driver API. *nix-based systems, though, require one to manually delay load (read: write your own cuda.lib/cuda.a shim which dynamically loads the real library at runtime, checks for the existence of each function, and redirects to it if it exists… otherwise throws a nice “invalid cuda version” error).
(Ironically as I was writing this, my CEO came up and asked me specifically what was taking so long - and I just explained exactly what I’m explaining here… but with an emphasis on lack of debugging tools… except Nexus…)
Anyway, this has been a widely known (well, I thought so… maybe not?) issue for quite some time now, see below:
I could find a few more I’m sure, but I think my rant explains the problem well enough for you guys to see the value of delay loading things on your end - and not ours…