So we have some engineering code that would benefit from CUDA acceleration. However, the same code should also execute on machines that are not equipped with CUDA hardware.
We currently use the runtime API for development. How would you create a binary where CUDA acceleration is optional, and no CUDA-specific file (driver, runtime DLL, etc.) is strictly required at run time?
I’m currently using C# to provide a front-end for some CUDA code, using CUDA.NET from GASS to provide dynamic wrapping of the binary (.cubin file). I’m aware that the CUDA.NET methods will throw exceptions left, right and center if you try calling CUDA code on a platform which is not CUDA-enabled. Some simple exception handling can hide this from the user and also redirect how your application proceeds, i.e. execute the code natively on the host.
I’m sure you can do something similar in any language which supports exception handling.
Even the driver API requires nvcuda.dll to be installed on the running machine.
What I suggest is:
When you link the CUDA libraries to your application, mark them as “delay loaded”. That way Windows will NOT stop the application from loading just because the CUDA DLLs are not present (that’s my vague memory of it). Then, when control reaches main(), call some routines that first check whether CUDA is installed on the machine.
For example: search the $PATH and $LD_LIBRARY_PATH variables to locate “cudart.dll” or “nvcuda.dll” or both. You could consider spawning some already available executable (like “which cudart.dll” or something similar) to do this job; otherwise, you can write it all yourself.
Once you find that CUDA is NOT installed, your application can call the normal CPU routines and never call the CUDA functions.
That should solve your problem.
NOTE: Delay loading has some constraints in Windows. Not sure if CUDART would be fine with it. Check this out: ms-help://MS.MSDNQTR.v80.en/MS.MSDN.v80/MS.VisualStudio.v80.en/dv_vccomp/html/0097ff65-550f-4a4e-8ac3-39bf6404f926.htm
I’d appreciate it if NVIDIA people could comment on this aspect…
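For reference, delay loading is a link-time option in the MSVC toolchain; a minimal sketch (the object and output names are placeholders, and cuda.lib is the import library that corresponds to nvcuda.dll):

```shell
# Link with nvcuda.dll marked delay-loaded; delayimp.lib supplies the
# delay-load helper that resolves the DLL on first call.
link /OUT:app.exe app.obj cuda.lib /DELAYLOAD:nvcuda.dll delayimp.lib
```

With this, nvcuda.dll is only searched for when the first driver-API function is actually called, not at process start-up.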
Some points on multi-GPU support:
We have written a multi-threaded foundation library for our higher-level finance library that deals with multi-GPU support. It is a simple library that also helps you write multi-GPU code. It is available for commercial license; it works on Windows and should work on Linux, though that is not tested yet.
For example, we have APIs like “PFC_AcquireCUDADevices(int number_of_devices, ACQUIRE_MODE (shared/exclusive));” – thus, if your application is multi-threaded, threads can share the available GPUs among themselves.
We also provide some utility APIs that would greatly simplify the task of multi-GPU programming - for example, data-splitting API.
At the bottom, it is very much similar to Mr. Anderson’s multi-GPU implementation (thanks to him), BUT we don’t require Boost – we do it all ourselves. And we provide classes that also make your CUDA API calls more natural (for readability).
Also note that the delay-load approach I mentioned could be implemented in our PFC library itself to make the job simpler for you: we would do the delay loading, and your application just needs to link against us.
I could put all my CUDA code into an external DLL that statically links against cudart.dll (runtime API)
I dynamically open that DLL using LoadLibrary(), and this fails if we don’t have cudart.dll. The question is whether this will fail with an annoying dialog box that the user has to acknowledge by pressing “OK”.
I then query the export symbol for my CUDA accelerated code from the DLL and call it if available.
Otherwise I fall back to the unaccelerated code.
This could work similarly (albeit with the dlopen API) on Linux.
I am in the process of adding CUDA acceleration to different parts of our software. We actually added a simple yet efficient plugin architecture for this (and other stuff we want to add). We have a default plugin which lives in our core code. Then we have a simple system that looks for DLLs in the plugin directory; if it finds one, it tries to dynamically load it and register it. If that succeeds, the calls that previously went to the original plugin now go to the newly loaded one. This way our core code isn’t CUDA-dependent and doesn’t even know CUDA exists, and if the CUDA DLL fails to load it automatically falls back on the default implementation. I have the CUDA plugins written in both the runtime and driver APIs, and they both work.
Like yummig, I’ve been writing my apps in C# and linking into the cuda driver DLL. He mentioned using the GASS .NET bindings, but I’ve written my own that I’ve been using. There is also a small bit of code that I’ve been including in my apps that checks for the presence of the driver DLL, and only uses the CUDA-accelerated code if it is present. If anyone else is doing .NET development, I’ll be glad to post that code, since it makes it quite easy to write an app that is CUDA-accelerated (but seamlessly falls back to non-CUDA code if there is no hardware present).
I would say not emulation, but rather an automatic GPU-code/CPU-code dispatcher. Emulation is too slow to be used in production software, but if nvcc is ever able to generate efficient CPU code, dispatching may be a good option.
Oops, sorry for that. And thanks for pointing that out.
I haven’t used LoadLibrary() for a while and probably forgot something. In fact, I’ve never had to use SetErrorMode() as default behaviour was acceptable.
That method is fine if you’re distributing an application - in which case you can copy cudart.dll in the same directory as the application / external DLL.
But if you’re distributing a library, there’s no way to know the final application directory. You have to copy cudart.dll into the application path (or, worse, into windows\system32), which is against NVIDIA’s recommendations for redistributing their DLLs. The only way we could have forced cudart.dll to load from a specific directory would have been through delay-load, and that doesn’t work for the runtime API.
I wish NVIDIA could provide some feedback on this issue…