How do I run an app with optional CUDA support on a machine with no GPU/CUDA runtime/drivers?

I have a pretty complex app with a bunch of .cpp/.cu files and want to make everything CUDA-related optional (if there is a GPU plus drivers on the machine, use the GPU code; otherwise, don't).

Is there a way to dynamically link and defer-load the CUDA runtime + CUDA libraries (load them on the first API/kernel call), so that when I start the application on a non-GPU machine, it works fine (no CUDA runtime is loaded)?

Is there an easy way to test this (to “hide” the GPU on the machine) with one dummy .cpp and one dummy .cu file?

A typical method would be to statically link your application against the cudart library. When you compile with nvcc, this is the default behavior.

You would then make a “trivial” CUDA call, such as cudaSetDevice(0);, at the beginning of your application. If that call returns an error, the assumption is that CUDA is not available, and you proceed with your non-GPU code path.
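A minimal sketch of that approach, assuming a single hypothetical main.cu file (the GPU/CPU printouts are placeholders for your real code paths):

```cpp
// main.cu -- hypothetical example; build with:
//   nvcc -o app main.cu
// nvcc links cudart statically by default, so the binary has no runtime
// dependency on libcudart.so and launches fine on a machine without CUDA.
#include <cstdio>
#include <cuda_runtime.h>

static bool cuda_available()
{
    // The "trivial" CUDA call: on a machine with no GPU/driver this
    // returns an error (e.g. cudaErrorNoDevice) rather than crashing.
    return cudaSetDevice(0) == cudaSuccess;
}

int main()
{
    if (cuda_available()) {
        printf("CUDA detected, taking GPU code path\n");  // launch kernels here
    } else {
        printf("No CUDA, taking CPU fallback path\n");    // CPU-only code here
    }
    return 0;
}
```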

You cannot use this method if you choose to link cudart dynamically.

In that case, there is no way to dynamically link to the CUDA runtime (using my definition of that term: a link specification with -lcudart, making calls to the CUDA runtime API “normally” in your application). If you attempt something like that with cudart dynamically linked and that library is not present on the target machine, the app will fail at launch time, because the dynamic loader cannot resolve the libcudart.so dependency.

This situation is not unique or specific to CUDA. Roughly speaking, any application with a library dependency has the above “avenues” available to it.

Therefore, the Linux world has come up with alternatives; I won’t be able to write a treatise here. The basic “run-time load” idea, however, is to keep a table of the API calls you intend to make. You do not dynamically link against the API library (cudart, in this case); instead, your application makes its calls through the aforementioned table. At application load time, you use the Linux dynamic loading facility (dlopen) to check for the library’s presence and, if it is present, load it. Once it is loaded, you look up entry points for all the API calls you care about and use those entry points to populate your call table.

Thereafter, application behavior proceeds as previously described. If you look for and attempt to load the library (cudart, in this case) and you do not find it, or the load fails, you assume that the target machine does not have CUDA. If the run-time load succeeds, you would probably still want to do the “trivial” CUDA test already mentioned before proceeding with other CUDA tasks.
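Here is a minimal sketch of that run-time load idea, assuming (hypothetically) that the only runtime API we need is cudaSetDevice, and assuming a CUDA 12-era soname of libcudart.so.12 (adjust for your installation):

```cpp
// loader.cpp -- run-time load of cudart via dlopen/dlsym (sketch).
// Build WITHOUT -lcudart:  g++ -o app loader.cpp -ldl
#include <cstdio>
#include <dlfcn.h>

// Call table: one function pointer per runtime API call we intend to make.
// The signature mirrors cudaSetDevice; cudaError_t is an int-sized enum.
typedef int (*cudaSetDevice_fn)(int);

struct CudaTable {
    cudaSetDevice_fn setDevice = nullptr;
};

// Try to locate and load the CUDA runtime; populate the table on success.
static bool load_cuda(CudaTable& t)
{
    void* handle = dlopen("libcudart.so.12", RTLD_NOW);  // assumed soname
    if (!handle)
        return false;  // library absent/unloadable: no CUDA on this machine
    t.setDevice = (cudaSetDevice_fn)dlsym(handle, "cudaSetDevice");
    return t.setDevice != nullptr;
}

int main()
{
    CudaTable cuda;
    // The "trivial" test again: 0 corresponds to cudaSuccess.
    if (load_cuda(cuda) && cuda.setDevice(0) == 0) {
        printf("CUDA runtime loaded, GPU path enabled\n");
        // ... make further CUDA calls through the table ...
    } else {
        printf("CUDA not available, using CPU fallback\n");
    }
    return 0;
}
```

Note that with this approach your host code can only reach the runtime through the table; kernel launches with the <<<...>>> syntax resolve against cudart at link time, so a common variant is to put all of your .cu code into its own shared library that links cudart normally, and to dlopen that library instead.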

One possible method to test would be to “hide” the GPUs on a CUDA machine with e.g. CUDA_VISIBLE_DEVICES; setting it to an empty string (or an invalid value) makes the runtime report that no devices are present. Here is an example that also uses the static-link mechanism I suggested first.
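Something along these lines, with a hypothetical dummy file t.cu built via nvcc -o t t.cu (static cudart by default):

```cpp
// t.cu -- dummy test file (hypothetical) for the CUDA_VISIBLE_DEVICES test.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void k() { }  // dummy kernel

int main()
{
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        // With GPUs hidden (or absent), we land here instead of crashing.
        printf("no CUDA (%s), taking CPU path\n", cudaGetErrorString(err));
        return 0;
    }
    k<<<1, 1>>>();                 // GPU path
    err = cudaDeviceSynchronize();
    printf("GPU path done: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Running ./t on the CUDA machine should take the GPU path; running CUDA_VISIBLE_DEVICES="" ./t should take the CPU branch, since the runtime then sees no devices.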