Delay load cuda runtime dll (cudart.dll)

Has anyone managed to delay-load the CUDA runtime DLL, cudart.dll?

I have a module that implements some functionality in CUDA, and I would like to be able to check that all the necessary CUDA files (nvcuda.dll, cudart.dll) are present. If any of these files is missing, the application shouldn’t crash but should run the CPU alternative instead.

This is why I’ve tried delay loading cudart.dll and calling __HrLoadAllImportsForDll("cudart.dll") before using any CUDA function. But even if no CUDA function has been called yet, if cudart.dll is absent the application throws an exception before getting the chance to call __HrLoadAllImportsForDll("cudart.dll").

Any help would be greatly appreciated.

Thank you.

It won’t work because some CUDA runtime code gets executed before you can call __HrLoadAllImportsForDll(). I don’t remember the details, but it’s something that is not very easy to work around.

Delay loading nvcuda.dll works just fine, however.
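
For reference, here is a minimal sketch of the nvcuda.dll check described above. It assumes an MSVC project linked with /DELAYLOAD:nvcuda.dll and delayimp.lib. The helper names delay_imports_available and choose_backend are made up for illustration, and outside MSVC the sketch simply reports the DLL as unavailable.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <windows.h>
#include <delayimp.h>                 // declares __HrLoadAllImportsForDll
#pragma comment(lib, "delayimp")      // delay-load helper library
#endif

// Returns true if every delay-loaded import of `dll` could be resolved.
// Assumption: the executable was linked with /DELAYLOAD:<dll>.
bool delay_imports_available(const char* dll) {
    assert(dll != nullptr);
#if defined(_MSC_VER)
    // The helper reports a missing DLL or missing export as a failed HRESULT.
    return SUCCEEDED(__HrLoadAllImportsForDll(dll));
#else
    (void)dll;
    return false;                     // sketch only: no delay-load helper here
#endif
}

enum class Backend { Cpu, Gpu };

// Pick the GPU path only when the driver DLL is really there.
Backend choose_backend() {
    return delay_imports_available("nvcuda.dll") ? Backend::Gpu : Backend::Cpu;
}
```

Calling choose_backend() once at startup, before any driver API function is touched, should be enough to route the rest of the program to the CPU code when the driver is missing.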

Thank you, AndreiB.

I’m now thinking about switching to Cuda driver API. On the other hand, device emulation is very useful when debugging a Cuda application and since this feature is only available in cuda runtime api… It’s very difficult to make a choice between the two APIs. Which one would you recommend?

I’m curious whether a future release will solve this problem with delay loading cudart.dll; maybe it’s worth waiting… :D

Thank you again for your help. :wave:

In my opinion, the Driver API is the only choice for a production environment and/or if your program must run on both CPU and GPU. It’s not too hard to use, and not much harder than the Runtime API.

A simple method.

Pack your CUDA code into a dll, then delay load your dll instead of CUDA’s.

Good luck.
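
A rough sketch of the wrapper-DLL idea, with one substitution: instead of the linker's delay-load mechanism it loads the wrapper explicitly with LoadLibraryA and GetProcAddress, which gives the same "fail softly" behaviour. The export name scale_gpu, the ScaleFn signature and the DLL name in the usage note are hypothetical. On non-Windows builds the sketch always picks the CPU path.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_WIN32)
#include <windows.h>
#endif

// Hypothetical entry point exported by the wrapper DLL (cdecl, extern "C").
using ScaleFn = int (*)(float* data, std::size_t n, float factor);

// CPU fallback with the same signature as the GPU entry point.
int scale_cpu(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
    return 0;
}

// Resolve the GPU implementation from the wrapper DLL, else fall back to CPU.
ScaleFn resolve_scale(const char* wrapper_dll) {
    assert(wrapper_dll != nullptr);
#if defined(_WIN32)
    // LoadLibraryA returns NULL if the wrapper itself, or a DLL it imports
    // statically (such as cudart.dll), cannot be found.
    if (HMODULE mod = LoadLibraryA(wrapper_dll)) {
        if (FARPROC p = GetProcAddress(mod, "scale_gpu")) {
            return reinterpret_cast<ScaleFn>(p);
        }
        FreeLibrary(mod);             // wrapper present but export missing
    }
#else
    (void)wrapper_dll;                // sketch only: Windows-specific loading
#endif
    return &scale_cpu;
}
```

Typical use would be something like ScaleFn scale = resolve_scale("my_cuda_wrapper.dll"); followed by scale(buffer, count, 2.0f);, so the caller never needs to know which implementation it got.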

But is there a way to use only nvcuda.dll?

1- If I link only with cuda.lib, and I only call driver API functions within my demo’s .CPP file, everything works fine. So far so good.

2- If I add a .CU file, I end up with linker errors which only resolve if I also link with cudart.lib:

1>bandwidthTest.obj : error LNK2019: unresolved external symbol ___cudaRegisterFatBinary@4 referenced in function ___sti____cudaRegisterAll_48_tmpxft_00000ee8_00000000_6_band


Here’s the command line I’m using to compile the .CU file. I don’t see where it explicitly requires cudart.lib / cudart.dll:

"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\common\inc" -o $(ConfigurationName)\bandwidthTest.obj --ptxas-options=-v


I have searched through the forum and saw that this problem comes up frequently: basically, it seems impossible to delay-load cudart.dll properly. Using an intermediate custom DLL would work, but is not very handy. Your advice is to prefer the driver API, which does not suffer from this drawback.

However, I would like to have NVIDIA’s opinion:

First, they are working hard to provide cudart.dll over cuda.dll. Second, it seems reasonable to think that the ability to make hybrid programs (that can run on the CPU if CUDA is not present) is a top feature for promoting CUDA in commercial programs.

Thus, NVIDIA could provide a FAQ section on this problem and, why not, an “official” solution or piece of advice, according to their plans for the future? (Something like “it will be fixed in CUDA 2.3”, for instance :-) )

Any NVIDIA engineer here to help?


Pierre Chatelier

I’m not sure if this would be acceptable to you, but the approach I’m using is to redistribute cudart.dll with my application. I believe this is now allowed under the licence agreement, provided it is placed in the application directory and not the system directory. Before attempting to use the runtime API, I call cuInit() from the driver API (with nvcuda.dll delay loaded) to make sure that I have a real hardware device and not an emulated one, though I think this can also be done by querying the device properties. I’m using the cuInit() approach because I also need to use cuMemFree().
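
The cuInit() check could look roughly like the sketch below. Note that it resolves cuInit explicitly through LoadLibraryA/GetProcAddress rather than through the delay-load mechanism mentioned above, so it needs neither cuda.h nor cuda.lib. The function name cuda_driver_usable is made up, CUresult is represented as a plain int (CUDA_SUCCESS is 0), and on non-Windows builds the sketch just returns false.

```cpp
#include <cassert>

#if defined(_WIN32)
#include <windows.h>
#define CU_CALL __stdcall             // driver API calling convention on Windows
#else
#define CU_CALL
#endif

// Driver API signature: CUresult cuInit(unsigned int Flags);
using CuInitFn = int (CU_CALL*)(unsigned int);

// Returns true only if the driver DLL is present and cuInit(0) succeeds,
// i.e. a real CUDA driver is installed and initialises correctly.
bool cuda_driver_usable(const char* driver_dll = "nvcuda.dll") {
    assert(driver_dll != nullptr);
#if defined(_WIN32)
    HMODULE drv = LoadLibraryA(driver_dll);
    if (!drv) return false;           // no NVIDIA driver installed
    auto init = reinterpret_cast<CuInitFn>(GetProcAddress(drv, "cuInit"));
    const bool ok = (init != nullptr) && (init(0) == 0);  // 0 == CUDA_SUCCESS
    if (!ok) FreeLibrary(drv);        // keep the driver loaded only on success
    return ok;
#else
    (void)driver_dll;
    return false;                     // sketch only: Windows-specific loading
#endif
}
```

Only when cuda_driver_usable() returns true would the program go on to call into cudart.dll; otherwise it would stay on the CPU path.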

You’re probably getting this linker error because your .cu file contains host code, which nvcc compiles. nvcc may also implicitly generate some code that uses the Runtime API; I’m not sure about this. The solution is to compile the device code with the -cubin switch (i.e. run nvcc -cubin on a file that contains only device code) and then manually load the resulting .cubin into a context with the Driver API.
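
To make the "load the .cubin manually" step concrete, here is a hedged sketch that resolves the needed driver API entry points from nvcuda.dll at run time, so nothing links against cudart.lib. The Kernel struct and load_kernel are illustrative names, the handle types are simplified stand-ins for the ones in cuda.h, and it assumes the legacy export names (such as cuCtxCreate) are available in the installed driver. The kernel would need to be declared extern "C" so that its name in the .cubin is not mangled. Non-Windows builds return false.

```cpp
#include <cassert>

#if defined(_WIN32)
#include <windows.h>
#define CU_CALL __stdcall
#else
#define CU_CALL
#endif

// Simplified stand-ins for the driver API types normally taken from cuda.h.
using CUdevice   = int;
using CUcontext  = void*;
using CUmodule   = void*;
using CUfunction = void*;

struct Kernel {
    CUcontext  ctx = nullptr;
    CUmodule   mod = nullptr;
    CUfunction fn  = nullptr;
};

// Loads `cubin_path` into a new context on device 0 and looks up `kernel_name`.
bool load_kernel(const char* cubin_path, const char* kernel_name, Kernel& out) {
    assert(cubin_path != nullptr && kernel_name != nullptr);
#if defined(_WIN32)
    HMODULE drv = LoadLibraryA("nvcuda.dll");
    if (!drv) return false;
    auto sym = [drv](const char* n) { return GetProcAddress(drv, n); };
    auto init   = reinterpret_cast<int (CU_CALL*)(unsigned int)>(sym("cuInit"));
    auto devGet = reinterpret_cast<int (CU_CALL*)(CUdevice*, int)>(sym("cuDeviceGet"));
    auto ctxNew = reinterpret_cast<int (CU_CALL*)(CUcontext*, unsigned int, CUdevice)>(
        sym("cuCtxCreate"));
    auto modLoad = reinterpret_cast<int (CU_CALL*)(CUmodule*, const char*)>(
        sym("cuModuleLoad"));
    auto getFn = reinterpret_cast<int (CU_CALL*)(CUfunction*, CUmodule, const char*)>(
        sym("cuModuleGetFunction"));
    if (!init || !devGet || !ctxNew || !modLoad || !getFn) return false;

    CUdevice dev = 0;
    // Every call returns 0 (CUDA_SUCCESS) on success; stop at the first failure.
    return init(0) == 0
        && devGet(&dev, 0) == 0
        && ctxNew(&out.ctx, 0, dev) == 0
        && modLoad(&out.mod, cubin_path) == 0
        && getFn(&out.fn, out.mod, kernel_name) == 0;
#else
    (void)cubin_path; (void)kernel_name; (void)out;
    return false;                     // sketch only: Windows-specific loading
#endif
}
```

Launching the kernel afterwards goes through the driver API as well, which is why this route avoids the ___cudaRegisterFatBinary dependency entirely.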

Well, I don’t see a real problem here. cudart.dll can be redistributed and provides methods for enumerating CUDA devices, i.e. a program can easily detect the presence of compatible cards at runtime and switch to GPU or CPU mode, so there’s no real point in delay-loading cudart.dll (except for checking the integrity of the distribution). nvcuda.dll is very different because it is part of the driver and thus cannot be redistributed. Its presence can easily be detected with the delay-loading mechanism.