Deploying CUDA applications

I’m wondering if it’s possible to run CUDA-powered applications on machines that don’t have the CUDA driver / toolkit / SDK installed. Basically, what are the deployment options? And are there any differences between the runtime API and the driver API in this respect?

Dan Lavigne

With the driver API you do not depend on anything except the driver itself (nvcuda.dll). With the runtime API you have to redistribute cudart.dll (and you need to check that the cudart.dll is of a supported version).

If you need to run your program on machines without the CUDA driver, then you have to mark nvcuda.dll as delay-loaded. More details here:

thanks for the info!

Does the CUDA toolkit license allow such a redistribution?

Technically, no. But the NVIDIA guys here say you can do it. It’s kind of dumb.

Thanks Alex. I guess the safest route is to use the driver API.

you can use the runtime API and redistribute cudart.dll. really, I asked Ian Buck this point-blank and he said “yes, it is okay for anyone who wants to redistribute cudart.dll to redistribute cudart.dll.” or libcudart or what have you.

hopefully this will cease to be a problem in a few months, but for the time being, if you want to use cudart, you can redistribute it. however, the driver API is still a much better option because it’s basically guaranteed to always work and avoids DLL problems.

Yeah, but the runtime API is a lot more elegant, which is good in terms of clean and maintainable code. I wonder, has anyone made a C++ wrapper around the driver API that aims for ease of use? Or do we have to roll our own?

Does the driver API support loading kernels as PTX files? Or does the program need to use architecture-specific cubins, and update them every time a new generation of hardware breaks binary backward compatibility? That is not very robust. One of the nice things about the runtime API is that it stores the PTX (which, admittedly, is tied to a virtual architecture version such as compute_10/compute_11/etc.) alongside the compiled machine code (tied to a real hardware version such as sm_10/sm_11/etc.). The runtime API is able to JIT the PTX bytecode into machine code on the fly if the right compiled version has not been stored.

Unfortunately, this advantage of the runtime API is sort of moot right now. To update the JITer for new architectures you would, presumably, have to update cudart.dll. Yet since the cudart.dll you use is local to your application, in essence you still have to update your application.

So as it stands, for your application not to be guaranteed to break as soon as NVIDIA makes a significant-enough change to the microarchitecture, the client needs the latest CUDA Toolkit installed: with the runtime API you need the latest cudart.dll, and with the driver API you need ptxas.exe.

if you read the nvcc documentation, you’ll see that it is possible for cubins to contain ptx.

in addition, I believe any JIT stuff is contained in the driver, not in cudart.dll. also, you must use the cudart.dll that was present during compilation and not a newer one, which is why you have to redistribute cudart.dll instead of just saying “install the toolkit.”

However, this is not the default behavior, correct?

In any case, I tried to generate such a cubin with this line:

nvcc -cubin --gpu-code=compute_10 […]

and received the following message:

nvcc fatal   : Option '-cubin' is not allowed when compiling for a virtual compute architecture
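That error is consistent with nvcc’s compilation model: `-cubin` asks for a binary targeting a *real* architecture, while `compute_10` names a *virtual* one. Something like the following (flags as documented in the nvcc manual; `kernel.cu` is a hypothetical source file) keeps the two cases apart:

```shell
# Emit PTX for a virtual architecture (can be JIT-compiled at load time):
nvcc -ptx --gpu-architecture=compute_10 kernel.cu -o kernel.ptx

# Emit a binary cubin for a real architecture:
nvcc -cubin --gpu-architecture=compute_10 --gpu-code=sm_10 kernel.cu -o kernel.cubin
```

These are reference command lines only; the exact option spellings should be checked against the nvcc documentation for the toolkit version in use.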

Ok, I’ve figured out how to get robust JITing of PTX through the driver API. You need to load not CUBINs but FATBINs. Page 20 of nvcc_2.0.pdf illustrates how a fatbin combines multiple versions of cubins and PTX. However, the CUDA reference guide has this to say in its reference section for the driver API function cuModuleLoadFatBinary():

So… until fatbins become available, using the Driver API right now is a guarantee of incompatibility. But you’re right, tmurray, cudart.dll is not necessary for JITing and shouldn’t need updating.

Just so you know, I confirmed today that we’re updating the EULA (I am hoping that it will be in the 2.1 beta, but it’s definitely going to be in 2.1 final) to clarify once and for all that yes, you can redistribute CUDART/CUBLAS/CUFFT dynamic libraries.