How to use the CUDA compatibility package to run a newer user-mode driver on an older kernel module

I’m trying to deploy a Singularity image (think: Docker but lighter and no root required) on our HPC cluster which uses CUDA. On the compute nodes CUDA 9.2 with driver 396.37 is installed but I’d like to use CUDA 10.x.

If I simply use the official CUDA Docker images (nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04) as a base and try to run a program compiled inside the container, I get “CUDA driver version is insufficient for CUDA runtime version”.

This is expected, since CUDA 10.1 requires driver >= 418.39. But https://docs.nvidia.com/deploy/cuda-compatibility/index.html describes that one can update all the user-mode CUDA components (runtime + driver) without updating the kernel-mode parts. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#flexible-upgrade-path mentions this too, but does not explain HOW to actually do it.

Following the first link, I tried sudo apt-get install cuda-compat-10.0 in the container and added export LD_LIBRARY_PATH="/usr/local/cuda-10.0/compat:$LD_LIBRARY_PATH" (I checked the path for correctness). But this still does not work: I get the same error when running a program, and nvidia-smi reports the old driver version.

If I check my application with ldd, I don’t see any mention of libcuda.so*, so the runtime seems to locate the driver library by some other means.
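I can at least watch what the loader actually resolves: libcuda.so.1 is dlopen()ed by the CUDA runtime at run time, which is why ldd never lists it. A quick way to trace it (using /bin/ls here as a stand-in for the CUDA binary):

```shell
# libcuda.so.1 is dlopen()ed by the CUDA runtime at run time, so it never
# appears in ldd output. glibc's LD_DEBUG traces every library the dynamic
# loader resolves; /bin/ls is just a stand-in -- run this against the
# actual CUDA binary and grep for libcuda instead.
LD_DEBUG=libs /bin/ls / 2>&1 | grep "find library"
```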

What is required to use the newer driver version without a kernel update?

In a CUDA 10.1 container, you should be installing the CUDA 10.1 compatibility package:

sudo apt-get install cuda-compat-10.1

However, I’m under the impression that the standard NVIDIA Docker containers for CUDA 10.1 already contain this compatibility structure. Here’s what I did:

I set up a new x86_64 machine with Ubuntu 16.04, CUDA 9.2 (driver 396.37), and a Tesla K40c.

Next I installed Docker and the nvidia-docker runtime, using the instructions here:

https://docs.nvidia.com/ngc/ngc-titan-setup-guide/index.html#installing-docker-nv-docker

Then I ran the container:

docker run --runtime=nvidia -it nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

And everything “just worked”. Although nvidia-smi in the container still reports driver 396.37 (this is expected), you can compile and run CUDA code normally using the CUDA 10.1 toolkit installed in the container.

I’m further convinced that the compatibility is already built into this particular container, because if I swap the Tesla card for a Quadro GPU, the container fails to load, complaining about the lack of a Tesla-brand device. This is expected, because the compatibility libraries require it. Note that the compatibility matrix is particular about what works with what, but CUDA 10.1 on a CUDA 9.2 install (driver 396.xx) is one of the supported paths (Table 3):

https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-application-compatibility

I haven’t tried this on ppc64le (power9) and I haven’t tried it with singularity. With respect to singularity, you may wish to be aware of the documentation here:

https://docs.nvidia.com/ngc/ngc-user-guide/singularity.html#singularity

In a CUDA 10.1 container, you should be installing the CUDA 10.1 compatibility package:

Thanks, that did solve the problem. I hadn’t seen this package before, and it also isn’t mentioned in the docs. A hint there would be nice.

However, I’m under the impression that the standard NVIDIA Docker containers for CUDA 10.1 already contain this compatibility structure.

Definitely not in the containers (the Dockerfiles at https://gitlab.com/nvidia/cuda-ppc64le/tree/ubuntu18.04 don’t install any compat package, and it doesn’t work when they are converted to Singularity containers), so I guess it is nvidia-docker doing some magic there. Maybe someone can clarify.

I’m further convinced that the compatibility is already built into this particular container, because if I switch the Tesla card to a Quadro GPU, the container fails to load

“The container fails to load”? This seems more like a hint that nvidia-docker is doing the magic: the container itself should load, and only programs should fail if the compat package does not work. Unless I misunderstand what you meant by “fails to load”.

Although nvidia-smi in the container still reports driver 396.37 (this is expected)

I’d say this is a bug. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#extended-nvidia-smi states: “In order to help both the administrator and the users, the nvidia-smi is enhanced to show the CUDA version in its display. It will use the currently configured paths to determine which CUDA version is being used.” But that might be because it is using the CUDA 9 nvidia-smi, which is not “enhanced”.

TL;DR: To make it work, just do

  • sudo apt-get install cuda-compat-10.1
  • export LD_LIBRARY_PATH="/usr/local/cuda-10.1/compat:$LD_LIBRARY_PATH"
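Note the order: the compat directory has to come before the host-injected driver libraries so that the dynamic loader resolves libcuda.so.1 from there. A quick sanity check (same path as above, assuming the default compat install location):

```shell
# Prepend the compat directory; the first LD_LIBRARY_PATH entry wins
# during library lookup.
export LD_LIBRARY_PATH="/usr/local/cuda-10.1/compat:${LD_LIBRARY_PATH}"

# Sanity check: print the first search-path entry.
echo "${LD_LIBRARY_PATH%%:*}"
# → /usr/local/cuda-10.1/compat
```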

Thanks!

CUDA version and driver version are not the same thing. The CUDA version is obtained using a method similar to the one in the deviceQuery sample code, and this is what is being “modified” by the compatibility library; it has nothing to do with nvidia-smi (which does not use CUDA at all).
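A minimal sketch of the distinction (hypothetical check.cu; this needs a GPU machine with the toolkit installed, so the version numbers in the comments are illustrative only):

```shell
# The runtime version is fixed by the toolkit you compiled against; the
# driver version reflects whichever libcuda.so.1 is loaded at run time --
# and that is the number the compat package raises.
cat > check.cu <<'EOF'
#include <cstdio>
int main() {
  int rt = 0, drv = 0;
  cudaRuntimeGetVersion(&rt);   // toolkit runtime, e.g. 10010 for 10.1
  cudaDriverGetVersion(&drv);   // CUDA version supported by loaded libcuda
  std::printf("runtime %d, driver %d\n", rt, drv);
  return 0;
}
EOF
nvcc check.cu -o check && ./check
```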

Ah, yes. I must have missed that with all these versions being mentioned. Thanks!

The x86_64 containers do appear to install the compat package. devel inherits from runtime, runtime inherits from base, and the base Dockerfile installs it:

https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.1/base/Dockerfile#L18

I don’t see anything equivalent in the ppc64le container, so that explains why you had to manually load it.
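Baking that manual step into the ppc64le image would look something like this (a hypothetical Dockerfile fragment mirroring what the x86_64 base image does; package name and compat path taken from the steps above):

```dockerfile
FROM nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04

# Forward-compatible user-mode driver for the 10.1 runtime (the x86_64
# base images install this; the ppc64le ones apparently do not):
RUN apt-get update \
    && apt-get install -y --no-install-recommends cuda-compat-10.1 \
    && rm -rf /var/lib/apt/lists/*

# Make the compat libcuda the first hit on the library search path:
ENV LD_LIBRARY_PATH /usr/local/cuda-10.1/compat:$LD_LIBRARY_PATH
```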

Yes, the container load failure (in my case) was brought about by this line in the Dockerfile:

https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.1/base/Dockerfile#L33

That line is trying to enforce the compatibility relationships and requirements expressed in the documentation.
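Paraphrased, the mechanism is an environment variable baked into the image that nvidia-docker evaluates at container start (this is a sketch, not the exact line; see the linked Dockerfile):

```dockerfile
# Sketch of the constraint: nvidia-docker refuses to start the container
# unless the host GPU/driver satisfies one of these clauses. The clauses
# that permit an older driver are restricted to brand=tesla, which is why
# the Quadro swap made the container fail to start.
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411"
```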

The ppc64le container simply requires CUDA 10.1 or higher in the base machine:

https://gitlab.com/nvidia/cuda-ppc64le/blob/ubuntu18.04/10.1/base/Dockerfile#L29

I’m sure it could be better documented, however the documentation does say this:

https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform

“(replace 10.0 with the appropriate version).”

That should apparently be construed to mean all previous references to 10.0 in that section, not just the 10.0 that precedes it in the same sentence.

It’s interesting to me that you were able to load the ppc64le 10.1 container on a system that had CUDA 9.2 loaded. I don’t think that would work with docker. I wonder if the conversion to singularity is affecting this.

This is sad. I already opened an issue asking for multi-arch images. That would allow a single Dockerfile for all archs, which also avoids the problem of the images differing between architectures: https://gitlab.com/nvidia/cuda/issues/41

Seems like it. I had read it assuming the package is a meta-package that installs its files into the appropriate folder depending on which CUDA version is currently installed.

Are you referring to the NVIDIA_REQUIRE_CUDA env variable and the failure you described? AFAIK those variables (NVIDIA_*) are only used by nvidia-docker, although I haven’t found any source describing all of them. So a Singularity container won’t have this check, as it isn’t using nvidia-docker (or any Docker at all).