Installing additional CUDA versions

I want to install several software packages on our Rocky Linux 8 cluster that depend on PyTorch, dgl-cuda and CUDA (specifically versions 11.6 to 11.8), but the only CUDA versions I currently have installed are 11.2, 12.1 and 12.2.

I read that having multiple CUDA versions installed usually isn’t a problem, but I want to ensure that 12.1 remains the “main” one in use and that installing a new one doesn’t brick the system. When I tried to install the new version with the runfile, it showed:

Existing package manager installation of the driver found. It is strongly recommended that you remove this before continuing.

And when I tried the rpm instead I got this:

[admin@cluster newcuda]$ sudo dnf -y module install nvidia-driver:latest-dkms
Warning: failed loading '/etc/yum.repos.d/oneAPI.repo', skipping.
Rocky Linux 8 - AppStream                        17 MB/s |  11 MB     00:00    
Rocky Linux 8 - BaseOS                           14 MB/s | 7.1 MB     00:00    
Rocky Linux 8 - PowerTools - Source             1.8 MB/s | 655 kB     00:00    
Rocky Linux 8 - Extras                           53 kB/s |  14 kB     00:00    
Rocky Linux 8 - PowerTools                      5.7 MB/s | 2.8 MB     00:00    
Rocky Linux 8 - PowerTools - Source             557 kB/s | 197 kB     00:00    
cuda-rhel8-x86_64                                16 MB/s | 2.7 MB     00:00    
cuda-rhel8-11-1-local                            26 MB/s |  70 kB     00:00    
cuda-rhel8-11-2-local                            30 MB/s |  72 kB     00:00    
cuda-rhel8-11-7-local                            44 MB/s |  87 kB     00:00    
cuda-rhel8-12-1-local                            36 MB/s |  94 kB     00:00    
ELRepo.org Community Enterprise Linux Repositor 399 kB/s | 243 kB     00:00    
Extra Packages for Enterprise Linux 8 - x86_64   12 MB/s |  16 MB     00:01    
Extra Packages for Enterprise Linux 8 - Next -  1.4 MB/s | 368 kB     00:00    
NVIDIA HPC SDK                                   19 MB/s | 3.1 MB     00:00    
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-530.30.02-4.18.0-477.27.1 for kernel version 4.18.0-477.27.1.el8_8 and NVIDIA driver 535.86.10 could be found
Error: 
 Problem: problem with installed package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64
  - package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64 conflicts with kmod-nvidia-latest-dkms provided by kmod-nvidia-latest-dkms-3:535.104.12-1.el8.x86_64
  - cannot install the best candidate for the job
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)

I fear that following that suggestion might cause problems, because it seems to want to erase the kernel drivers for 12.1 (?)

Do you have a suggestion on how I can proceed? The software in particular I was trying to install is called RFdiffusion and the error message said this:

 File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 9, in _fail
    raise RuntimeError("NVTX functions not installed. Are you sure you have a CUDA build?")
RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?

If I were doing this, I would use the runfile installer to install the older versions (feel free to use the RPM/package manager method for your latest version if you wish).

A full CUDA install, whether by package manager or runfile installer, can install both the CUDA toolkit and the GPU driver. The driver install is the sticking point: that is where the runfile and package manager methods can conflict. The CUDA toolkit portion generally won’t conflict. This is more or less evident in the warning message you excerpted:
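
To see the toolkit/driver split on your own machine, something like the following may help. It is only a sketch: it assumes the toolkits live in versioned directories under /usr/local (the usual default, but your cluster may differ), and list_cuda_toolkits is a helper name I made up for illustration.

```shell
# Sketch only: assumes toolkits are installed as /usr/local/cuda-<version>.
# list_cuda_toolkits is a hypothetical helper, not part of CUDA.
list_cuda_toolkits() {
    # Print each versioned toolkit directory under the given root.
    root="${1:-/usr/local}"
    for d in "$root"/cuda-*; do
        [ -d "$d" ] && printf '%s\n' "$d"
    done
    return 0
}

# Toolkit side: one directory per installed toolkit version.
list_cuda_toolkits

# Driver side: there is a single driver, reported independently of any toolkit.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
fi
```

Several toolkit directories alongside a single driver version is the situation the steps below aim for.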

Existing package manager installation of the driver found.

So:

  1. Install the latest version of CUDA (and the GPU driver) using either package manager or runfile installer method.
  2. Install older versions of CUDA (toolkit) using runfile installers. Deselect the option to install the driver during this step.
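
Step 2 might look roughly like this from the command line. Treat it as a sketch under assumptions: the runfile name and the install prefix are illustrative (use the name of the file you actually downloaded), and CUDA_RUNFILE, CUDA_PREFIX and toolkit_only_cmd are names I made up. The script only prints the command so you can review it before running it yourself with sudo. Selecting --toolkit without --driver is what leaves the existing driver alone.

```shell
# Sketch only: the file name and prefix below are assumptions, adjust to your download.
CUDA_RUNFILE="${CUDA_RUNFILE:-cuda_11.8.0_520.61.05_linux.run}"
CUDA_PREFIX="${CUDA_PREFIX:-/usr/local/cuda-11.8}"

toolkit_only_cmd() {
    # --silent      : non-interactive install
    # --toolkit     : install only the toolkit component (no --driver, so no driver install)
    # --toolkitpath : where the toolkit files go
    printf 'sh %s --silent --toolkit --toolkitpath=%s\n' "$CUDA_RUNFILE" "$CUDA_PREFIX"
}

# Print the command for review; nothing is installed by this script.
toolkit_only_cmd
```

Afterwards it is worth checking where /usr/local/cuda points (readlink -f /usr/local/cuda), since you want 12.1 to stay the default. If the installer changed the link, point it back at the 12.1 directory, using the alternatives mechanism if that is what manages the link on your system.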

I don’t really know how to do all of this using purely package manager methods. There may be a way; I just don’t know it. What I do know is that if you already have a suitable GPU driver installed, you can have the package manager install only the CUDA toolkit (not the GPU driver) by running dnf install cuda-toolkit instead of dnf install cuda. You can also install an older version of the toolkit using one of the versioned meta packages (e.g. dnf install cuda-toolkit-10-2). However, I personally don’t know how to keep multiple CUDA toolkits installed side by side this way. It may just work, but I suspect it does not without extra steps.
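
For reference, the versioned meta package names follow the pattern seen in cuda-toolkit-10-2: the dot in the version becomes a dash. A tiny sketch of that mapping (toolkit_pkg is a made-up helper name, and whether the resulting package exists in your configured repos is something to verify with dnf before installing):

```shell
# Sketch only: toolkit_pkg is a hypothetical helper that builds the meta package name.
toolkit_pkg() {
    # "11.8" -> "cuda-toolkit-11-8"
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr . -)"
}

# Print the dnf command for review rather than running it.
printf 'sudo dnf install %s\n' "$(toolkit_pkg 11.8)"
```

Running dnf without -y lets you read the transaction summary and back out if it proposes touching any driver or kmod-nvidia package.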

There is an install guide available, and I suggest reading it. The post-installation steps are one example of something I have not covered here, and they are usually necessary for things to work properly.
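
As a rough idea of what those post-installation steps amount to with several toolkits installed: each shell (or job script) puts the toolkit it wants first on PATH and LD_LIBRARY_PATH. The sketch below assumes the toolkits sit in versioned directories such as /usr/local/cuda-11.8. use_cuda is a made-up helper name, and CUDA_HOME is a convention some build tools look at rather than something CUDA itself requires.

```shell
# Sketch only: use_cuda is a hypothetical helper; the paths you pass in are your own.
use_cuda() {
    # $1 is the toolkit root, e.g. /usr/local/cuda-11.8
    if [ ! -d "$1" ]; then
        echo "no such CUDA root: $1" >&2
        return 1
    fi
    export CUDA_HOME="$1"
    # Prepend so this toolkit's nvcc and libraries are found before any others.
    export PATH="$1/bin${PATH:+:$PATH}"
    export LD_LIBRARY_PATH="$1/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

For example, calling use_cuda /usr/local/cuda-11.8 in the job script for the older software and then running nvcc --version should report 11.8, while shells that never call it keep seeing whatever the system default is.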
