I want to install several pieces of software on our Rocky Linux 8 cluster that depend on PyTorch, dgl-cuda and CUDA, more specifically CUDA versions 11.6-11.8, but the only CUDA versions I currently have installed are 11.2, 12.1 and 12.2.
I read that having multiple CUDA versions installed usually isn't a problem, but I want to ensure that 12.1 remains the “main” one in use and that installing a new one doesn't brick the system. When I tried to install it with the runfile, it showed:
Existing package manager installation of the driver found. It is strongly recommended that you remove this before continuing.
And when I tried the rpm instead I got this:
[admin@cluster newcuda]$ sudo dnf -y module install nvidia-driver:latest-dkms
Warning: failed loading '/etc/yum.repos.d/oneAPI.repo', skipping.
Rocky Linux 8 - AppStream 17 MB/s | 11 MB 00:00
Rocky Linux 8 - BaseOS 14 MB/s | 7.1 MB 00:00
Rocky Linux 8 - PowerTools - Source 1.8 MB/s | 655 kB 00:00
Rocky Linux 8 - Extras 53 kB/s | 14 kB 00:00
Rocky Linux 8 - PowerTools 5.7 MB/s | 2.8 MB 00:00
Rocky Linux 8 - PowerTools - Source 557 kB/s | 197 kB 00:00
cuda-rhel8-x86_64 16 MB/s | 2.7 MB 00:00
cuda-rhel8-11-1-local 26 MB/s | 70 kB 00:00
cuda-rhel8-11-2-local 30 MB/s | 72 kB 00:00
cuda-rhel8-11-7-local 44 MB/s | 87 kB 00:00
cuda-rhel8-12-1-local 36 MB/s | 94 kB 00:00
ELRepo.org Community Enterprise Linux Repositor 399 kB/s | 243 kB 00:00
Extra Packages for Enterprise Linux 8 - x86_64 12 MB/s | 16 MB 00:01
Extra Packages for Enterprise Linux 8 - Next - 1.4 MB/s | 368 kB 00:00
NVIDIA HPC SDK 19 MB/s | 3.1 MB 00:00
NOTE: Skipping kernel installation since no kernel module package kmod-nvidia-530.30.02-4.18.0-477.27.1 for kernel version 4.18.0-477.27.1.el8_8 and NVIDIA driver 535.86.10 could be found
Error:
Problem: problem with installed package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64
- package kmod-nvidia-535.86.10-4.18.0-477.21.1-3:535.86.10-3.el8_8.x86_64 conflicts with kmod-nvidia-latest-dkms provided by kmod-nvidia-latest-dkms-3:535.104.12-1.el8.x86_64
- cannot install the best candidate for the job
(try to add '--allowerasing' to command line to replace conflicting packages or '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
I fear this might cause problems, because it seems to want to erase the kernel drivers for 12.1 (?)
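For reference, here is a minimal sketch of how the installed toolkits and the current default could be listed before changing anything. It assumes the toolkits live in /usr/local/cuda-X.Y and that /usr/local/cuda is a symlink to the default one, which is the usual NVIDIA layout but may not match this cluster. The function name list_cuda_toolkits is just an illustrative helper.

```python
import glob
import os


def list_cuda_toolkits(prefix="/usr/local"):
    """Return (toolkit directories, resolved target of the default symlink).

    Assumes the conventional layout: <prefix>/cuda-X.Y directories plus a
    <prefix>/cuda symlink pointing at the default toolkit.
    """
    # Keep real directories only; symlinks such as cuda-12 are skipped.
    toolkits = sorted(
        path
        for path in glob.glob(os.path.join(prefix, "cuda-*"))
        if os.path.isdir(path) and not os.path.islink(path)
    )
    default_link = os.path.join(prefix, "cuda")
    # realpath follows every link in the chain (e.g. via /etc/alternatives).
    default = os.path.realpath(default_link) if os.path.exists(default_link) else None
    return toolkits, default


if __name__ == "__main__":
    found, default = list_cuda_toolkits()
    for path in found:
        print("toolkit:", path)
    print("default:", default)
```

Running this before and after an install would show whether the default still resolves to the 12.1 directory.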
Do you have a suggestion for how I can proceed? The software I was trying to install is called RFdiffusion, and its error message said this:
File "/software/anaconda/envs/SE3nv/lib/python3.9/site-packages/torch/cuda/nvtx.py", line 9, in _fail
raise RuntimeError("NVTX functions not installed. Are you sure you have a CUDA build?")
RuntimeError: NVTX functions not installed. Are you sure you have a CUDA build?
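Since the error asks whether this is a CUDA build, a small check like the following could be run inside the SE3nv environment to see what the installed PyTorch reports. This is only a sketch: describe_torch_build is a hypothetical helper name, and it relies on torch.version.cuda being None for CPU-only builds.

```python
import importlib


def describe_torch_build():
    """Return basic facts about the PyTorch build in the active environment."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not importable from this interpreter at all.
        return {"installed": False}
    return {
        "installed": True,
        "torch_version": torch.__version__,
        # None for a CPU-only build, otherwise the CUDA version it was built against.
        "built_for_cuda": torch.version.cuda,
        # True only if a usable driver and GPU are visible at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If built_for_cuda comes back as None, the environment has a CPU-only PyTorch, which would be a separate issue from which system CUDA toolkit is installed.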