I have a RHEL 8.7 server with four A40 GPUs, and I’m unable to get the NVIDIA drivers working on it.
I used the rpm (network) method for RHEL 8 on the CUDA download page; I added the repo and ran:
sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda
All of these commands completed successfully. I then rebooted, but I still don’t have a usable NVIDIA driver:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Troubleshooting I’ve done so far:
Checking that nouveau isn’t loaded (the command prints nothing):
lsmod|grep -i nouveau
Checking whether the nvidia module is loaded; it’s not:
lsmod|grep -i nvidia
A quick sanity check that this machine really has the NVIDIA A40 cards I think it has:
lspci|grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
Checking the modprobe directories; things look OK:
grep -ir nvidia /etc/modprobe.d/
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
grep -ir nvidia /lib/modprobe.d/
/lib/modprobe.d/dist-blacklist.conf:blacklist nvidiafb
/lib/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
/lib/modprobe.d/nvidia.conf:# Make a soft dependency for nvidia-uvm as adding the module loading to
/lib/modprobe.d/nvidia.conf:# /usr/lib/modules-load.d/nvidia-uvm.conf for systemd consumption, makes the
/lib/modprobe.d/nvidia.conf:softdep nvidia post: nvidia-uvm
/lib/modprobe.d/nvidia.conf:options nvidia NVreg_DynamicPowerManagement=0x02
/lib/modprobe.d/nvidia.conf:# Fedora disables Wayland if detecting the Nvidia driver.
/lib/modprobe.d/nvidia.conf:# options nvidia-drm modeset=1
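One caveat with the greps above: they only match lines containing "nvidia", so the nouveau directives inside nvidia-installer-disable-nouveau.conf never show up in the output. A minimal sketch for listing the active (uncommented) nouveau directives directly; the function name nouveau_directives is my own, and I'm assuming the usual modprobe.d keywords (blacklist, options, install, alias):

```shell
# Keep only uncommented modprobe directives whose target is nouveau.
# Reads modprobe.d file contents on stdin.
nouveau_directives() {
  grep -E '^[[:space:]]*(blacklist|options|install|alias)[[:space:]]+nouveau([[:space:]]|$)'
}

# Scan the same two directories that were grepped above.
for dir in /etc/modprobe.d /lib/modprobe.d; do
  [ -d "$dir" ] || continue
  cat "$dir"/*.conf 2>/dev/null | nouveau_directives || true
done
```

If nothing is printed, nouveau is not actually blacklisted by any of these files, even though the installer-generated file exists.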
Checking which NVIDIA packages I ended up with; I don’t see anything wrong here either:
rpm -qa|grep -i nvidia
nvidia-libXNVCtrl-530.30.02-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-530.30.02-1.el8.x86_64
nvidia-modprobe-530.30.02-1.el8.x86_64
kmod-nvidia-latest-dkms-530.30.02-1.el8.x86_64
nvidia-driver-libs-530.30.02-1.el8.x86_64
nvidia-kmod-common-530.30.02-1.el8.noarch
nvidia-xconfig-530.30.02-1.el8.x86_64
nvidia-libXNVCtrl-devel-530.30.02-1.el8.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-cuda-libs-530.30.02-1.el8.x86_64
nvidia-persistenced-530.30.02-1.el8.x86_64
nvidia-driver-530.30.02-1.el8.x86_64
nvidia-driver-cuda-530.30.02-1.el8.x86_64
nvidia-driver-NVML-530.30.02-1.el8.x86_64
nvidia-driver-devel-530.30.02-1.el8.x86_64
nvidia-settings-530.30.02-1.el8.x86_64
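To check the "nothing wrong here" impression mechanically rather than by eye, here is a small sketch that prints the distinct versions found among the installed NVIDIA driver packages. The function name driver_versions is hypothetical; dnf-plugin-nvidia is skipped because it is versioned separately (2.0 in the list above).

```shell
# Reads "name version" pairs on stdin and prints the distinct versions
# of packages whose name contains "nvidia", ignoring the dnf plugin.
driver_versions() {
  awk '$1 ~ /nvidia/ && $1 != "dnf-plugin-nvidia" { print $2 }' | sort -u
}

# Only query the RPM database where rpm is available.
if command -v rpm >/dev/null 2>&1; then
  rpm -qa --qf '%{NAME} %{VERSION}\n' | driver_versions
fi
```

A single line of output means every driver package is at the same version; more than one line would suggest a partially upgraded install.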
dmesg is showing an error:
NVRM: No NVIDIA devices probed.
[ 8.528141] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[ 8.718847] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 8.718854] NVRM: The NVIDIA probe routine was not called for 4 device(s).
[ 8.721414] NVRM: This can occur when a driver such as:
NVRM: nouveau, rivafb, nvidiafb or rivatv
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[ 8.721415] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
But I don’t see any conflicting drivers loaded:
lsmod|grep -E "nouveau|rivafb|nvidiafb|rivatv"
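Since lsmod only lists loaded modules by name, it can't tell me which driver, if any, is actually bound to each GPU; a driver not named in the NVRM message (vfio-pci, for example) would slip past the grep above. A minimal sketch that summarizes the bound driver per NVIDIA device from lspci -k output; the function name bound_drivers is my own, and I haven't confirmed that a bound driver is the cause here:

```shell
# Print "<pci-address> <bound-driver>" for each device in `lspci -k` output.
# Devices without a "Kernel driver in use" line are reported as "(none)".
bound_drivers() {
  awk '
    /^[^[:space:]]/ {
      if (dev != "") print dev, drv
      dev = $1; drv = "(none)"
    }
    /Kernel driver in use:/ { drv = $NF }
    END { if (dev != "") print dev, drv }
  '
}

# 10de is the NVIDIA PCI vendor ID; -k adds the kernel driver lines.
if command -v lspci >/dev/null 2>&1; then
  lspci -k -d 10de: | bound_drivers
fi
```

Anything other than nvidia in the second column would be a candidate for the conflicting driver; "(none)" for all four devices would point somewhere else.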
Am I missing something obvious here? I attached the output of nvidia-bug-report.sh as well. Any assistance would be greatly appreciated!
nvidia-bug-report.log.gz (106.7 KB)