Nvidia driver not loading after installation

I have a RHEL 8.7 server with some A40 GPUs; I’m unable to get Nvidia drivers working on this machine.

I used the rpm (network) method for RHEL 8 on the CUDA download page; I added the repo and ran:

sudo dnf clean all
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda

and these commands all completed successfully. I then rebooted, but still don’t have an nvidia driver I can use:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Troubleshooting I’ve done so far:

Checking to make sure nouveau isn’t loaded:

lsmod|grep -i nouveau

Checking whether the nvidia module is loaded (it’s not):

lsmod|grep -i nvidia

A quick sanity check to confirm the machine really has the Nvidia A40 cards I think it has:

lspci|grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
65:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
ca:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)
e3:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)

Checking the modprobe directories and things look ok:

grep -ir nvidia /etc/modprobe.d/
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer

grep -ir nvidia /lib/modprobe.d/
/lib/modprobe.d/dist-blacklist.conf:blacklist nvidiafb
/lib/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
/lib/modprobe.d/nvidia.conf:# Make a soft dependency for nvidia-uvm as adding the module loading to
/lib/modprobe.d/nvidia.conf:# /usr/lib/modules-load.d/nvidia-uvm.conf for systemd consumption, makes the
/lib/modprobe.d/nvidia.conf:softdep nvidia post: nvidia-uvm
/lib/modprobe.d/nvidia.conf:options nvidia NVreg_DynamicPowerManagement=0x02
/lib/modprobe.d/nvidia.conf:# Fedora disables Wayland if detecting the Nvidia driver.
/lib/modprobe.d/nvidia.conf:# options nvidia-drm modeset=1

Checking to see which nvidia packages I ended up with, I don’t see anything wrong here either:

rpm -qa|grep -i nvidia

dmesg is showing an error:

NVRM: No NVIDIA devices probed.
[    8.528141] nvidia-nvlink: Unregistered Nvlink Core, major device number 236
[    8.718847] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[    8.718854] NVRM: The NVIDIA probe routine was not called for 4 device(s).
[    8.721414] NVRM: This can occur when a driver such as:
               NVRM: nouveau, rivafb, nvidiafb or rivatv
               NVRM: was loaded and obtained ownership of the NVIDIA device(s).
[    8.721415] NVRM: Try unloading the conflicting kernel module (and/or
               NVRM: reconfigure your kernel without the conflicting
               NVRM: driver(s)), then try loading the NVIDIA kernel module
               NVRM: again.

But I don’t see any conflicting drivers loaded:

lsmod|grep -E "nouveau|rivafb|nvidiafb|rivatv"

Am I missing something obvious here? I attached the output of nvidia-bug-report.sh as well. Any assistance would be greatly appreciated!

nvidia-bug-report.log.gz (106.7 KB)

All of the NVIDIA GPUs are bound to the vfio-pci driver, i.e. they are set up for pass-through to a VM.
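
One way to see this binding directly is to ask lspci which kernel driver currently owns each NVIDIA device. The following is a minimal sketch, not taken from the bug report: bound_drivers is a hypothetical helper name, and 10de is NVIDIA's PCI vendor ID.

```shell
#!/usr/bin/env bash
# bound_drivers (hypothetical helper): read `lspci -k` style output on stdin
# and print "<PCI slot> <driver>" for every device that has a driver bound.
bound_drivers() {
    awk '
        /^[0-9a-f]/             { slot = $1 }         # device header line begins with the slot
        /Kernel driver in use:/ { print slot, $NF }   # indented detail line names the driver
    '
}

# Only query the hardware when lspci is available on this host.
if command -v lspci >/dev/null 2>&1; then
    lspci -k -d 10de: | bound_drivers
fi
```

If a GPU's line ends in vfio-pci rather than nvidia, the nvidia module cannot claim that device, which would match the "probe routine was not called" message in dmesg.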

Thank you so much!

For anyone else who comes across this issue: I had to remove intel_iommu=on iommu=pt from /etc/default/grub. (It was already gone from one of the servers I was working on, but not the other; I suspect one of them just needed GRUB regenerated.)
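
To check whether the running kernel was actually booted with those parameters, something like the sketch below may help. It assumes the parameters appear as space-separated words in /proc/cmdline; has_param is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# has_param CMDLINE PARAM (hypothetical helper): succeed only if PARAM
# appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Kernel command line of the currently running kernel (empty if unreadable).
cmdline=$(cat /proc/cmdline 2>/dev/null)

for p in intel_iommu=on iommu=pt; do
    if has_param "$cmdline" "$p"; then
        echo "$p is present on the running kernel command line"
    fi
done
```

If a parameter is reported here but is no longer in /etc/default/grub, the boot configuration has probably not been regenerated since the file was edited.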

Then I ran this to regenerate GRUB:

grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

On a BIOS system rather than a UEFI one, you would instead use this to regenerate GRUB:

grub2-mkconfig -o /boot/grub2/grub.cfg
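
If it is unclear which of the two applies, the choice can be sketched as below. This relies on /sys/firmware/efi existing only when the system was booted through UEFI; grub_cfg_path is a hypothetical helper name, and the two paths are the ones given above for RHEL. The script only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# grub_cfg_path [EFI_DIR] (hypothetical helper): print the grub.cfg path
# matching the firmware type. EFI_DIR defaults to /sys/firmware/efi and is
# overridable only so the logic can be exercised without rebooting.
grub_cfg_path() {
    efi_dir=${1:-/sys/firmware/efi}
    if [ -d "$efi_dir" ]; then
        echo /boot/efi/EFI/redhat/grub.cfg   # UEFI boot
    else
        echo /boot/grub2/grub.cfg            # legacy BIOS boot
    fi
}

# Show the command to run; review it before executing it as root.
echo "grub2-mkconfig -o $(grub_cfg_path)"
```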
