I am getting “No devices were found” when running nvidia-smi. I have been in the process of upgrading the NVIDIA driver and kernel module to 515.86.01 and the CUDA toolkit to 11.7.1.
We had everything working, and then our Linux admins updated the OS to a new release of RHEL 7.9. We then started getting a “Failed to initialize NVML: Driver/library version mismatch” error and solved it, at least temporarily, with the steps in “How to prevent API mismatch”.
Our Linux admins subsequently updated the OS again, and once again we ended up with the mismatch. The Linux admin stopped the incorrect driver from being loaded into the kernel using the following solution.
Loading the correct version of the kernel module is the first thing that you should do when you see the NVIDIA NVML driver/library version mismatch error. To load the correct version of the kernel module, use the following steps:
- Open your terminal.
- List all the loaded Nvidia drivers using the following: lsmod | grep nvidia
- Inspect the output of the previous command; it should contain Unified Memory Kernel (nvidia_uvm), Direct Rendering Manager (nvidia_drm), nvidia_modeset, and nvidia.
- Unload the modules that depend on nvidia by running each of the following commands in order: “sudo rmmod nvidia_drm”, “sudo rmmod nvidia_modeset” and “sudo rmmod nvidia_uvm”
- If you get “rmmod: ERROR: Module nvidia_drm is in use”, find the processes that are holding the NVIDIA device files using the following: sudo lsof /dev/nvidia*
- Kill all the related Nvidia processes and unload the remaining dependencies.
- Unload “nvidia” itself using the following: sudo rmmod nvidia
- Confirm that the kernel modules are unloaded: the following should return no output: “lsmod | grep nvidia”.
- Confirm that the correct driver loads by running the NVIDIA System Management Interface, “nvidia-smi”.
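The unload order in the steps above can be sketched as a small shell helper. This is only an illustration, not the admin's actual script: the function names `nvidia_unload_order` and `print_unload_commands` are made up for this example, and the helper only prints the `rmmod` commands so they can be reviewed before anything is unloaded.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical, not part of any NVIDIA tool.

# Read `lsmod` output on stdin and print the loaded NVIDIA modules in a safe
# unload order: the modules that depend on `nvidia` first, `nvidia` last.
nvidia_unload_order() {
    # First column of lsmod output is the module name.
    loaded=$(awk '{ print $1 }')
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # -x matches the whole line, so "nvidia" does not match "nvidia_drm".
        if printf '%s\n' "$loaded" | grep -qxF "$mod"; then
            printf '%s\n' "$mod"
        fi
    done
}

# Print (do not run) the matching rmmod commands, one per line.
print_unload_commands() {
    nvidia_unload_order | while read -r mod; do
        printf 'sudo rmmod %s\n' "$mod"
    done
}

# Usage (prints commands only):
#   lsmod | print_unload_commands
```

If any printed `rmmod` command later fails because the module is in use, the `lsof /dev/nvidia*` step from the list still applies before retrying.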
Everything was good for a few hours, but now when I run nvidia-smi I get “No devices were found”. I can see the devices:
(base) [root@paidsrfchtc01 nvidia]# sudo lshw -C display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:03:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller cap_list rom
configuration: driver=ast latency=0
resources: irq:17 memory:90000000-90ffffff memory:91000000-9101ffff ioport:3000(size=128)
*-display
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:5e:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:38fd0-38fcf iomemory:38fe0-38fdf irq:186 memory:a5000000-a5ffffff memory:38fdc0000000-38fdcfffffff memory:38fed0000000-38fed1ffffff memory:a6000000-a63fffff memory:38fdd0000000-38fecfffffff memory:38fed2000000-38fef1ffffff
*-display
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:af:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:39bd0-39bcf iomemory:39be0-39bdf irq:187 memory:ce000000-ceffffff memory:39bdc0000000-39bdcfffffff memory:39bed0000000-39bed1ffffff memory:cf0
The system shows the right driver and kernel module version (515.86.01), but DKMS is still looking for 450.51.06:
(base) [root@xxxxxxxxx nvidia]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GCC version: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
(base) [root@xxxxxxxx nvidia]# cat /sys/module/nvidia/version
515.86.01
(base) [root@xxxxxxxx nvidia]# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/450.51.06/source/dkms.conf does not exist.
(base) [root@paidsrfchtc01 nvidia]#
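To narrow down which DKMS entries are stale, a check like the following might help. It is a rough sketch under assumptions: the function name `broken_dkms_versions` is made up for this example, and the default tree path is taken from the error message above. It lists every version directory under the tree whose `source/dkms.conf` cannot be found, which is the condition `dkms status` is complaining about.

```shell
#!/bin/sh
# Sketch only: `broken_dkms_versions` is a hypothetical helper name.

# List version directories in a DKMS module tree that have no reachable
# source/dkms.conf (for example because the source symlink is dangling).
# $1: DKMS tree for one module; defaults to the path from the error above.
broken_dkms_versions() {
    tree=${1:-/var/lib/dkms/nvidia}
    for dir in "$tree"/*/; do
        [ -d "$dir" ] || continue        # glob matched nothing
        [ -L "${dir%/}" ] && continue    # skip symlinked entries, keep real version dirs
        # -e follows symlinks, so a dangling `source` link counts as missing.
        if [ ! -e "${dir}source/dkms.conf" ]; then
            basename "$dir"
        fi
    done
}

# Usage: compare against the loaded module version.
#   cat /sys/module/nvidia/version
#   broken_dkms_versions
```

On this machine I would expect it to print 450.51.06 and not 515.86.01, but I have not confirmed that.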
I have uploaded an NVIDIA bug report.
nvidia-bug-report.log.gz (844.0 KB)