“No devices were found” when running nvidia-smi

I am getting “No devices were found” when running nvidia-smi. I have been in the process of upgrading the NVIDIA driver and kernel module to 515.86.01 and the CUDA toolkit to 11.7.1.

We had everything working, then our Linux admins updated the OS to a new release of RHEL 7.9. We then started getting a “Failed to initialize NVML: Driver/library version mismatch” error and solved it, at least temporarily, with the steps from “How to prevent API mismatch”.

Our Linux admins subsequently updated the OS again, and once again we ended up with the mismatch. The Linux admin fixed it by stopping the incorrect driver from being loaded into the kernel, using the following solution.

Loading the correct version of the kernel module is the first thing you should do when you see the NVML driver/library version mismatch error. To load the correct version of the kernel module, use the following steps (a combined command sketch follows the list):

  1. Open your terminal.
  2. List all the loaded Nvidia drivers using the following: lsmod | grep nvidia
  3. Inspect the output of the previous command; it should contain Unified Memory (nvidia_uvm), Direct Rendering Manager (nvidia_drm), nvidia_modeset, and nvidia.
  4. Unload the dependencies of nvidia by running each of the following commands: “sudo rmmod nvidia_drm”, “sudo rmmod nvidia_modeset” and “sudo rmmod nvidia_uvm”.
  5. Troubleshoot any “rmmod: ERROR: Module nvidia_drm is in use” errors using the following: sudo lsof /dev/nvidia*
  6. Kill all the related Nvidia processes and unload the remaining dependencies.
  7. Unload “nvidia” itself using the following: sudo rmmod nvidia
  8. Confirm that you’ve unloaded the kernel modules: the output of “lsmod | grep nvidia” should now be empty.
  9. Confirm that you can load the correct driver by running the NVIDIA System Management Interface, “nvidia-smi”.
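
Combined into one sequence, the steps above look roughly like this (what exactly needs to be killed depends on what lsof reports):

lsmod | grep nvidia                               # which nvidia modules are loaded
sudo rmmod nvidia_drm nvidia_modeset nvidia_uvm   # unload the dependencies first
sudo lsof /dev/nvidia*                            # if rmmod says a module is in use, find the holders
# kill those processes, then retry the rmmod commands above
sudo rmmod nvidia                                 # unload the core module last
lsmod | grep nvidia                               # should now print nothing
nvidia-smi                                        # reloads the modules and should list the GPUs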

Everything was good for a few hours, but now when I run nvidia-smi I get “No devices were found”. I can still see the devices:

(base) [root@paidsrfchtc01 nvidia]# sudo lshw -C display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:03:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller cap_list rom
configuration: driver=ast latency=0
resources: irq:17 memory:90000000-90ffffff memory:91000000-9101ffff ioport:3000(size=128)
*-display
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:5e:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:38fd0-38fcf iomemory:38fe0-38fdf irq:186 memory:a5000000-a5ffffff memory:38fdc0000000-38fdcfffffff memory:38fed0000000-38fed1ffffff memory:a6000000-a63fffff memory:38fdd0000000-38fecfffffff memory:38fed2000000-38fef1ffffff
*-display
description: 3D controller
product: TU104GL [Tesla T4]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:af:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list
configuration: driver=nvidia latency=0
resources: iomemory:39bd0-39bcf iomemory:39be0-39bdf irq:187 memory:ce000000-ceffffff memory:39bdc0000000-39bdcfffffff memory:39bed0000000-39bed1ffffff memory:cf0
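
As a quick cross-check, lspci can also confirm which kernel driver is bound to the two Tesla T4s (10de is the NVIDIA PCI vendor ID):

lspci -nnk -d 10de:    # lists NVIDIA devices with a "Kernel driver in use" line
lspci -s 5e:00.0 -k    # or inspect a single card by its bus ID
lspci -s af:00.0 -k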

The right driver and kernel module version are shown, but dkms is looking for 450.51.06:

(base) [root@xxxxxxxxx nvidia]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 515.86.01 Wed Oct 26 09:12:38 UTC 2022
GCC version: gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
(base) [root@xxxxxxxx nvidia]# cat /sys/module/nvidia/version
515.86.01
(base) [root@xxxxxxxx nvidia]# dkms status
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/450.51.06/source/dkms.conf does not exist.
(base) [root@paidsrfchtc01 nvidia]#

I have uploaded an nvidia bug report.

nvidia-bug-report.log.gz (844.0 KB)

Didn’t look at the bug report (yet).
What are the contents of /var/lib/dkms/nvidia?

ls -l /var/lib/dkms/nvidia

Are the sources still in /usr/src?
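
For reference, the checks being asked for amount to roughly:

ls -l /var/lib/dkms/nvidia     # which driver versions dkms still has trees for
ls -d /usr/src/nvidia-*        # which driver source trees are still present
dkms status                    # what dkms thinks is built/installed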

[99151.821156] NVRM: GPU 0000:5e:00.0: RmInitAdapter failed! (0x63:0xffff:2338)
[99151.823536] NVRM: GPU 0000:5e:00.0: rm_init_adapter failed, device minor number 0
[99151.824292] nvidia 0000:af:00.0: irq 188 for MSI/MSI-X
[99151.824326] nvidia 0000:af:00.0: irq 189 for MSI/MSI-X
[99151.824348] nvidia 0000:af:00.0: irq 190 for MSI/MSI-X
[99151.824376] nvidia 0000:af:00.0: irq 191 for MSI/MSI-X
[99151.824399] nvidia 0000:af:00.0: irq 192 for MSI/MSI-X
[99151.824426] nvidia 0000:af:00.0: irq 193 for MSI/MSI-X
[99152.116487] NVRM: GPU 0000:af:00.0: RmInitAdapter failed! (0x63:0xffff:2338)
[99152.118873] NVRM: GPU 0000:af:00.0: rm_init_adapter failed, device minor number 1
[99152.119572] nvidia 0000:af:00.0: irq 188 for MSI/MSI-X
[99152.119606] nvidia 0000:af:00.0: irq 189 for MSI/MSI-X
[99152.119624] nvidia 0000:af:00.0: irq 190 for MSI/MSI-X
[99152.119649] nvidia 0000:af:00.0: irq 191 for MSI/MSI-X
[99152.119670] nvidia 0000:af:00.0: irq 192 for MSI/MSI-X
[99152.119699] nvidia 0000:af:00.0: irq 193 for MSI/MSI-X

The log is flooded with these messages about failing to initialize the GPUs.
Not even a full dmesg output is visible.
Not sure what’s going on.
But as it suddenly happened, I’d start looking into the hardware side of things.
Reseat the cards and check the cabling.
If possible, test them in another system.

We are on Red Hat Linux, so we are using rpm or .run files (not dpkg/apt) to install/uninstall the driver and CUDA toolkit. I had also downloaded the 450 version and ran its uninstall, and it just uninstalled the 515 version of the driver. There was no entry in the rpm database or the yum history showing the 450 ever being installed. I removed all the CUDA, kernel-module, and driver components using the .run files.
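
For reference, removal via the .run files typically looks like the following (the installer filename here is an assumption, not taken from this system):

sudo sh ./NVIDIA-Linux-x86_64-515.86.01.run --uninstall
# or, equivalently, the uninstaller that the .run installer places on the system:
sudo nvidia-uninstall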

Here are the contents of those directories:

(base) [root@xxxxxxxxxx src]# ls -l /var/lib/dkms/nvidia
total 4
drwxr-xr-x 4 root root 4096 May 8 2022 450.51.06

(base) [root@xxxxxxxxxxxx src]# ls -al /usr/src
total 36
drwxr-xr-x. 7 root root 4096 Aug 8 14:51 .
drwxr-xr-x. 14 root root 4096 Feb 17 2021 ..
drwxr-xr-x. 3 root root 4096 Jun 13 16:58 debug
drwxr-xr-x. 5 root root 4096 Aug 6 13:45 kernels
drwxrwxr-x. 24 root root 4096 Feb 11 2020 linux-recomp
-rw-r--r--. 1 root root 4162 Oct 29 2019 nvidia-418--patch-for-supporting-5.2plus-kernels.diff
drwxr-xr-x 8 root root 4096 Aug 8 14:51 nvidia-515.86.01
drwxr-xr-x 2 root root 4096 Apr 12 19:19 nvidia-open-530.30.02

I do see there was an open-kernel-module version (530.30.02) installed in April, when we were trying to update the driver to mitigate a security vulnerability. The admin installed too high a version to be compatible with our application setup, so I had him remove it and replace it with the 515.

Does

sudo dkms uninstall -m nvidia -v 450.51.06
sudo dkms remove -m nvidia -v 450.51.06

work?

Otherwise, try moving /var/lib/dkms/nvidia/450.51.06 to another location for testing and then check dkms status again.
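
A minimal sketch of that test (the backup destination is an arbitrary choice):

sudo mv /var/lib/dkms/nvidia/450.51.06 /root/dkms-nvidia-450.51.06.bak   # stash the stale dkms tree
dkms status                                                              # check what dkms reports now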

(base) [root@xxxxxxxxx src]# dkms uninstall -m nvidia -v 450.51.06
Module nvidia 450.51.06 is not installed for kernel 3.10.0-1160.92.1.el7.x86_64 (x86_64). Skipping…
(base) [root@xxxxxxxx src]# dkms remove -m nvidia -v 450.51.06
Module nvidia 450.51.06 is not installed for kernel 3.10.0-1160.92.1.el7.x86_64 (x86_64). Skipping…
Module nvidia 450.51.06 is not built for kernel 3.10.0-1160.92.1.el7.x86_64 (x86_64). Skipping…
(base) [root@xxxxxxx src]#

I moved the directory, and dkms status now shows nothing.

(base) [root@xxxxxxx nvidia]# dkms status
(base) [root@xxxxxxxx nvidia]#

How about trying to reinstall the 515 modules?

dkms install -m nvidia -v 515.86.01

(base) [root@xxxxxxxx ]# dkms install -m nvidia -v 515.86.01
Sign command: /lib/modules/3.10.0-1160.92.1.el7.x86_64/build/scripts/sign-file
Signing key: /var/lib/dkms/mok.key
Public certificate (MOK): /var/lib/dkms/mok.pub
Creating symlink /var/lib/dkms/nvidia/515.86.01/source -> /usr/src/nvidia-515.86.01

Building module:
Cleaning build area…
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=3.10.0-1160.92.1.el7.x86_64 modules…
/sbin/dkms: line 1121: cd: /home/z1164718: Permission denied
Signing module /var/lib/dkms/nvidia/515.86.01/build/nvidia.ko
Signing module /var/lib/dkms/nvidia/515.86.01/build/nvidia-uvm.ko
Signing module /var/lib/dkms/nvidia/515.86.01/build/nvidia-modeset.ko
Signing module /var/lib/dkms/nvidia/515.86.01/build/nvidia-drm.ko
Signing module /var/lib/dkms/nvidia/515.86.01/build/nvidia-peermem.ko
Cleaning build area…

nvidia.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-uvm.ko.xz:
Running module version sanity check.
Module version for nvidia-uvm.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-modeset.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-modeset.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-drm.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-drm.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-peermem.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-peermem.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.
Error! Installation aborted.
(base) [root@paidsrfchtc01 z1164718]#

What’s that about?
You might open /sbin/dkms and investigate.

So does dkms status now show the 515 driver?
If not, I’d do an uninstall + remove and then an install again…
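
Roughly, reusing the version and kernel already shown in this thread:

sudo dkms uninstall -m nvidia -v 515.86.01
sudo dkms remove -m nvidia -v 515.86.01 --all
sudo dkms install -m nvidia -v 515.86.01
dkms status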

Sorry, I was in my home directory and root does not have privileges there. I reran it from the / directory. The 515 is now showing; however, I still get “No devices were found” when running nvidia-smi.

(base) [root@xxxxxxxx /]# dkms install -m nvidia -v 515.86.01

nvidia.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-uvm.ko.xz:
Running module version sanity check.
Module version for nvidia-uvm.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-modeset.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-modeset.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-drm.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-drm.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.
You may override by specifying --force.

nvidia-peermem.ko.xz:
Running module version sanity check.
Module version 515.86.01 for nvidia-peermem.ko.xz
exactly matches what is already found in kernel 3.10.0-1160.92.1.el7.x86_64.
DKMS will not replace this module.

(base) [root@xxxxxxx /]# dkms status
nvidia/515.86.01, 3.10.0-1160.92.1.el7.x86_64, x86_64: built

As before, dkms didn’t install anything.
I suggest you use the --force parameter for install.
Then try a reboot and immediately after create a new bug report.
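
Interpreted as commands, that suggestion would be something like the following (nvidia-bug-report.sh ships with the driver and writes nvidia-bug-report.log.gz into the current directory):

sudo dkms install -m nvidia -v 515.86.01 --force
sudo reboot
# after the reboot:
nvidia-smi
sudo nvidia-bug-report.sh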

Is it necessary to force the install and reboot when it is finding the same module versions in the kernel for 515.86.01, and 515.86.01 is showing in both of the commands below:

(base) [root@xxxxxxxx /]# dkms status
nvidia/515.86.01, 3.10.0-1160.92.1.el7.x86_64, x86_64: built

ls -l /var/lib/dkms/nvidia
total 4
drwxr-xr-x 3 root root 4096 Aug 10 13:46 515.86.01

or will a reboot suffice?

I don’t know. But at least dkms status should then show the correct status.

I looked at the source of dkms quickly, and as far as I understand, it’ll run depmod.
It doesn’t hurt, and it’s maybe a good idea if things somehow got messed up.
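
Done by hand instead of through dkms, that step would be roughly:

sudo depmod -a          # rebuild the module dependency map for the running kernel
sudo modprobe nvidia    # try loading the module again
nvidia-smi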

I went ahead and added the --force parameter, and yes, it did run depmod. I rebooted, and the GPUs have been online for the last 3 hours. I will continue to monitor the status before declaring victory, as we have tried multiple fixes for this issue, but it is looking good so far.
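
As an aside, a simple way to keep an eye on it while monitoring:

nvidia-smi -l 300                 # re-query the GPUs every 300 seconds
dmesg | grep -i rminitadapter     # check whether the earlier init failures come back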

The issue seems to be fixed.
