CentOS 7.9 (kernel 1160.71) w/ nvidia-driver 515.48.07 fails to recognize V100

Hello,

I’m testing the new 515.48.07 drivers on my system. The driver fails to recognize my GPUs even though the driver README states that the cards are supported. Rolling back to 515.43.04 resolves the issue. I would appreciate guidance on how to get 515.48.07 working.

  1. Fresh build, no prior nvidia-driver/cuda installed.
  2. CentOS 7.9 running kernel 3.10.0-1160.71.1
uname -a
Linux sa1 3.10.0-1160.71.1.el7.x86_64 #1 SMP Tue Jun 28 15:37:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  3. nvidia-driver-latest is installed
yum list installed | grep nvidia
Loaded plugins: fastestmirror, nvidia
kmod-nvidia-open-dkms.x86_64            3:515.48.07-1.el7              installed
nvidia-driver-latest.x86_64             3:515.48.07-1.el7              installed
nvidia-driver-latest-NVML.x86_64        3:515.48.07-1.el7              installed
nvidia-driver-latest-NvFBCOpenGL.x86_64 3:515.48.07-1.el7              installed
nvidia-driver-latest-cuda.x86_64        3:515.48.07-1.el7              installed
nvidia-driver-latest-cuda-libs.x86_64   3:515.48.07-1.el7              installed
nvidia-driver-latest-devel.x86_64       3:515.48.07-1.el7              installed
nvidia-driver-latest-libs.x86_64        3:515.48.07-1.el7              installed
nvidia-modprobe-latest.x86_64           3:515.48.07-1.el7              installed
nvidia-persistenced-latest.x86_64       3:515.48.07-1.el7              installed
nvidia-xconfig-latest.x86_64            3:515.48.07-1.el7              installed
  4. nvidia-persistenced fails to start
systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/usr/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2022-07-28 09:39:19 PDT; 58min ago

Jul 28 09:39:19 sa1 systemd[1]: Unit nvidia-persistenced.service entered failed state.
Jul 28 09:39:19 sa1 systemd[1]: nvidia-persistenced.service failed.
Jul 28 09:39:19 sa1 systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
Jul 28 09:39:19 sa1 systemd[1]: Stopped NVIDIA Persistence Daemon.
Jul 28 09:39:19 sa1 systemd[1]: start request repeated too quickly for nvidia-persistenced.service
Jul 28 09:39:19 sa1 systemd[1]: Failed to start NVIDIA Persistence Daemon.
Jul 28 09:39:19 sa1 systemd[1]: Unit nvidia-persistenced.service entered failed state.
Jul 28 09:39:19 sa1 systemd[1]: nvidia-persistenced.service failed.
  5. /var/log/messages shows the probe failing for both GPUs, and no devices are listed under /dev/nvidia*
Jul 28 10:38:48 sa1 kernel: [ 3736.677676] nvidia-nvlink: Nvlink Core is being initialized, major device number 230
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: The NVIDIA GPU 0000:61:00.0 (PCI ID: 10de:1db5)
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: installed in this system is not supported by open
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: nvidia.ko because it does not include the required GPU
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: System Processor (GSP).
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: Firmware' sections in the driver README, available on
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: the Linux graphics driver download page at
Jul 28 10:38:48 sa1 kernel: [ 3736.679266] NVRM: www.nvidia.com.
Jul 28 10:38:48 sa1 kernel: [ 3736.679341] nvidia: probe of 0000:61:00.0 failed with error -1
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: The NVIDIA GPU 0000:8a:00.0 (PCI ID: 10de:1db5)
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: installed in this system is not supported by open
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: nvidia.ko because it does not include the required GPU
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: System Processor (GSP).
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: Firmware' sections in the driver README, available on
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: the Linux graphics driver download page at
Jul 28 10:38:48 sa1 kernel: [ 3736.679591] NVRM: www.nvidia.com.
Jul 28 10:38:48 sa1 kernel: [ 3736.679651] nvidia: probe of 0000:8a:00.0 failed with error -1
Jul 28 10:38:48 sa1 kernel: [ 3736.679692] NVRM: The NVIDIA probe routine failed for 2 device(s).
Jul 28 10:38:48 sa1 kernel: [ 3736.679694] NVRM: None of the NVIDIA devices were initialized.
Jul 28 10:38:48 sa1 kernel: [ 3736.679938] nvidia-nvlink: Unregistered Nvlink Core, major device number 230
Jul 28 10:38:48 sa1 nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
  6. 2x V100 cards are attached
lspci -nnn | grep -i nvidia
61:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
8a:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)

The error message pretty much says it all. You chose to install the open source kernel modules, which only support Turing and newer GPUs. For Volta cards such as the V100, please revert to the proprietary kernel modules.
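
A quick way to confirm which kernel module flavor is installed is to look at the kmod package name in the yum listing. The helper below is only a sketch: the function name nvidia_kmod_flavor is made up for illustration, and it assumes the kmod-nvidia-open-* versus other kmod-nvidia-* naming seen in this thread.

```shell
# Hypothetical helper (not part of any NVIDIA tooling): classify the NVIDIA
# kernel module package from `yum list installed`-style text read on stdin.
# Assumes open modules ship as kmod-nvidia-open-* and proprietary modules as
# any other kmod-nvidia-* package, as in the package names quoted in this thread.
nvidia_kmod_flavor() {
    awk '
        # An open-flavor package always wins, whatever else is listed.
        $1 ~ /^kmod-nvidia-open/ { flavor = "open" }
        # Any other kmod-nvidia package counts as proprietary.
        $1 ~ /^kmod-nvidia/ && $1 !~ /^kmod-nvidia-open/ {
            if (flavor == "") flavor = "proprietary"
        }
        END { print (flavor == "" ? "none" : flavor) }
    '
}

# Usage on a live system:
#   yum list installed | nvidia_kmod_flavor
```

On the system described above this would report the open flavor, which matches the NVRM message about open nvidia.ko in the kernel log.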


Our rpms came from the Nvidia public repo. Does this mean that as of June we will need to build our own rpms?

The open source modules are in
kmod-nvidia-open-dkms-515.48.07-1.el7.x86_64.rpm
while the proprietary modules are in
kmod-nvidia-latest-dkms-515.48.07-1.el7.x86_64.rpm
So you should be able to switch.
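
A minimal sketch of the switch, assuming both packages are available from the same NVIDIA repo and that removing the open package first is acceptable on your system. The function name and the RUN variable are illustrative only. Depending on package dependencies, yum may propose removing dependent driver packages as well, so review the transaction before confirming.

```shell
# Sketch only: RUN defaults to "echo", so the commands are printed, not run.
# Set RUN= (empty) before sourcing this to actually execute them as root.
RUN=${RUN-echo}

switch_to_proprietary() {
    # Package names are the ones quoted in this thread.
    $RUN yum remove kmod-nvidia-open-dkms
    $RUN yum install kmod-nvidia-latest-dkms
}

# Usage:
#   switch_to_proprietary
```

After the switch, a reboot or a reload of the nvidia modules is likely needed before the GPUs show up. Treat that as an assumption to verify on your own system.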

Understood. Thank you for the help!