Hi there,
I have a Nutanix-based data center with about four Ubuntu 20.04 VMs in it. Over the last two weeks the NVIDIA driver has failed on each of the VMs, one after another, and now none of them are working.
> nvidia-smi
returns
> NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The GPU model we are using is GV100GL [Tesla V100 PCIe 32GB].
The kernel version is currently 5.15.0-52-generic.
The driver version is 515.65.01.
Note that the GPUs are directly attached and not virtualized.
Although I have listed specific versions above, I have tried many different combinations of kernel and driver versions.
In syslog I now see the following error, although I think it was working earlier and we had been using it for some time:
> Nov 30 20:56:45 NEV000GPUD03 systemd[1]: Failed to start NVIDIA Persistence Daemon.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426542] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: The NVIDIA GPU 0000:00:06.0 (PCI ID: 10de:1db6)
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: installed in this system is not supported by the
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: NVIDIA 515.65.01 driver release.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: in this release's README, available on the operating system
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: specific graphics driver download page at www.nvidia.com.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465291] nvidia: probe of 0000:00:06.0 failed with error -1
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465325] NVRM: The NVIDIA probe routine failed for 1 device(s).
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465326] NVRM: None of the NVIDIA devices were initialized.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465726] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 systemd-udevd[8486]: nvidia: Process '/sbin/modprobe nvidia-modeset' failed with exit code 1.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567323] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: The NVIDIA GPU 0000:00:06.0 (PCI ID: 10de:1db6)
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: installed in this system is not supported by the
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: NVIDIA 515.65.01 driver release.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
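Since the same NVRM block repeats many times and I need to compare it across the VMs, here is a minimal sketch of how the log could be condensed into one line per GPU and driver version. The function name summarize_nvrm and the two regular expressions are my own assumptions, based only on the wording of the messages quoted above; other driver releases may phrase them differently.

```python
import re

# Patterns derived from the NVRM messages quoted above (an assumption:
# other driver releases may word these lines differently).
GPU_RE = re.compile(
    r"NVRM: The NVIDIA GPU (\S+) \(PCI ID: ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\)"
)
DRIVER_RE = re.compile(r"NVRM: NVIDIA (\d+(?:\.\d+)+) driver release")


def summarize_nvrm(log_text):
    """Return sorted unique (bus_address, pci_id, driver_version) tuples
    for every 'GPU ... is not supported by the ... driver release' block."""
    results = set()
    pending = None  # GPU seen, waiting for the matching driver-version line
    for line in log_text.splitlines():
        gpu = GPU_RE.search(line)
        if gpu:
            pending = gpu.groups()
            continue
        drv = DRIVER_RE.search(line)
        if drv and pending:
            results.add(pending + (drv.group(1),))
            pending = None
    return sorted(results)
```

Run against the text of /var/log/syslog (for example, summarize_nvrm(open("/var/log/syslog").read())), this should collapse the repeated blocks into a short list that is easier to compare between machines.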
lsmod | grep nvidia does not return anything.
systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-11-30 21:40:25 +03; 367ms ago
Process: 131810 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, status=1/FAILURE)
Process: 131820 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
sudo modprobe nvidia -vv
modprobe: INFO: …/libkmod/libkmod.c:365 kmod_set_log_fn() custom logging function 0x55eb9ca86c70 registered
insmod /lib/modules/5.15.0-52-generic/updates/dkms/nvidia.ko
modprobe: INFO: …/libkmod/libkmod-module.c:892 kmod_module_insert_module() Failed to insert module '/lib/modules/5.15.0-52-generic/updates/dkms/nvidia.ko': No such device
modprobe: ERROR: could not insert 'nvidia': No such device
modprobe: INFO: …/libkmod/libkmod.c:332 kmod_unref() context 0x55eb9d9f5480 released
I have followed more than 100 solutions from various forums, including purging and reinstalling the driver, reinstalling the Linux headers, reinstalling gcc, etc.
I can provide the list of links here, but it is very long.
Please let me know if anyone can help.