NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

Hi There,

I have a Nutanix-based data center with Ubuntu 20.04 VMs in it (about 4 of them). Over the last 2 weeks we have been hitting NVIDIA driver failures in each of the VMs, one by one, and now none of them are working.

> nvidia-smi

returns

> NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The GPU model we are using is GV100GL [Tesla V100 PCIe 32GB].
The kernel version is currently 5.15.0-52-generic.
The driver version is 515.65.01.
Note that the GPUs are directly attached and not virtualized.

Although I have listed specific versions above, I have tried many different combinations of kernel and driver versions.
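
In case it helps anyone comparing notes, this is how I have been checking which driver module (if any) is actually built for the running kernel, assuming a DKMS-based install like the one shown in the modprobe output below:

# show the running kernel and the DKMS build status for each driver version tried
uname -r
dkms status
ls /lib/modules/$(uname -r)/updates/dkms/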

In syslog I now see the following error, though I believe it was working earlier and we had been using these machines for some time:

> Nov 30 20:56:45 NEV000GPUD03 systemd[1]: Failed to start NVIDIA Persistence Daemon.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426542] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: The NVIDIA GPU 0000:00:06.0 (PCI ID: 10de:1db6)
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: installed in this system is not supported by the
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: NVIDIA 515.65.01 driver release.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: Please see ‘Appendix A - Supported NVIDIA GPU Products’
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: in this release’s README, available on the operating system
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.426550] NVRM: specific graphics driver download page at www.nvidia.com.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465291] nvidia: probe of 0000:00:06.0 failed with error -1
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465325] NVRM: The NVIDIA probe routine failed for 1 device(s).
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465326] NVRM: None of the NVIDIA devices were initialized.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.465726] nvidia-nvlink: Unregistered Nvlink Core, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 systemd-udevd[8486]: nvidia: Process ‘/sbin/modprobe nvidia-modeset’ failed with exit code 1.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567323] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: The NVIDIA GPU 0000:00:06.0 (PCI ID: 10de:1db6)
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: installed in this system is not supported by the
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: NVIDIA 515.65.01 driver release.
> Nov 30 20:56:46 NEV000GPUD03 kernel: [ 1680.567330] NVRM: Please see ‘Appendix A - Supported NVIDIA GPU Products’

lsmod | grep nvidia does not return anything

systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-11-30 21:40:25 +03; 367ms ago
Process: 131810 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, status=1/FAILURE)
Process: 131820 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)

sudo modprobe nvidia -vv
modprobe: INFO: …/libkmod/libkmod.c:365 kmod_set_log_fn() custom logging function 0x55eb9ca86c70 registered
insmod /lib/modules/5.15.0-52-generic/updates/dkms/nvidia.ko
modprobe: INFO: …/libkmod/libkmod-module.c:892 kmod_module_insert_module() Failed to insert module ‘/lib/modules/5.15.0-52-generic/updates/dkms/nvidia.ko’: No such device
modprobe: ERROR: could not insert ‘nvidia’: No such device
modprobe: INFO: …/libkmod/libkmod.c:332 kmod_unref() context 0x55eb9d9f5480 released

I have followed more than 100 solutions given in various forums, including purging and reinstalling the driver, reinstalling the Linux headers, reinstalling gcc, etc.
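
For reference, the purge/reinstall cycle I kept repeating looks roughly like this (the exact package names varied with the driver branch I was trying at the time):

sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt autoremove
sudo apt install linux-headers-$(uname -r) build-essential
sudo apt install nvidia-driver-515   # or whichever branch was being tried
sudo reboot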

I could provide the list of links here, but it is a very long list.

Please let me know if anyone can help.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.
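
For reference, it just needs to be run from a root shell (or with sudo); it writes nvidia-bug-report.log.gz into the current directory:

sudo nvidia-bug-report.sh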

Thank you for the reply!

I have made a few more attempts; attached are the details, including the bug report.

nvidia-bug-report.log.gz (139.2 KB)
nvidia details (6.3 KB) (some more details from shell)

Note that I have changed the versions etc. after my initial message in the thread while trying out different combinations (desperate attempts :-) ).

Since I was getting an error saying the GPU is not supported by the driver release, I also tried manually installing the 460 version of the driver using the .run file.

The issue is still there unfortunately.

Please let me know if you need more details.

Thanks in advance!
Sunil

This seems to be a vGPU-based system, so the normal driver can't be used. You'll need to use the GRID driver; this should be provided by the cloud provider (Nutanix).
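
From inside the guest you can at least see what device the VM is exposing. As a rough check (the hypervisor vendor can confirm what the IDs mean), look at the PCI and subsystem IDs at the address from your syslog:

lspci -nn | grep -i nvidia
lspci -vnn -s 00:06.0    # 00:06.0 is the GPU address from the syslog above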

Thank you for your reply! Let me double-check this with the vendor (I don't have direct access).
What is still mysterious to me is how it was working until now. We have been using these machines for around 6 months, and they only recently started failing. We have been using NVIDIA drivers 460, 510, etc.
I will get back with more details.

Contacting your cloud provider might be best; apart from being a vGPU, the GPU might also be broken, but I can't see details from within a VM.

Just an update: I had a discussion with the cloud support team and they provided a GRID driver package (NVIDIA-GRID-Linux-KVM-450.191-450.191.01-453.51). This is working fine at the moment.
Though this has solved the primary issue, I am still confused about how it was working with the previously installed driver (460.106.00) and why it stopped working.
Anyway, for now, there is a solution.
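
For anyone landing here later, the guest-side install was the usual .run workflow; the exact file name inside the GRID package may differ from the one below, so treat this as a sketch:

# file name is illustrative; use the guest .run that ships in the GRID package
chmod +x NVIDIA-Linux-x86_64-450.191.01-grid.run
sudo ./NVIDIA-Linux-x86_64-450.191.01-grid.run
nvidia-smi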

Thanks for the inputs!!

The only explanation is that your VM got moved to a different host.

Hi. I get the same error. My situation: I dual-boot Windows 10 Pro and Ubuntu 18.04. My GPU is a GeForce RTX 4080 and I want to use it on Ubuntu, but when I run nvidia-smi it shows the same output as above. I already have NVIDIA driver 525 installed, but it didn't help. It seems my GPU is not visible to Ubuntu. What should I do?
Here are the details:
NVIDIA Linux Graphic Driver, version 525.60.11
Ubuntu 18.04.6 LTS
13th Gen Intel Core i7-13700k
Windows 10 pro
nvidia-bug-report.log.gz (156.1 KB)
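
Following the earlier replies, these are the first checks I am looking at (whether the driver module loads at all, and what the kernel log says when it tries):

lsmod | grep nvidia
sudo dmesg | grep -iE 'nvrm|nvidia'
sudo modprobe nvidia -vv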