Good old NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I’m having trouble installing the NVIDIA driver.
I used this guide as a basis and tried installing both with the ubuntu-drivers tool and manually. I have spent several days studying similar topics, but so far nothing has helped.

dkms status

nvidia/525.147.05, 6.5.0-26-generic, x86_64: installed
dpkg -l | grep nvidia

ii  libnvidia-cfg1-525:amd64                    525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-525                        525.147.05-0ubuntu0.22.04.1             all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-525:amd64                 525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-decode-525:amd64                  525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-egl-wayland1:amd64                1:1.1.9-1.1                             amd64        Wayland EGL External Platform library -- shared library
ii  libnvidia-encode-525:amd64                  525.147.05-0ubuntu0.22.04.1             amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-525:amd64                   525.147.05-0ubuntu0.22.04.1             amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-525:amd64                    525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-525:amd64                      525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  linux-modules-nvidia-525-5.15.0-101-generic 5.15.0-101.111+1                        amd64        Linux kernel nvidia modules for version 5.15.0-101
ii  linux-modules-nvidia-525-6.5.0-26-generic   6.5.0-26.26~22.04.1                     amd64        Linux kernel nvidia modules for version 6.5.0-26
ii  linux-modules-nvidia-525-generic            5.15.0-101.111+1                        amd64        Extra drivers for nvidia-525 for the generic flavour
ii  linux-objects-nvidia-525-5.15.0-101-generic 5.15.0-101.111+1                        amd64        Linux kernel nvidia modules for version 5.15.0-101 (objects)
ii  linux-objects-nvidia-525-6.5.0-26-generic   6.5.0-26.26~22.04.1                     amd64        Linux kernel nvidia modules for version 6.5.0-26 (objects)
ii  linux-signatures-nvidia-5.15.0-101-generic  5.15.0-101.111+1                        amd64        Linux kernel signatures for nvidia modules for version 5.15.0-101-generic
ii  linux-signatures-nvidia-6.5.0-26-generic    6.5.0-26.26~22.04.1                     amd64        Linux kernel signatures for nvidia modules for version 6.5.0-26-generic
ii  nvidia-compute-utils-525                    525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA compute utilities
ii  nvidia-dkms-525                             525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA DKMS package
ii  nvidia-driver-525                           525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-525                    525.147.05-0ubuntu0.22.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-525                    525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA kernel source package
ii  nvidia-prime                                0.8.17.2                                all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                             510.47.03-0ubuntu1                      amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-525                            525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                     0.18.2                                  all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-525               525.147.05-0ubuntu0.22.04.1             amd64        NVIDIA binary Xorg driver
sudo prime-select nvidia

Error: no integrated GPU detected.
uname -r

6.5.0-26-generic
lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy
nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
cat /proc/driver/nvidia/version

cat: /proc/driver/nvidia/version: No such file or directory
whereis nvidia

nvidia: /usr/lib/x86_64-linux-gnu/nvidia /usr/lib/nvidia /usr/share/nvidia /usr/src/nvidia-525.147.05/nvidia
mokutil --sb-state

SecureBoot disabled
lspci | egrep 'VGA|3D'

00:0f.0 VGA compatible controller: VMware SVGA II Adapter
03:00.0 3D controller: NVIDIA Corporation GA102GL [A40] (rev a1)

nvidia-bug-report.log.gz (1018.5 KB)

Hello @user150815, welcome back.

Did you by chance install the server version of Ubuntu or some form of data-center installation? The boot process there is rather different from that of a normal Ubuntu desktop. That should not be an issue as long as the installed kernel and kernel header versions are consistent.
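
One way to check the kernel and kernel header consistency is sketched below. It assumes Ubuntu's linux-headers-<release> package naming, and the function names headers_pkg_for and has_matching_headers are made up for illustration, so treat it as a starting point rather than a definitive test.

```shell
# Print the name of the headers package that should match a kernel release.
# Assumption: Ubuntu names it linux-headers-<release>.
headers_pkg_for() {
    printf 'linux-headers-%s\n' "$1"
}

# Succeed only if the headers package for the running kernel is fully installed.
has_matching_headers() {
    pkg=$(headers_pkg_for "$(uname -r)")
    [ "$(dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null)" = "install ok installed" ]
}
```

For example: has_matching_headers && echo "headers present" || echo "headers missing for $(uname -r)"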

What prompted that question was the presence of the “Fabric Manager”, which usually only comes with Tesla data-center GPU drivers and NVSwitch-based GPU topologies. But you only have one A40, correct?

In any case, you have a mismatch here: you installed a v525 driver while the Fabric Manager is v535.

So the first thing to try would be to purge all NVIDIA drivers and reinstall just one version.
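
To see whether more than one driver branch is installed, before or after purging, the branch number can be pulled out of the package names. This is a minimal sketch: it assumes package names that contain "nvidia" and carry the branch as a three-digit suffix (such as nvidia-driver-525), and nvidia_branches is a hypothetical helper name. Packages without a branch in their name, such as nvidia-settings, are ignored.

```shell
# Read `dpkg -l` output on stdin and print each distinct driver branch
# (e.g. 525, 535) found in the names of fully installed NVIDIA packages.
nvidia_branches() {
    awk '$1 == "ii" { print $2 }' |
        sed -nE 's/.*nvidia[a-z0-9-]*-([0-9]{3})(-.*|:.*)?$/\1/p' |
        sort -u
}
```

Usage would be dpkg -l | nvidia_branches. If it prints more than one line, several branches are installed side by side.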

And if you have physical access to the machine, check that the GPU is seated correctly, has the correct power supply, and receives adequate cooling. The repeated “GPU has fallen off the bus” error message in the logs might very well indicate hardware issues like these.

Hello @MarkusHoHo, yes, in one of my last attempts to get it to work (I was trying to install 535), I did install Fabric Manager and the NSCQ library.
All the attempts before that were made without them, and it still didn’t work.
By the way, I removed the drivers completely each time, so I assumed that Fabric Manager and the NSCQ library were removed as well. How can I check whether they are installed, and how can I remove them?
After the usual uninstall with sudo apt --purge remove '*nvidia*', I get:
sudo dpkg -l | grep nvidia returns an empty list.
dkms status returns an empty list.

The purge command line should take care of things. I am not familiar with the Fabric Manager, though. But as long as all driver versions are in sync, that should be fine.
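
To see whether Fabric Manager or NSCQ packages are still around, one option is to filter the full dpkg -l listing on those names rather than on "nvidia", since some related packages may not have "nvidia" in their name. This is only a sketch: it assumes the package names contain fabricmanager or nscq, and leftover_fabric_pkgs is a hypothetical name.

```shell
# Read `dpkg -l` output on stdin and print state, name and version of every
# package whose name mentions fabricmanager or nscq.
# A state of "rc" means the package was removed but its config files remain.
leftover_fabric_pkgs() {
    awk '$2 ~ /fabricmanager|nscq/ { print $1, $2, $3 }'
}
```

Usage would be dpkg -l | leftover_fabric_pkgs. Anything it lists could then be removed with sudo apt purge followed by the package name.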

If previous attempts yielded the same log file spam of “GPU has fallen off the bus”, I would really check the hardware for faults.
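
To compare attempts, the number of such messages in a kernel log can be counted with something like the sketch below. It assumes the message text contains "fallen off the bus" as quoted above, and count_bus_drops is a hypothetical helper name.

```shell
# Read a kernel log on stdin and print how many lines mention the GPU
# falling off the bus. Prints 0 (and still succeeds) when there are none.
count_bus_drops() {
    grep -ci 'fallen off the bus' || true
}
```

For the current boot this could be run as sudo dmesg | count_bus_drops, and for the previous boot as journalctl -k -b -1 | count_bus_drops. A non-zero count that persists across driver versions would point more towards hardware than towards the installation.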