NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

As part of my job, I did some maintenance on the servers I manage; all of them run Ubuntu 18.04 or 20.*.
Last week I used Ansible to update mounts, update & upgrade packages, and reboot them.
I had to leave early but got calls that tasks didn’t run, even though all the servers have nvidia-container-runtime.
When I checked by running ‘nvidia-smi’, I got this message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
To give more details about the hardware and OS: as said before, everything was Ubuntu, the NVIDIA card in 99% of the machines was a Tesla V100, and one was an A100, so I’m not sure it’s hardware related.
On most of the machines I used this installation: CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer.
What could have gone wrong?


If you installed the NVIDIA driver from .run files or from the driver bundled with the CUDA Toolkit, the driver may be lost when you upgrade your Linux kernel. You should reinstall the NVIDIA driver. You can install the driver with the --dkms option:

sudo sh NVIDIA-Linux-x86_64-470.xx.xx.run --dkms

--dkms
nvidia-installer can optionally register the NVIDIA kernel module sources, if installed, with DKMS, then build and install a kernel module using the DKMS-registered sources. This will allow the DKMS infrastructure to automatically build a new kernel module when changing kernels. During installation, if DKMS is detected, nvidia-installer will ask the user if they wish to register the module with DKMS; the default response is ‘no’. This option will bypass the detection of DKMS, and cause the installer to attempt a DKMS-based installation regardless of whether DKMS is present.
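
To check whether this is what happened on a given machine, you can look for an NVIDIA kernel module built for the running kernel. The following is a minimal sketch, not part of nvidia-installer: has_nvidia_module is a hypothetical helper name, and it assumes modules live under /lib/modules/<kernel release>, as they do on Ubuntu.

```shell
# Sketch: succeed if any nvidia*.ko file exists for the given kernel release.
# has_nvidia_module is a hypothetical helper, not an NVIDIA or DKMS command.
has_nvidia_module() {
    local release="${1:-$(uname -r)}"   # kernel release to check (default: running kernel)
    local root="${2:-/lib/modules}"     # module tree root (assumed Ubuntu layout)
    # Search the whole tree so the check does not depend on where the
    # installer or DKMS placed the module.
    find "$root/$release" -type f -name 'nvidia*.ko*' 2>/dev/null | grep -q .
}

if has_nvidia_module; then
    echo "nvidia module found for kernel $(uname -r)"
else
    echo "no nvidia module for kernel $(uname -r); the driver likely needs a rebuild or reinstall"
fi
```

If DKMS is installed, `dkms status` should also list which kernels the nvidia module has been built for.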


Simpler solution (Ubuntu only): install NVIDIA driver from PPA:

  1. Uninstall the NVIDIA drivers installed from .run files or bundled driver from CUDA Toolkit

  2. Add PPA graphics-drivers:

    sudo add-apt-repository ppa:graphics-drivers/ppa --yes
    sudo apt update
    
  3. Install NVIDIA driver from PPA:

    sudo apt install nvidia-driver-470  # or nvidia-driver-495
    
  4. (Optional) Mark the driver as hold to prevent auto-upgrading (since it is a server):

    dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | \
        grep -E 'nvidia.*-[0-9]+$' | \
        xargs -r -L 1 sudo apt-mark hold
    

The driver will persist across Linux kernel changes.
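
Before holding anything in step 4, it may help to preview which package names the filter actually selects. This is a small sketch that reuses the same grep pattern; select_versioned_nvidia is a hypothetical function name, and the sample package names are only illustrative input.

```shell
# Sketch: the same name filter used by the hold command, as a reusable function.
# select_versioned_nvidia is a hypothetical name chosen for this example.
select_versioned_nvidia() {
    # Keep names that contain "nvidia" and end in "-<digits>", e.g. nvidia-driver-470.
    grep -E 'nvidia.*-[0-9]+$' || true   # an empty result is not treated as an error
}

# Illustrative input: names ending in a version number pass the filter,
# nvidia-container-runtime does not.
printf '%s\n' nvidia-driver-470 libnvidia-gl-470 \
    xserver-xorg-video-nvidia-470 nvidia-container-runtime | select_versioned_nvidia

# After running the real hold command, list what is currently held.
if command -v apt-mark >/dev/null 2>&1; then
    apt-mark showhold
fi
```

Only the versioned driver packages are held this way, so unversioned packages such as nvidia-container-runtime keep upgrading normally.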


OK, so 2 servers were installed via the package manager and not from the ‘run’ file, so I tried the simpler solution.
Here is the result:

$ sudo apt install nvidia-driver-470
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-driver-470 : Depends: libnvidia-gl-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-extra-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-decode-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-encode-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: xserver-xorg-video-nvidia-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-cfg1-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-ifr1-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Recommends: libnvidia-decode-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-encode-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-ifr1-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-fbc1-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-gl-470:i386 (= 470.86-0ubuntu0.18.04.1)
E: Unable to correct problems, you have held broken packages.

OK, 2 of my servers were installed using the package manager; all the others were resolved by re-installing.
Those 2 return this:

user@server:~$ sudo apt install nvidia-driver-495
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-driver-495 : Depends: libnvidia-gl-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-extra-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-decode-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-encode-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: xserver-xorg-video-nvidia-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-cfg1-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-fbc1-495 (= 495.44-0ubuntu0.18.04.1) but it is not going to be installed
                     Recommends: libnvidia-decode-495:i386 (= 495.44-0ubuntu0.18.04.1)
                     Recommends: libnvidia-encode-495:i386 (= 495.44-0ubuntu0.18.04.1)
                     Recommends: libnvidia-fbc1-495:i386 (= 495.44-0ubuntu0.18.04.1)
                     Recommends: libnvidia-gl-495:i386 (= 495.44-0ubuntu0.18.04.1)
E: Unable to correct problems, you have held broken packages.
user@server:~$ sudo apt install nvidia-driver-470
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-driver-470 : Depends: libnvidia-gl-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-extra-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-decode-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-encode-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: xserver-xorg-video-nvidia-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-cfg1-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Depends: libnvidia-ifr1-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Recommends: libnvidia-decode-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-encode-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-ifr1-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-fbc1-470:i386 (= 470.86-0ubuntu0.18.04.1)
                     Recommends: libnvidia-gl-470:i386 (= 470.86-0ubuntu0.18.04.1)
E: Unable to correct problems, you have held broken packages.
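
For output like the above, one way to narrow things down is to pull out the dependency names apt refuses to install and inspect each one. This is only a sketch: unmet_depends is a hypothetical helper, and it assumes the error lines have exactly the shape shown in the output above.

```shell
# Sketch: print the package named in each
# "Depends: <pkg> (...) but it is not going to be installed" line read from stdin.
# unmet_depends is a hypothetical helper name.
unmet_depends() {
    sed -n 's/.*Depends: \([^ ]*\) .*but it is not going to be installed.*/\1/p'
}

# Demo on two lines copied from the output above; only the Depends line is reported.
unmet_depends <<'EOF'
 nvidia-driver-470 : Depends: libnvidia-gl-470 (= 470.86-0ubuntu0.18.04.1) but it is not going to be installed
                     Recommends: libnvidia-gl-470:i386 (= 470.86-0ubuntu0.18.04.1)
EOF

# Possible follow-up on an affected server (left commented out here):
#   apt-mark showhold                  # list packages currently on hold
#   apt-cache policy libnvidia-gl-470  # show installed/candidate versions and their origin
```

The `apt-cache policy` output shows which repository each candidate version comes from, which can reveal a package pinned to or installed from a different source than the PPA.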

Please run:

sudo apt install --fix-broken
sudo dpkg --configure -a
sudo apt install --fix-broken

to fix the issue:

E: Unable to correct problems, you have held broken packages.


Even with those commands, the issue wasn’t solved.
Eventually, the fastest way to fix the 2 machines that used the package manager was to purge everything NVIDIA and CUDA. I did it with:

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'

Then, once everything was clean, I ran:
sudo apt-get install linux-headers-$(uname -r)

From here, it’s the same for all VMs:
Download the latest .run file from the NVIDIA site and run it; accept the prompt to upgrade the current installation if asked, or install from scratch.
The driver is back to work.

The issue started after I did some updates and the Linux kernel changed.
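
Since the purge commands above remove a lot in one go, it can be worth listing what they would match first. A minimal sketch, assuming the same three name prefixes; purge_candidates is a hypothetical function name.

```shell
# Sketch: list installed packages whose names start with nvidia-, libnvidia- or cuda-,
# i.e. the same prefixes targeted by the three purge commands.
# purge_candidates is a hypothetical helper name.
purge_candidates() {
    grep -E '^(nvidia|libnvidia|cuda)-' || true   # no matches is not an error
}

if command -v dpkg-query >/dev/null 2>&1; then
    dpkg-query -W --showformat='${Package}\n' | purge_candidates
fi
```

After the reinstall, running `nvidia-smi` again confirms whether the driver is reachable.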


You should either install the driver from the package manager or the .run file, not both.

If you install the driver from the .run file, you should uninstall the driver from the package manager (if any), and vice versa.

BTW, if your server has a network connection, I still suggest installing the driver from APT instead of a .run file (driver .run file or CUDA .run file). It makes future driver upgrades much easier, since you will not need to visit the driver download page to download the .run file again.

Upgrade your driver:

# Mark unhold
dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | \
    grep -E 'nvidia.*-[0-9]+$' | \
    xargs -r -L 1 sudo apt-mark unhold

# Upgrade driver
sudo modprobe -r -f $(lsmod | grep '^nvidia' | awk '{ print $1 }')
sudo apt update && sudo apt upgrade
nvidia-smi

# Mark hold again
dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | \
    grep -E 'nvidia.*-[0-9]+$' | \
    xargs -r -L 1 sudo apt-mark hold

I didn’t say I used both; I said some of the servers were installed from the .run file and some from the package manager, not both.
Some firewall rules force me to use a 3rd-party package manager to download packages, and that is what causes the broken packages.
