Restart after a power outage: NVIDIA-SMI has failed

When my machine restarted after a power outage, nvidia-smi stopped working. It shows “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
I checked the forum posts and did the following.

The output of grep nvidia /etc/modprobe.d/* /lib/modprobe.d/* is:
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer

I have attached the log obtained after running sudo nvidia-bug-report.sh

The machine is a remote server and I use only ssh to connect to it. There is no monitor attached to the machine.

Can someone please help? I can send more relevant info if needed.

Thanks in advance.
nvidia-bug-report.log (431.8 KB)

You initially installed the driver using the .run installer but didn’t enable dkms. Meanwhile, you got a kernel upgrade for which the driver wasn’t compiled because of this. Please reinstall the driver enabling dkms.
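
One way to check this diagnosis is to look for an nvidia module file in the module tree of the running kernel. This is only a sketch, assuming the usual /lib/modules/<kernel release> layout. The function name has_nvidia_module and its optional second argument (an alternative root directory, there so the function can be tried on a scratch tree) are made up for illustration and are not part of any NVIDIA tool.

```shell
# has_nvidia_module KERNEL_RELEASE [ROOT]
# Succeeds if a file named nvidia.ko (possibly compressed, e.g. nvidia.ko.xz)
# exists anywhere under ROOT/lib/modules/KERNEL_RELEASE.
# ROOT is an illustrative extra; it defaults to empty, i.e. the real tree.
has_nvidia_module() (
    kver=$1
    root=${2:-}
    dir="$root/lib/modules/$kver"
    # No module directory for this kernel at all: nothing to find.
    [ -d "$dir" ] || exit 1
    # grep -q . succeeds as soon as find prints at least one path.
    find "$dir" -type f -name 'nvidia.ko*' | grep -q .
)

# Possible use on the live system (message wording is hypothetical):
# if has_nvidia_module "$(uname -r)"; then
#     echo "nvidia module present for $(uname -r)"
# else
#     echo "no nvidia module for $(uname -r)"
# fi
```

If the check fails for the kernel reported by uname -r but succeeds for an older kernel release, that would fit the explanation above.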

Thank you for the reply. I checked whether dkms is available on my machine. I tried “sudo apt-get install dkms” and the message was “dkms is already the newest version (2.3-3ubuntu9.7)”.

I uninstalled the previous driver using the instructions
sudo /usr/local/cuda-10.2/bin/cuda-uninstaller
and
sudo /usr/bin/nvidia-uninstall
Then downloaded the latest cuda toolkit 11.0 from nvidia (containing driver version 450.36.06) and did a fresh install.

I did not know how to enable dkms when installing the driver. The installer did not ask for any such options. I did find that at least for driver version 415.xx.xx there seems to be an explicit dkms registration step during the installation. That did not happen in my case.

Now things are hopefully fine until at least the next power outage.
Thanks again for your help.

What does dkms status tell you?

added: the driver’s source has been added.

built: the module has been built.

installed: the previous two have been applied and the module has been installed (that’s what you’d like to see).

sudo dkms autoinstall -m nvidia -v 450.36.06 -k $(uname -r) would do the last 2 steps.
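
The state can also be picked out of the dkms status output by a script. The sketch below assumes lines shaped like "nvidia, 450.36.06, 4.15.0-101-generic, x86_64: installed" (the comma form reported later in this thread) or the "nvidia/450.36.06, ..." form. The exact format differs between DKMS versions, so treat the parsing as an assumption. The function name dkms_state is made up for illustration.

```shell
# dkms_state MODULE KERNEL_RELEASE
# Reads `dkms status` style text on stdin and prints the state recorded for
# MODULE on KERNEL_RELEASE (added, built or installed), or "none" when no
# line matches. Lines without a kernel field count as source-only entries.
dkms_state() {
    awk -v mod="$1" -v kver="$2" '
        BEGIN { state = "none" }
        {
            i = index($0, ": ")
            if (i == 0) next                  # not a status line
            head = substr($0, 1, i - 1)       # module, version, kernel, arch
            rest = substr($0, i + 2)          # state, maybe with a remark
            split(rest, w, " ")               # keep only the first word
            n = split(head, f, "[,/] *")
            if (f[1] != mod) next
            if (n <= 2) {                     # no kernel listed on this line
                if (state == "none") state = w[1]
                next
            }
            for (k = 3; k <= n; k++)
                if (f[k] == kver) state = w[1]
        }
        END { print state }
    '
}

# Possible use on the live system:
# dkms status | dkms_state nvidia "$(uname -r)"
```

An answer of "none" would correspond to dkms status printing nothing for the module.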

dkms status prints nothing.

I am confused now. Should I do " sudo dkms autoinstall -m nvidia -v 450.36.06 -k $(uname -r)" and go for a fresh install again? Please let me know.
Thanks.

Then the cuda installer didn’t use dkms. On install, there should be a question whether to install the driver and use dkms. If not, you could try to install cuda using the --dkms option. At least this is the switch for the stand-alone driver.

If things are running now, it should survive the power outage, but not the next kernel upgrade.

If you want dkms (to automatically install along with a new kernel) yes… as generix just said.

I did try “sudo sh cuda_11.0.1_450.36.06_linux.run --dkms” option during installation. I got the reply as “Unknown option: --dkms”.
Then I went ahead without that option and simply tried “sudo sh cuda_11.0.1_450.36.06_linux.run” and things went fine.

No question was asked about dkms during the install.

I got the following log after installation:

$ sudo sh cuda_11.0.1_450.36.06_linux.run

= Summary =

Driver: Installed
Toolkit: Installed in /usr/local/cuda-11.0/
Samples: Installed in /home/xxx/, but missing recommended libraries

Please make sure that

  • PATH includes /usr/local/cuda-11.0/bin
  • LD_LIBRARY_PATH includes /usr/local/cuda-11.0/lib64, or, add /usr/local/cuda-11.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-11.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-11.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log

Thanks for the rapid replies.
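
The PATH and LD_LIBRARY_PATH advice from the installer summary could be applied along these lines. This is a sketch for a POSIX shell start-up file. The helper prepend_once and the variable CUDA_HOME are names chosen here for illustration and are not provided by the installer.

```shell
# prepend_once LIST DIR
# Prints the colon-separated LIST with DIR placed in front, unless DIR is
# already one of its entries. An empty LIST yields just DIR.
prepend_once() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # already present, keep as is
        "::")     printf '%s\n' "$2" ;;      # empty list
        *)        printf '%s\n' "$2:$1" ;;
    esac
}

# CUDA_HOME is an illustrative name; the path is the one from the summary.
CUDA_HOME=/usr/local/cuda-11.0
PATH=$(prepend_once "$PATH" "$CUDA_HOME/bin")
LD_LIBRARY_PATH=$(prepend_once "${LD_LIBRARY_PATH:-}" "$CUDA_HOME/lib64")
export PATH LD_LIBRARY_PATH
```

Checking for an existing entry first keeps the variables from growing each time the file is sourced.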

You could extract the cuda runfile, e.g.
./cuda_11.0.1_450.36.06_linux.run --extract=/home/username/Downloads/cuda --tmpdir=/home/username/Downloads/cuda.tmp
(this takes very long)
Inside the extracted directory there is then the driver’s .run installer, which you can run with --dkms

Sure. I will try this. Could you please tell me whether I need to uninstall the previously installed driver using “/usr/bin/nvidia-uninstall” or some other command?
I plan to do the following:

  1. Uninstall the previously installed driver (after your confirmation). Otherwise I will let the fresh install overwrite the existing one.
  2. Do sudo dkms autoinstall -m nvidia -v 450.36.06 -k $(uname -r) as suggested by Mart.
  3. Extract the driver installation component into a local dir as you suggested and then try to install just the driver using ./NVIDIA-Linux-x86_64xxxxxxx.run --dkms .
  4. Reboot after install.

I will let you know.
Thanks a lot.

The .run installer automatically uninstalls any previous driver, so there is no need to do this explicitly.
The second step is pointless without a driver installed; don’t do it.
After the reboot, you can check the driver with:
dkms status

During installation, if DKMS is detected, nvidia-installer will ask the user if they wish to register the module with DKMS; the default response is ‘no’. This option will bypass the detection of DKMS, and cause the installer to attempt a DKMS-based installation regardless of whether DKMS is present.

This is from the regular driver --help output.
This makes me wonder whether the setup skips the dkms question if it detects that dkms isn’t installed at all. So just to make sure: do you have dkms installed?

And sorry for not being clear enough before… of course the driver’s source needs to be added to dkms before it can be (auto)installed.

I tried sudo sh cuda_11.0.1_450.36.06_linux.run --extract=/home/username/Downloads/cuda --tmpdir=/home/username/Downloads/cuda.tmp but got the reply Unable to create temporary file in /home/username/Downloads/cuda.tmp

I tried a couple of other directories but that did not help either. I checked with the free command and it seems I have sufficient memory available.

So I decided to go without the --tmpdir option and just tried

sudo sh cuda_11.0.1_450.36.06_linux.run --extract=/home/username/Downloads/cuda

This worked fine and ls -l /home/username/Downloads/cuda/NVIDIA-Linux-x86_64-450.36.06.run shows the file to be present.
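
On the --tmpdir failure: free reports RAM and swap, not disk space, so it cannot rule out a full disk, while df reports free space per filesystem. Whether a missing or unwritable directory was the actual cause here is a guess that the thread does not confirm. The sketch below only shows how the directory could be checked beforehand. The function name check_dir is made up for illustration.

```shell
# check_dir DIR
# Prints "missing" if DIR is not an existing directory, "not-writable" if
# the current user cannot write to it, and "ok" otherwise.
check_dir() {
    if [ ! -d "$1" ]; then
        echo missing
    elif [ ! -w "$1" ]; then
        echo not-writable
    else
        echo ok
    fi
}

# Possible use before calling the runfile with --tmpdir:
# check_dir /home/username/Downloads/cuda.tmp
# Free disk space (not RAM) on the filesystem holding the download directory:
# df -h /home/username/Downloads
```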

Then I tried
sudo ./NVIDIA-Linux-x86_64-450.36.06.run --dkms

This gave two errors, one after the other:

ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

and

ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details.

As suggested by generix in another thread, I did the following:

  1. I ran ps a | grep X
    It showed an active X server session. I killed it, but a new session started right away.

  2. I tried sudo service lightdm stop. After this, ps a | grep X still showed an X server session.

  3. I tried sudo service gdm stop. This time the X server session ended, so gdm is the display manager on my machine.

  4. Then I ran sudo service nvidia-persistenced stop, but it replied Failed to stop nvidia-persistenced.service: Unit nvidia-persistenced.service not loaded.

  5. Then I ran sudo modprobe -r nvidia. The command completed without any output.
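
Before starting the installer again, it can help to confirm that no nvidia module is still loaded. The sketch below filters lsmod output for that. The function name loaded_nvidia_modules is made up for illustration, and the pattern deliberately skips nvidiafb, which is a different (framebuffer) driver.

```shell
# loaded_nvidia_modules
# Reads lsmod-style text on stdin (header line first) and prints the names
# of loaded modules called "nvidia" or starting with "nvidia_", one per line.
loaded_nvidia_modules() {
    awk 'NR > 1 && $1 ~ /^nvidia(_|$)/ { print $1 }'
}

# Possible use on the live system; no output means nothing is loaded:
# lsmod | loaded_nvidia_modules
```

On many systems helper modules such as nvidia_uvm, nvidia_modeset and nvidia_drm depend on nvidia, so if any of them show up they have to be unloaded before nvidia itself.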

After this, I tried installing the driver again using sudo ./NVIDIA-Linux-x86_64-450.36.06.run --dkms

Now the prompt “There appears to already be a driver installed on your system (version: 450.36.06). As part of installing this driver (version: 450.36.06), the existing driver will be uninstalled. Are you sure you want to continue?” appears. I choose Continue installation.

Then another prompt appears: “The distribution-provided pre-install script failed! Are you sure you want to continue?” I choose Continue installation again.

After this the much awaited prompt appears: “Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.” I answer Yes.

Now it asks “Install NVIDIA's 32-bit compatibility libraries?” I answer No.

It then checks the already installed files and at some stage displays “Installing DKMS kernel module:”. This takes some time to build.

Then a question appears: “Would you like to run the nvidia-xconfig utility to automatically update your X configuration file so that the NVIDIA X driver will be used when you restart X? Any pre-existing X configuration file will be backed up”

I answer No.

Finally it says
“Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 450.36.06) is now complete. Please update your xorg.conf file as appropriate; see the file /usr/share/doc/NVIDIA_GLX-1.0/README.txt for details.”
I choose Ok and the shell prompt reappears.

Then I reboot.

After the reboot I typed dkms status. It displays: nvidia, 450.36.06, 4.15.0-101-generic, x86_64: installed

nvidia-smi also displays the proper nvidia driver version.

I think dkms has been successfully registered now.

Thanks generix and Mart for the help.

I rechecked if I have dkms installed before reinstalling the driver. I have it installed. I checked this using the sudo apt-get install dkms command and it shows that dkms is already installed.

I did not get prompted for dkms because I was running sudo sh cuda_11.0.1_450.36.06_linux.run

But as described in my previous reply, when I reinstalled just the driver using sudo ./NVIDIA-Linux-x86_64-450.36.06.run --dkms, I got the prompt “Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later.”, to which I answered Yes.

I hope that the driver is now registered with dkms.
Thanks for clarifying again and thanks for the help.

Personally, I do the following (I don’t know if that’s the recommended way, but it does ensure that no X or nvidia-related process is running):
sudo systemctl isolate multi-user.target

or boot into a rescue prompt to install the .run driver.
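
Put together, the sequence from this thread could be scripted roughly as follows. It is a sketch, not a tested procedure. The helper names run and reinstall_with_dkms and the DRY_RUN variable are made up here. By default the commands are only printed, and they are executed only when DRY_RUN=0 is set. The installer is still interactive and will ask the questions described earlier.

```shell
# run CMD...
# Prints the command, and executes it only when DRY_RUN is set to 0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = 0 ]; then
        "$@"
    fi
}

# reinstall_with_dkms INSTALLER
# Leaves the graphical target so no X server is running, starts the
# stand-alone driver installer with --dkms, then shows the DKMS state.
reinstall_with_dkms() {
    run sudo systemctl isolate multi-user.target &&
    run sudo sh "$1" --dkms &&
    run dkms status
}

# Dry run, prints the three commands without executing them:
# reinstall_with_dkms ./NVIDIA-Linux-x86_64-450.36.06.run
```

A reboot afterwards, as done in the thread, brings the display manager back and loads the newly built module.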