CUDA (10.1) seen for Ubuntu 20.04/GeForce GTX/470 driver *until* reboot (RESOLVED)

After upgrading Ubuntu from 18.04 to 20.04, I found that CUDA 11.6 requires the 510 driver or higher, but problems with the 510 driver forced me to use the NVIDIA 470 driver instead. So I purged the CUDA installation, used PyTorch 1.11 (the cu113 pip wheel) to install CUDA, installed the 470 driver, and success: torch.cuda.is_available() reported “True”. Hurray!
Until I rebooted. :( Not only does torch.cuda.is_available() now report “False”, but the nvidia-smi command also gives the dreaded error:
“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
I have searched this forum and the web more broadly, but I am baffled. Please help! Below are the commands I ran for my initial success and then my crushing failure. Happy to provide more info for debugging.

sudo rm /etc/apt/sources.list.d/cuda*
sudo apt remove --autoremove nvidia-cuda-toolkit
sudo apt remove --autoremove nvidia-*
sudo rm -rf /usr/local/cuda*
sudo apt-get purge nvidia*
sudo apt-get update
sudo apt-get autoremove
sudo apt-get autoclean

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
sudo apt install nvidia-utils-470-server
nvidia-smi <— NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
(highest CUDA version the installed driver supports )
sudo apt install nvidia-cuda-toolkit
nvcc --version <— Cuda compilation tools, release 10.1, V10.1.243
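
A note on the three version numbers above, since they are easy to confuse: nvidia-smi prints the newest CUDA version the driver can support (11.4), nvcc prints the version of the apt-installed toolkit (10.1), and the cu113 wheel tag refers to the CUDA 11.3 runtime bundled with the PyTorch wheel. Here is a minimal Python sketch that pulls those numbers out of the output strings so they can be compared. The helper names are made up for illustration, and the final comparison is a simplified rule of thumb, not NVIDIA's full compatibility matrix.

```python
import re


def driver_cuda_version(nvidia_smi_header):
    """Newest CUDA version the driver supports, from nvidia-smi's header line."""
    m = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_header)
    return (int(m.group(1)), int(m.group(2))) if m else None


def nvcc_release(nvcc_output):
    """Toolkit version reported by `nvcc --version`."""
    m = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(m.group(1)), int(m.group(2))) if m else None


def wheel_cuda_version(tag):
    """CUDA runtime version encoded in a PyTorch wheel tag such as 'cu113'."""
    m = re.fullmatch(r"cu(\d+)(\d)", tag)
    return (int(m.group(1)), int(m.group(2))) if m else None


if __name__ == "__main__":
    # Strings copied from the command output in this post.
    driver = driver_cuda_version(
        "NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4"
    )
    toolkit = nvcc_release("Cuda compilation tools, release 10.1, V10.1.243")
    wheel = wheel_cuda_version("cu113")
    print("driver supports up to:", driver)
    print("apt toolkit (nvcc):", toolkit)
    print("PyTorch wheel runtime:", wheel)
    # Assumption: treat "driver CUDA version >= wheel CUDA version" as good enough.
    print("driver new enough for wheel:", driver >= wheel)
```

As far as I know, the pip-installed PyTorch uses its bundled runtime, so the older nvcc 10.1 toolkit only matters when compiling CUDA code locally.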

kate@kate-gamer:~$ python
Python 3.8.1 (default, Sep 9 2020, 19:38:36)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import torch
>>> torch.cuda.is_available()
True <---- WOO HOO!
>>> torch.__version__
'1.7.1'

==> After rebooting, sadly torch.cuda.is_available() is False.

kate@kate-gamer:~$ nvida-smi

Command ‘nvida-smi’ not found, did you mean:

command 'nvidia-smi' from deb nvidia-utils-435 (435.21-0ubuntu7)
command 'nvidia-smi' from deb nvidia-utils-440 (440.82+really.440.64-0ubuntu6)
command 'nvidia-smi' from deb nvidia-340 (340.108-0ubuntu5.20.04.2)
command 'nvidia-smi' from deb nvidia-utils-390 (390.147-0ubuntu0.20.04.1)
command 'nvidia-smi' from deb nvidia-utils-450-server (450.172.01-0ubuntu0.20.04.1)
command 'nvidia-smi' from deb nvidia-utils-470 (470.103.01-0ubuntu0.20.04.1)
command 'nvidia-smi' from deb nvidia-utils-470-server (470.103.01-0ubuntu0.20.04.1)
command 'nvidia-smi' from deb nvidia-utils-510 (510.60.02-0ubuntu0.20.04.2)
command 'nvidia-smi' from deb nvidia-utils-510-server (510.47.03-0ubuntu0.20.04.1)
command 'nvidia-smi' from deb nvidia-utils-418-server (418.226.00-0ubuntu0.20.04.2)

Try: sudo apt install <deb name>

sudo apt install nvidia-utils-470-server

kate@kate-gamer:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
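
In my understanding, this error means nvidia-smi itself is installed but cannot reach the NVIDIA kernel module. One quick check is whether any nvidia module is actually loaded; on Linux the loaded modules are listed in /proc/modules. A minimal Python sketch of that check follows (the function name is hypothetical). One assumption I have not verified on this machine: the nvidia-utils-* packages provide tools such as nvidia-smi, while the kernel module normally arrives via the nvidia-driver-* packages, so an empty result here would point at a missing or unbuilt module rather than at PyTorch.

```python
from pathlib import Path


def loaded_nvidia_modules(proc_modules_text):
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first field of each /proc/modules line is the module name.
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    path = Path("/proc/modules")
    # On systems without /proc/modules, fall back to an empty listing.
    text = path.read_text() if path.exists() else ""
    modules = loaded_nvidia_modules(text)
    if modules:
        print("NVIDIA kernel modules loaded:", ", ".join(modules))
    else:
        print("No NVIDIA kernel module is loaded.")
```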

==> Odd: just after the reboot, the “ubuntu-drivers devices” command indicated that the 470 driver was recommended; see the output below:

kate@kate-gamer:~$ ubuntu-drivers devices
WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-510: package has invalid Support PBheader, cannot determine support level
WARNING:root:_pkg_get_support nvidia-driver-510-server: package has invalid Support PBheader, cannot determine support level
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C20sv00001028sd00000802bc03sc00i00
vendor : NVIDIA Corporation
model : GP106M [GeForce GTX 1060 Mobile]
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-470 - distro non-free recommended
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin

==> But today, running the “ubuntu-drivers devices” command indicates that the 510 driver is now recommended…

kate@kate-gamer:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C20sv00001028sd00000802bc03sc00i00
vendor : NVIDIA Corporation
model : GP106M [GeForce GTX 1060 Mobile]
driver : nvidia-driver-510 - distro non-free recommended
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-390 - distro non-free
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
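
Since the recommendation changed between runs, it can help to extract it programmatically rather than eyeball it. A rough Python sketch, assuming the output format shown above (the function name is my own invention, not part of ubuntu-drivers):

```python
def recommended_driver(ubuntu_drivers_output):
    """Return the package on the 'driver' line marked 'recommended', or None."""
    for line in ubuntu_drivers_output.splitlines():
        key, sep, value = line.partition(":")
        # Skip WARNING lines, the device path header, and non-driver fields.
        if not sep or key.strip() != "driver":
            continue
        tokens = value.split()
        if tokens and "recommended" in tokens:
            return tokens[0]
    return None


if __name__ == "__main__":
    # Two lines copied from the second run in this post.
    sample = (
        "driver : nvidia-driver-510 - distro non-free recommended\n"
        "driver : nvidia-driver-470 - distro non-free\n"
    )
    print(recommended_driver(sample))
```

To use it on live output, the text of `ubuntu-drivers devices` could be piped in or captured with subprocess and passed to the same function.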

Maybe the 510 driver issues have been resolved???
In the interim I have run
sudo apt update
sudo apt-get install linux-headers-$(uname -r)

Would trying the 510 driver again be advisable? I may give that a shot next.
Update, Tuesday May 10, 2022: I am going with the 510 driver (again). I had almost forgotten that Ubuntu has a GUI for managing drivers: open “Software & Updates” and go to the “Additional Drivers” tab. I did that and selected the NVIDIA 510 driver from the list. The changes took a while to apply, and then I rebooted.

First excellent sign: my attached second monitor is finally displaying again!! I then checked whether PyTorch recognized CUDA, and torch.cuda.is_available() returned “True”.
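
For others who want more than a single True/False, here is a small sketch of a post-reboot check that also reports which PyTorch build is being imported. It relies only on standard torch attributes (torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.get_device_name); the function name cuda_report is hypothetical, and it returns early if torch is not importable in the current Python environment.

```python
import importlib.util


def cuda_report():
    """Collect PyTorch/CUDA facts without raising when torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built with; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If torch_version is not the release you just installed, the interpreter is probably importing a different installation than the one pip3 wrote to.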

My ordeal is over at last - hope these notes can help others.
