Kernel Update breaks CUDA/nvidia-smi

I have Ubuntu 16.04, CUDA 10.0 (installed from the local .deb), and driver 410.48 (installed automatically during the CUDA install). After booting into the recently updated kernel 4.4.0-145-generic, running nvidia-smi gives “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver”. Everything is fine if I boot into the older 4.4.0-142-generic.

Do I need to purge and reinstall CUDA to work with the new kernel?
Is this expected behaviour, a misconfiguration, or a bug?

Yes, a kernel update will typically break a CUDA install, unless you have DKMS properly set up and in use. That is expected behavior.

You need to reinstall CUDA to work with the new kernel.

A purge should only be necessary if you are switching CUDA versions.

I won’t be able to give a tutorial on using DKMS; it’s not an NVIDIA product, but there are various writeups on the web.
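
That said, purely as a rough sketch (this is generic DKMS usage, not anything NVIDIA-specific, so double-check it against one of those writeups), the usual pattern for a newly installed kernel is something like:

# see which modules DKMS knows about and which kernels they are built for
dkms status
# ask DKMS to build and install everything it manages for the new kernel
sudo dkms autoinstall -k 4.4.0-145-generic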

ok, thanks

Hi,

I know that you don’t want to give me a tutorial on DKMS, but I wonder if you might be able to give me any suggestions as to how to proceed in general. DKMS does appear to have been set up by the .deb installation, in the same way it has been with every other CUDA install I have done (see the diagnostic information below).

I have tried (all with sudo):
apt-get --purge remove cuda
reboot
apt update && apt install cuda
reboot [no difference]
dpkg --remove cuda
dpkg --remove cuda-10-0
dpkg --install cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64.deb
apt update && apt install cuda
reboot [No difference]

The docs [https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation] seem to suggest that simply running sudo apt-get --purge remove <package_name> should be sufficient to uninstall CUDA, but I think I was just removing and reinstalling a couple of tiny meta-packages (cuda and cuda-10-0); there were no errors or warnings, but nothing appeared to get rebuilt. There are about 20 cuda-* packages listed by dpkg. Maybe I should --remove each one and then rm everything under /usr/local/cuda and /usr/local/cuda-10.0?
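
For concreteness, what I have in mind is roughly the following (I have not run this yet, so it is only a sketch):

# list everything the local repo pulled in
dpkg -l | grep -E 'cuda|nvidia'
# purge every installed cuda-* package explicitly
dpkg -l | awk '$1 == "ii" && $2 ~ /^cuda/ {print $2}' | xargs sudo apt-get remove --purge -y
# then clean up whatever is left on disk
sudo rm -rf /usr/local/cuda /usr/local/cuda-10.0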

$ dkms status
bbswitch, 0.8, 4.4.0-142-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-145-generic, x86_64: installed
nvidia-410, 410.48, 4.4.0-142-generic, x86_64: installed

$ ls -alR /var/lib/dkms
-- very long output abridged --
/var/lib/dkms/nvidia-410/410.48/build:
total 228
drwxr-xr-x 10 root root   4096 Apr  4 16:54 .
drwxr-xr-x  4 root root   4096 Apr  4 16:54 ..
drwxr-xr-x  3 root root   4096 Apr  4 16:54 common
drwxr-xr-x  3 root root   4096 Apr  4 16:54 conftest
-rw-r--r--  1 root root   1209 Apr  4 16:54 conftest25991.c
-rwxr-xr-x  1 root root 135214 Apr  4 16:54 conftest.sh
-rw-r--r--  1 root root   1197 Apr  4 16:54 dkms.conf
-rw-r--r--  1 root root   6153 Apr  4 16:54 Kbuild
-rw-r--r--  1 root root   4547 Apr  4 16:54 Makefile
-rw-r--r--  1 root root  10295 Apr  4 16:54 make.log
-rw-r--r--  1 root root    236 Apr  4 16:54 modules.order
-rw-r--r--  1 root root     83 Apr  4 16:54 nv_compiler.h
drwxr-xr-x  2 root root   4096 Apr  4 16:54 nvidia
drwxr-xr-x  2 root root   4096 Apr  4 16:54 nvidia-drm
drwxr-xr-x  2 root root   4096 Apr  4 16:54 nvidia-modeset
drwxr-xr-x  3 root root  12288 Apr  4 16:54 nvidia-uvm
drwxr-xr-x  2 root root   4096 Apr  4 16:54 patches
drwxr-xr-x  2 root root   4096 Apr  4 16:54 .tmp_versions

nvidia-410 doesn’t look like a driver packaged by nvidia

that looks like something set up for a PPA, which is not an nvidia source
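
One quick way to check where that package actually came from (plain apt usage, just a suggestion):

apt-cache policy nvidia-410
# the version table shows which repository (local CUDA repo, Ubuntu archive, or a PPA) it was installed from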

when I am working with these things, I will often start with a clean install of the OS, and use only nvidia sources

Also, I generally use the runfile installers, not the package manager installers.

I don’t know if that will fix whatever problem you are having. You’re welcome to do as you wish of course

I started with a clean, standard Ubuntu Server 16.04.05 and was careful to document the whole installation process. I did not install any external PPAs or make any changes to any drivers. The exact commands used for the CUDA installation were:

wget -P ~ https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64
sudo dpkg -i cuda-repo-ubuntu1604-10-0-local-10.0.130-410.48_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update && sudo apt install cuda -y
echo 'export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}' >> ~/.bashrc
sudo systemctl reboot

The uninstall process recommended by the docs [https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation] is “sudo apt-get --purge remove <package_name>”.
I’m not entirely clear what is meant by <package_name> - using “cuda” and “cuda-10-0” doesn’t remove much. Does it sound sensible to manually type out all 20 or so cuda* package names, then rm -rf /usr/local/cuda*, and then try a runfile installation?

Searching stackexchange, askubuntu etc. indicates no consensus on the best way to uninstall a cuda deb install eg. https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one

The <package_name> is intended to refer to whatever package you installed.

If you did

sudo apt-get install cuda

then the <package_name> is cuda

If you did

sudo apt-get install cuda-10-0

then the package name is cuda-10-0

OK,

Having booted into the 145 kernel, I did:
sudo apt-get remove --purge cuda

  • This removed a tiny meta-package (about 25 KB) but left the other CUDA packages orphaned.

sudo apt-get autoremove

  • This removed all of the other cuda packages

sudo apt update && sudo apt install cuda

  • this reinstalled and appeared to rebuild cuda

I found that this made no difference to the problem: nvidia-smi still gave the expected output for kernel 142 but not for 145. My best guess is that the NVIDIA driver automatically installed by the .deb is not compatible with the recent 145 ubuntu-server kernel update.

Do you think it is worth trying to manually install a more recent driver?
Such as this: https://www.nvidia.co.uk/Download/driverResults.aspx/145270/en-uk

That could be. It happens from time to time.

I thought this was about DKMS. I didn’t realize you had tried to directly install the deb packages into the 145 kernel and were having trouble with that. But it’s clear now.

Obviously DKMS won’t work if the driver itself cannot be installed correctly.

You could probably confirm this by carefully studying the install logs and trying the latest linux driver.
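
For example (paths taken from the dkms listing you posted, so adjust as needed), the DKMS build log and the apt output should show whether the module build failed for the 145 kernel:

# DKMS build log for the packaged 410.48 driver
less /var/lib/dkms/nvidia-410/410.48/build/make.log
# apt/dpkg output from the most recent install attempt
less /var/log/apt/term.log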

I have not been able to install the new driver. I have done the following:

# get rid of cuda and old nvidia drivers
sudo apt-get remove --purge cuda
sudo apt autoremove
sudo apt-get remove --purge nvidia-410 nvidia-modprobe nvidia-settings -y

Confirm all the cuda pre-installation conditions are satisfied: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions

disable nouveau

put the following:

blacklist nouveau
options nouveau modeset=0

into:

/etc/modprobe.d/blacklist-nouveau.conf

then run:

sudo update-initramfs -u
sudo apt autoremove
sudo reboot
# confirm no nouveau or nvidia drivers running:
lsmod | grep nou
lsmod | grep nvidia
# Try to install latest nvidia driver
wget http://uk.download.nvidia.com/XFree86/Linux-x86_64/418.56/NVIDIA-Linux-x86_64-418.56.run
sudo sh ./NVIDIA-Linux-x86_64-418.56.run

Result:
“The distribution-provided pre-install script failed! Are you sure you want
to continue?”

When I choose to Abort, the installer advises me to look in the install log. There is no additional information in the install log.
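
I have not yet worked out which script the installer is complaining about; I believe (though I am not certain) it is a pre-install hook that Ubuntu ships alongside its own driver packages, in which case something like this should show what it actually does:

# assumed path - on Ubuntu the distribution pre-install hook is normally here
cat /usr/lib/nvidia/pre-install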

I can think of four possible ways forward:

I could ignore the warning and install anyway.

I could try using a 3rd party ppa:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

I could reinstall the OS from scratch and try to make installing the NVIDIA driver the first thing I do, followed by a runfile install of CUDA. I’m very hesitant to do this as it would take a long time to set everything back up again and configure all the applications, etc.

I could just give up and use the older 142 kernel forever.

Which of these seems like the least-bad idea to you?

OK, I went ahead and ignored the installer warning and installed the recent NVIDIA 418 driver.

nvidia-smi now gives the expected output in the 145 kernel, so it seems this was successful. The problem now is that, according to nvidia-smi, CUDA 10.1 has apparently installed itself along with the driver. I didn’t expect this, and I specifically want CUDA 10.0 because 10.1 is known not to work well with TensorFlow and other libraries yet. If I run

apt install cuda

OR

apt install cuda-10-0

amongst the large number of listed packages, nvidia-410 appears.

I probably don’t want to install multiple versions of the NVIDIA driver (?), so I guess I need to find a driver that is more recent than the one available from the .deb install but less recent than the one on the NVIDIA website?

Is it possible to remove cuda 10.1, retain the 418 driver and install cuda 10.0 without installing the 410 driver?

No, that is not what happened, and yes it is confusing.

Newer versions of nvidia-smi report, in the upper right-hand corner of the output, the highest CUDA version the installed driver supports. So a new enough driver will be compatible with CUDA 10.1 and will report that. It doesn’t mean that CUDA 10.1 is installed - the driver has no way of knowing that, and neither does nvidia-smi.

So this is an acceptable scenario for use of CUDA 10.0.

(Even if CUDA 10.1 were installed – it is not installed by the driver runfile installer you used – it is still possible to install CUDA 10.0 “alongside” CUDA 10.1 - but I digress).
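
If you want to confirm which toolkit, if any, is actually installed (as opposed to what the driver supports), check the toolkit itself, e.g.:

# what the toolkit reports about itself, if a toolkit is installed
/usr/local/cuda/bin/nvcc --version
cat /usr/local/cuda/version.txt
# the "CUDA Version" shown by nvidia-smi only reflects what the driver supports
nvidia-smi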

After installing the 418.56 driver you can either:

  • use a CUDA 10.0 runfile installer (deselect the option to install the bundled driver; see the sketch after this list)

or

  • use the .deb package method to install. In this case you need to use one of the meta packages:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas

specifically:

sudo apt-get install cuda-toolkit-10-0

should install the CUDA 10.0 toolkit without touching the driver.
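
For the runfile route (the first option above), a rough sketch of a toolkit-only install would be the following; the exact filename is an assumption, so use whatever the download page gives you:

# install the CUDA 10.0 toolkit only, skipping the bundled 410.48 driver
sudo sh cuda_10.0.130_410.48_linux --silent --toolkit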

I installed CUDA 10.0 using the .run file, saying “no” to the driver. I now have CUDA working on the 145 kernel. Thanks for the support. Hopefully the driver will keep being rebuilt against future kernel updates as well?
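
(I suppose one way to find out is to check whether the runfile driver registered itself with DKMS; if it did, it should show up like this:)

# the 418.56 module should be listed here if it was registered with DKMS during the runfile install
dkms status
# and this shows which driver version is actually loaded
modinfo nvidia | grep ^version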

I’d like to suggest a small update to the docs:
sudo apt-get remove --purge cuda just removes a small meta-package and leaves all of CUDA present on the system, ready and willing to cause conflicts; sudo apt-get remove --purge cuda && sudo apt-get autoremove does do the job.

I install cuda with the network deb and never have a problem with kernel updates.

Yes, as I say, this is something I have done quite a few times without any problems as well, including on several machines that I am currently using / maintaining. It would seem that the problem here was that the kernel update for Ubuntu Server 16.04 wouldn’t work with the older .deb driver.