Safe to update driver 396.37->54 with CUDA 9.2?

I recently installed CUDA using the deb file cuda-repo-ubuntu1710-9-2-local_9.2.148-1_amd64.deb. At the time, I had already installed the latest driver, and the deb complained about a version conflict. So I removed both the existing driver and the partial installation of CUDA, then reran the deb file and let it install its preferred version of the driver, which was 396.37. CUDA worked, although graphics were broken. (Every purely computational program in the Samples seemed to pass, but anything that tried to open a window would crash.)

Anyway, my software manager (Discover) keeps asking me if I want to update to 396.54. Is this safe, given that the deb installer complained about my existing driver? In other words, can CUDA 9.2 use the latest driver?

Yes, of course. Unless the driver has a bug.

The update failed, and I found the following in /var/log/apt/term.log:

Preparing to unpack …/12-libnvidia-gl-396_396.54-0ubuntu0~gpu18.04.1_amd64.deb

De-configuring libnvidia-gl-396:i386 (396.51-0ubuntu0~gpu18.04.1) …
dpkg-query: no packages found matching libnvidia-gl-390
Unpacking libnvidia-gl-396:amd64 (396.54-0ubuntu0~gpu18.04.1) over (396.51-0ubuntu0~gpu18.04.1) …
dpkg: error processing archive /tmp/apt-dpkg-install-AFcdAv/12-libnvidia-gl-396_396.54-0ubuntu0~gpu18.04.1_amd64.deb (--unpack):
trying to overwrite ‘/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json’, which is also in package nvidia-396 396.37-0ubuntu1
Preparing to unpack …/13-libnvidia-gl-396_396.54-0ubuntu0~gpu18.04.1_i386.deb …
De-configuring libnvidia-gl-396:amd64 (396.51-0ubuntu0~gpu18.04.1) …
dpkg-query: no packages found matching libnvidia-gl-390

So it seems my system has remnants of version 51, even though nvidia-smi reports that 37 is being used.
I assume 51 is the version I originally installed, and then removed with apt when the CUDA deb installer complained.

But the immediate problem seems to be the conflict involving /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
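
A quick way to see the clash is to pull the path and the owning package out of each "trying to overwrite" line in the log. This is only a sketch that assumes the log format quoted above; the function name overwrite_conflicts is made up for illustration. On the live system, dpkg -S followed by the path is the direct way to ask which installed package owns a file.

```shell
# Hypothetical helper: list dpkg file-overwrite conflicts found in an apt log.
# Reads the files given as arguments (or stdin) and prints one line per
# conflict: "<path> <package that already owns it> <that package's version>".
overwrite_conflicts() {
    sed -n 's|.*trying to overwrite [^/]*\(/[A-Za-z0-9_./+-]*\).*which is also in package \([^ ]*\) \([^ ]*\).*|\1 \2 \3|p' "$@"
}

# Example (assumes the log location mentioned above):
#   overwrite_conflicts /var/log/apt/term.log
```

For the log above, this would report that the wayland JSON file is already owned by nvidia-396 396.37-0ubuntu1, which is the package the CUDA deb pulled in.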

So the next question is: how to do this update?

It seems I have two versions installed, 396.37 and 396.51. But they are in two different places. I think 51 came from the Ubuntu repositories while 37 came directly from nVIDIA. In other words, 51 was installed the “Ubuntu way” while 37 was installed the “nVIDIA way.” In /var/cache/apt/archives I only have filenames with 396.51 and 54, no 37. I assume the 54’s are from when I tried “sudo apt-get upgrade” and it got partway before aborting.

As I said in my OP, I first tried installing with apt-get. I think I ended up with 51. Then I ran the CUDA installation deb, and it seems to have installed 37 on its own.

The following are in /usr/lib/i386-linux-gnu:

/usr/lib/i386-linux-gnu$ ls *vid*
libEGL_nvidia.so.0             libnvidia-fbc.so.396.54
libEGL_nvidia.so.396.51        libnvidia-glcore.so.396.51
libGLESv1_CM_nvidia.so.1       libnvidia-glsi.so.396.51
libGLESv1_CM_nvidia.so.396.51  libnvidia-glvkspirv.so.396.51
libGLESv2_nvidia.so.2          libnvidia-ifr.so
libGLESv2_nvidia.so.396.51     libnvidia-ifr.so.1
libGLX_nvidia.so.0             libnvidia-ifr.so.396.54
libGLX_nvidia.so.396.51        libnvidia-opencl.so.1
libnvidia-eglcore.so.396.51    libnvidia-opencl.so.396.37
libnvidia-fbc.so               libnvidia-tls.so.396.51
libnvidia-fbc.so.1

nvidia:
xorg

As can be seen, it’s mostly 51, but there are a few pieces of 37 and 54.
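
Rather than eyeballing the listing, the version mix can be tallied from the file names. A minimal sketch, assuming the versioned libraries end in .so.<major>.<minor> as they do in the listing above; count_driver_versions is a hypothetical name.

```shell
# Hypothetical helper: tally NVIDIA shared libraries in a directory by the
# driver version embedded in their file names (e.g. libGLX_nvidia.so.396.51).
# Prints "<version> <count>" per version; names such as libGLX_nvidia.so.0
# have no major.minor suffix and are skipped.
count_driver_versions() {
    ls "$1" |
        sed -n '/nvidia/s/.*\.so\.\([0-9][0-9]*\.[0-9][0-9]*\)$/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example:
#   count_driver_versions /usr/lib/i386-linux-gnu
```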

Various files with the substring 396.37 are located in:

/var/lib/dkms/nvidia-396/396.37
/var/cuda-repo-9-2-local/
/usr/share/nvidia/
/usr/share/nvidia-396/
/usr/lib/x86_64-linux-gnu/libcuda.so.396.37
/usr/lib/i386-linux-gnu/
/usr/lib/nvidia-396/
/usr/src/
/usr/lib32/

All the actual .ko files appear to be 37:

$ sudo find / -name '*vidi*ko' | xargs -l modinfo | grep '^version'
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37
version:        396.37

And nvidia-smi says it’s running 37.

So when I tried to upgrade with apt-get, it got confused because it was trying to update the 396.51 files rather than the more numerous (and actually installed) 396.37 files.
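
To see what apt itself believes is installed, the package database can be listed by version. The sketch below is only an illustration: nvidia_pkg_versions is a hypothetical name, and it simply filters "<package> <version>" lines, which dpkg-query can produce with its -W and -f options.

```shell
# Hypothetical helper: from "<package> <version>" lines on stdin, keep the
# packages with "nvidia" in their name and print "<version> <package>",
# sorted by version string, so packages from different driver releases
# stand out.
nvidia_pkg_versions() {
    awk '$1 ~ /nvidia/ { print $2, $1 }' | sort
}

# Example, feeding it the packages dpkg knows about:
#   dpkg-query -W -f='${Package} ${Version}\n' | nvidia_pkg_versions
```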

So my questions are:

  1. Can I just ignore the version 51 files, or should I get rid of them? (And what is the clean way of removing them without damaging the CUDA installation?)

  2. What is the nVIDIA-centric way of keeping its driver up to date without breaking the CUDA installation?

It’s a bit odd that the Ubuntu driver didn’t just replace the driver that came with CUDA. At least this worked previously with Ubuntu 16.04.
Nevertheless, the better way to install this is to install the driver from the Ubuntu repositories or the graphics PPA, then use the .deb from NVIDIA’s site and install cuda-toolkit, not cuda.

You mean do all the usual steps for installing from deb, but finish with

sudo apt-get install cuda-toolkit

rather than

sudo apt-get install cuda

So what’s the difference between the two?

Exactly.
‘cuda’ is a meta package which installs

  • driver
  • cuda-toolkit
  • cuda-samples

So if you want to install the driver from another repository, you shouldn’t use it.

Just for info, I wondered why the direct driver upgrade failed. Looks like you were using the 17.10 cuda package on an 18.04 system. Driver packaging changed completely from 17.10 to 18.04 so you got a mixed-up install after installing the distro driver. I think you should uninstall both first and start anew.

Yes, I see in /var/lib/apt/lists/_var_cuda-repo-9-2-local_Packages that for cuda it says:

Description: CUDA meta-package
Meta-package containing all the available packages required for native CUDA
development. Contains the toolkit, samples, driver and documentation.

while for cuda-toolkit it says:

Description: CUDA Toolkit 9.2 meta-package
Meta-package containing all the available toolkit packages related to native
CUDA development. Contains the toolkit, samples, and documentation.
Locked at CUDA Toolkit version 9.2.

So the file implies that cuda == cuda-toolkit + drivers

And I followed some of the dependencies, and it sort of looks like that is the case.
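
Following the dependencies by hand gets tedious, so here is a small sketch that prints one field of one stanza from a Packages index like the one above. The helper name pkg_field is hypothetical, and it assumes the field of interest fits on a single line, as Depends does. On a configured system, apt-cache depends gives similar information.

```shell
# Hypothetical helper: print one field of one package stanza from an apt
# Packages index. Usage: pkg_field <Packages-file> <package> <field>
pkg_field() {
    awk -v pkg="$2" -v field="$3" '
        BEGIN { RS = ""; FS = "\n" }   # one stanza per record, one line per field
        {
            name = ""; val = ""
            for (i = 1; i <= NF; i++) {
                if (index($i, "Package: ") == 1) name = substr($i, 10)
                if (index($i, field ": ") == 1)  val  = substr($i, length(field) + 3)
            }
            if (name == pkg && val != "") print val
        }' "$1"
}

# Example (path taken from the post above):
#   pkg_field /var/lib/apt/lists/_var_cuda-repo-9-2-local_Packages cuda Depends
```

Running it for cuda and then for each package it names shows where the driver dependency enters the chain.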

But will cuda-toolkit provide everything else (besides the driver) needed to use the CUDA functionality? Or do I need a separate driver? Are all the kernel “hooks” for accessing the CUDA functions in the GPU already built into the 396 driver? (That is, there is only one driver whether or not one installs CUDA; there aren’t separate CUDA-capable and graphics-only drivers?)

BTW don’t I need to specify cuda-toolkit-9-2 because there is no cuda-toolkit package in the list? Or will apt-get look for unique completions?

Yes, you might be right there. Since it’s possible to install different CUDA versions side by side, it’s probably useful to specify the version.

So I did the following:

  1. Got rid of both CUDA and whatever nvidia drivers were there:

sudo apt-get --purge remove cuda
sudo apt-get remove --purge 'nvidia-*'
sudo apt autoremove

  2. Made sure any blacklists in /etc/modprobe.d that blacklisted nouveau were gone, so that I could reboot into nouveau. Then ran

sudo update-initramfs -u

  3. Rebooted. Verified the system was back to using nouveau and not nvidia (lsmod | egrep '(nouv|nvid)'). Then ran

ubuntu-drivers devices

to check that it was still recommending nvidia-396 (it did).

  4. Installed the latest nvidia driver from the Ubuntu repo:

sudo ubuntu-drivers autoinstall

(This took about half an hour because my apt cache still had 37 sitting around and it had to get 54 from the repo.)

  5. Rebooted. Verified the system was using nvidia and not nouveau. Ran glmark2 and got good scores (unlike before when I had the 37 and 51 on at the same time).

  6. Finally, ran

sudo apt-get install cuda-toolkit-9-2

This didn’t take too long as the deb installation was still in my cache.

  7. Tested by running NVIDIA_CUDA-9.2_Samples/5_Simulations/nbody/nbody. (I had compiled this before but it would crash as soon as the graphics window appeared.) Now it works as expected.
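
The "which driver am I on" check done after each reboot in the steps above can be wrapped in a few lines. This is a sketch, not something from the thread: active_gpu_driver is a hypothetical name, and it only looks at the module names in lsmod output.

```shell
# Hypothetical helper: read `lsmod` output on stdin and report which GPU
# kernel driver is loaded: "nvidia", "nouveau", "both", or "none".
active_gpu_driver() {
    awk '
        $1 == "nvidia"  { nv = 1 }
        $1 == "nouveau" { no = 1 }
        END {
            if (nv && no) print "both"
            else if (nv)  print "nvidia"
            else if (no)  print "nouveau"
            else          print "none"
        }'
}

# Example:
#   lsmod | active_gpu_driver
```

It should print nouveau after the first reboot and nvidia after the second; anything else suggests the blacklist or initramfs step did not take effect.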

So the trick was to install cuda-toolkit-9-2 rather than cuda.

If any nVIDIA manual writers are reading this:
It would be great if you could mention this in the installation guide, as it would have saved me a lot of headaches.

For instance, in step (5) of section 3.6 of the installation guide at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html, where it says

Install CUDA

$ sudo apt-get install cuda

Change it to

Install CUDA

If the nVIDIA driver has already been installed on the system, run

$ sudo apt-get install cuda-toolkit-9-2

otherwise,

$ sudo apt-get install cuda
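
The same decision could be scripted. The sketch below is an assumption-laden illustration, not anything from the guide: choose_cuda_package is a hypothetical name, and it treats a readable /proc/driver/nvidia/version as the sign that an nVIDIA driver is present. That file only appears once the kernel module is loaded, so a driver that is installed but not yet loaded would be missed.

```shell
# Hypothetical helper: print the package the amended step would install.
# Assumption: a loaded NVIDIA kernel driver exposes /proc/driver/nvidia/version;
# the optional argument overrides that path (useful for testing).
choose_cuda_package() {
    version_file=${1:-/proc/driver/nvidia/version}
    if [ -r "$version_file" ]; then
        echo cuda-toolkit-9-2   # driver already present: toolkit only
    else
        echo cuda               # no driver detected: meta-package with driver
    fi
}

# Example:
#   sudo apt-get install "$(choose_cuda_package)"
```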

I was already at 18.04 when I started thinking about CUDA (I want to run Tensorflow and other DL tools) and didn’t want to revert to a non-LTS version. I did see some webpages claiming that CUDA could be installed on 18.04, but the part about installing the toolkit separately from the driver was buried in the details and not emphasized.

There’s also a detailed manual: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas