NVIDIA driver failing to load after upgrade from CUDA 8 to CUDA 9.2

In order to use TensorFlow >= 1.5.0, I tried to upgrade my Ubuntu 16.04 server with two GTX 1070 GPUs from CUDA 8 to CUDA 9.

On my first attempt, I used the local .deb installer for 9.1, but after installation, when I ran nvidia-smi, it complained:

“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”

I tried to download and reinstall the driver manually at this stage, but the .run file aborted saying the preinstall failed.
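I didn’t dig into the runfile’s own log at that point; if I remember rightly it goes to /var/log/nvidia-installer.log, and the failing pre-install step is usually named near the end:

$ sudo tail -n 50 /var/log/nvidia-installer.log   # shows why the pre-install check failed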

Today, I returned to the problem, this time installing 9.2 (using the network installer) after explicitly removing the old CUDA 8 install (which I realised I’d forgotten to do the first time) and, of course, the attempted 9.1 install. However, the same problem occurs: nvidia-smi reports the same error.

Note that ‘lspci | grep -i nvidia’ confirms the GPUs are there. Also, I get the following error from ‘sudo /sbin/modprobe nvidia’:

“modprobe: ERROR: could not insert ‘nvidia_396’: Exec format error”

I haven’t tried reinstalling the driver - is that what I need to do next?
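For reference, these are roughly the checks I’ve run so far (nvidia_396 is just the module name from the modprobe error on my system):

$ lspci | grep -i nvidia                                # both GTX 1070s show up on the PCI bus
$ dmesg | grep -iE 'nvidia|nvrm' | tail -n 20           # kernel log usually says why the insert failed
$ uname -r                                              # running kernel version
$ modinfo nvidia_396 | grep -iE '^(version|vermagic)'   # module version and the kernel it was built for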

You may need to follow the driver uninstall procedures from the documentation:

Installation Guide Linux :: CUDA Toolkit Documentation

Thanks for the reply, however /usr/bin/nvidia-uninstall does not exist.

Use the following command to uninstall a Toolkit runfile installation:

$ sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl

/usr/bin/nvidia-uninstall will not exist for a network deb install, and likewise the previously mentioned perl script for the toolkit uninstall will not exist either. This is already indicated in the linked instructions.

:) thank you for the explanation.

Then, as it seems to me, the only remaining way to remove the previously installed packages is the command below:

sudo apt-get --purge remove <package_name>          # Ubuntu
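For example, something along these lines should cover both the toolkit and the driver packages (the names are only my guess at the usual ones; check ‘apt list --installed’ for what is really there):

sudo apt-get --purge remove "cuda*"        # CUDA toolkit packages
sudo apt-get --purge remove "nvidia-*"     # driver packages
sudo apt-get autoremove --purge            # drop orphaned dependencies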

I had already purged cuda-8.0 (and the botched cuda-9.1) before installing 9.2, and ‘sudo apt list --installed | grep nvidia’ shows the following:

nvidia-396/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-396-dev/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-opencl-icd-396/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-prime/xenial,now 0.8.2 amd64 [installed,automatic]
nvidia-settings/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]

It looks to me like the old drivers aren’t there, but modprobe doesn’t like the new driver:

$ sudo /sbin/modprobe nvidia
modprobe: ERROR: could not insert 'nvidia_396': Exec format error

Any idea as to the most likely reason for this message?

Maybe you could try a local runfile installation and see whether the outcome is different?

I managed to fix my problem.

It turns out that although I’d ensured gcc 5.4.0 was installed, the system was still defaulting to gcc 4.9.3.

I fixed that via:

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 1
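A quick way to check which compiler the system resolves to, before and after the switch:

$ gcc --version                             # should now report 5.4.0
$ sudo update-alternatives --config gcc     # lists registered alternatives and the current default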

Then, after purging and reinstalling CUDA, everything worked until I realised I needed CUDA 9.0 for TensorFlow 1.5.0+, not 9.2. Cue another purge and install of 9.0, and I then got my setup working with the latest tensorflow_gpu version.
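For anyone retracing this, the final 9.0 install and sanity check were roughly as follows (cuda-9-0 is the versioned metapackage name in NVIDIA’s network repo, so apt doesn’t pull 9.2 back in; the Python one-liner just lists the devices TensorFlow can see):

sudo apt-get install cuda-9-0     # versioned metapackage, stays on 9.0
nvidia-smi                        # driver loads and shows both GTX 1070s
nvcc --version                    # should report release 9.0
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"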