In order to use TensorFlow >= 1.5.0, I tried to upgrade my Ubuntu 16.04 server with two GTX 1070 GPUs from CUDA 8 to CUDA 9.
On my first attempt, I used the local .deb installer for 9.1, but after installation nvidia-smi complained:
“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
I tried to download and reinstall the driver manually at this stage, but the .run file aborted saying the preinstall failed.
Today, I returned to the problem, this time installing 9.2 (using the network installer) after explicitly removing the old CUDA 8 install (which I realised I’d forgotten to do the first time) and, of course, the attempted 9.1 install. However, the same problem occurs: nvidia-smi reports the same error.
Note that ‘lspci | grep -i nvidia’ confirms the GPUs are there. Also I get the following error after ‘sudo /sbin/modprobe nvidia’:
“modprobe: ERROR: could not insert ‘nvidia_396’: Exec format error”
I haven’t tried reinstalling the driver - is that what I need to do next?
You may need to follow the driver uninstall procedures from the documentation.
Thanks for the reply; however, /usr/bin/nvidia-uninstall does not exist.
Use the following command to uninstall a Toolkit runfile installation:
$ sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
/usr/bin/nvidia-uninstall will not exist for a network deb install. Likewise the previously mentioned perl script for toolkit uninstall will also not exist for a network deb install. This is already indicated in the supplied instructions.
:) Thank you for the explanation.
Then, as far as I can tell, the only way left to remove the previously installed packages is:
sudo apt-get --purge remove <package_name> # Ubuntu
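To work out which package names to pass to that command, something along these lines might help. It is only a sketch: the helper name installed_gpu_packages is made up for illustration, the cuda/nvidia name prefixes are an assumption about how the packages are named on your system, and the script only prints the purge command so you can review it before running anything.

```shell
# Filter `dpkg -l` output (read from stdin) down to the names of fully
# installed packages ("ii" state) whose name starts with cuda or nvidia.
# The helper name and the prefixes are illustrative assumptions.
installed_gpu_packages() {
    awk '$1 == "ii" && $2 ~ /^(cuda|nvidia)/ { print $2 }'
}

pkgs=$(dpkg -l 2>/dev/null | installed_gpu_packages)

if [ -n "$pkgs" ]; then
    # Dry run: print the purge command for review instead of executing it.
    echo "sudo apt-get --purge remove" $pkgs
else
    echo "no installed cuda/nvidia packages found"
fi
```

Printing rather than executing is deliberate, since a prefix match can catch packages you want to keep.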
I had already purged cuda-8.0 (and the botched cuda-9.1) before installing 9.2, and ‘sudo apt list --installed | grep nvidia’ shows the following:
nvidia-396/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-396-dev/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-opencl-icd-396/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
nvidia-prime/xenial,now 0.8.2 amd64 [installed,automatic]
nvidia-settings/unknown,now 396.26-0ubuntu1 amd64 [installed,automatic]
It looks to me like the old drivers are gone, but modprobe doesn’t like the new ones:
$ sudo /sbin/modprobe nvidia
modprobe: ERROR: could not insert 'nvidia_396': Exec format error
Any idea as to the most likely reason for this message?
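One common cause of "Exec format error" on module load is a module built for a different kernel than the one running, so a quick check like the one below may narrow it down. This is a rough sketch, not a definitive diagnosis: the helper name vermagic_matches is hypothetical, the module name is simply taken from the modprobe error, and a matching vermagic does not rule out other build problems.

```shell
# Succeed when the first field of a module's vermagic string equals the
# given kernel release. The helper name is an illustrative assumption.
vermagic_matches() {
    # $1: vermagic string, e.g. from `modinfo -F vermagic <module>`
    # $2: kernel release, e.g. from `uname -r`
    [ "${1%% *}" = "$2" ]
}

module="nvidia_396"   # module name as reported by the modprobe error
vermagic=$(modinfo -F vermagic "$module" 2>/dev/null)
release=$(uname -r)

if vermagic_matches "$vermagic" "$release"; then
    echo "module was built for the running kernel ($release)"
else
    echo "module vermagic '$vermagic' does not match kernel $release"
fi
# The kernel log (dmesg) usually records the precise reason a module
# was rejected, so it is worth reading right after a failed modprobe.
```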
Maybe you could try a local runfile installation and see whether the outcome is different?
I managed to fix my problem.
It turns out that although I’d ensured gcc 5.4.0 was installed, the system was still defaulting to gcc 4.9.3.
I fixed that via:
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 1
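For anyone wanting to check for the same mismatch, a sketch like this compares the compiler recorded in /proc/version with the one a plain gcc call resolves to. The helper name extract_gcc_version is made up for illustration, and it assumes /proc/version contains the text "gcc version X.Y.Z", which is the case on kernels of this era but not necessarily on newer ones.

```shell
# Pull the first "gcc version X.Y.Z" number out of text on stdin.
# The helper name is an illustrative assumption.
extract_gcc_version() {
    sed -n 's/.*gcc version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Compiler the running kernel was built with, as recorded by the kernel.
kernel_gcc=$(cat /proc/version 2>/dev/null | extract_gcc_version)
# Compiler that a plain "gcc" call resolves to right now.
default_gcc=$(gcc -dumpversion 2>/dev/null)

echo "kernel built with gcc: ${kernel_gcc:-unknown}"
echo "default gcc:           ${default_gcc:-not found}"
```

If the two versions differ, the driver module may have been compiled with the wrong compiler, which is what happened in my case.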
Then, after purging and reinstalling CUDA, everything worked, until I realised I needed CUDA 9.0 for TensorFlow 1.5.0+, not 9.2. Cue another purge and an install of 9.0, and I then got my TensorFlow setup working with the latest tensorflow_gpu version.
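To avoid the same detour, a small check of the installed toolkit release before installing TensorFlow could look like this. It is a sketch under assumptions: the helper name cuda_release is hypothetical, the required value 9.0 is just what I found I needed above, and nvcc has to be on your PATH (it often lives under /usr/local/cuda/bin).

```shell
# Extract the "release X.Y" number from `nvcc --version` output on stdin.
# The helper name is an illustrative assumption.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

required="9.0"   # the release I needed, per the post above
installed=$(nvcc --version 2>/dev/null | cuda_release)

if [ "$installed" = "$required" ]; then
    echo "CUDA $installed matches the required release"
else
    echo "CUDA release is '${installed:-not found}', expected $required"
fi
```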