I work with a machine that has run CUDA 9.0 (and previous versions of CUDA down to 7.0) comfortably without error; the specs are posted below. I recently attempted to install CUDA 10.0 but hit some installation errors. Other support topics suggested purging my system of NVIDIA packages, doing a fresh reinstall following this documentation (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html), and rebooting. After doing all of that, I am hitting the following error when attempting to use nvidia-smi:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I attempted to start the driver using
sudo modprobe nvidia
but received the following error message:
modprobe: ERROR: could not insert 'nvidia_418': Package not installed
This is confusing to me, as that driver should have just been installed with my fresh CUDA 10.0 installation, which was managed via the deb package installer. I ran:
dpkg -l | grep nvidia
ii  nvidia-418             418.56-0ubuntu0~gpu14.04.1  amd64  NVIDIA binary driver - version 418.56
ii  nvidia-418-dev         418.56-0ubuntu0~gpu14.04.1  amd64  NVIDIA binary Xorg driver development files
ii  nvidia-modprobe        418.40.04-0ubuntu1          amd64  Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-418  418.56-0ubuntu0~gpu14.04.1  amd64  NVIDIA OpenCL ICD
ii  nvidia-prime           0.6.2.1                     amd64  Tools to enable NVIDIA's Prime
ii  nvidia-settings        418.56-0ubuntu0~gpu14.04.1  amd64  Tool for configuring the NVIDIA graphics driver
This shows the driver that I supposedly don’t have installed. Attempting to install this driver via apt-get also reports that it is already installed.
Running
lsmod | grep nvidia
returns nothing in this case, which may be why my installation can’t locate my drivers. But I’m not sure how to install them correctly if that is the case…
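One way to narrow this down is to check whether a kernel module file was ever built for the running kernel, since modprobe can only insert a module that exists under /lib/modules for that kernel. The following is a minimal sketch, not a definitive diagnostic: the function name and the overridable modules root are hypothetical helpers introduced here, and it assumes the driver package builds its module on the machine (for example via DKMS), which the thread does not confirm.

```shell
#!/bin/sh
# List NVIDIA kernel module files present for a given kernel release.
#   $1: kernel release (default: the running kernel, from `uname -r`)
#   $2: modules root   (default: /lib/modules; overridable for testing)
# Hypothetical helper for illustration, not part of any NVIDIA tooling.
nvidia_modules_for_kernel() {
    # Matches nvidia.ko, nvidia_418.ko, and compressed variants such as .ko.xz
    find "${2:-/lib/modules}/${1:-$(uname -r)}" -name 'nvidia*.ko*' 2>/dev/null
}
```

Calling "nvidia_modules_for_kernel" with no arguments inspects the running kernel. Empty output would suggest that no module was built for this kernel, in which case "dkms status" (if DKMS is in use) and the presence of the kernel headers matching "uname -r" are worth checking next.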
Does anyone know what the next step is at this point? I have already tried uninstalling and reinstalling multiple times, rebooting each time. Any help would be appreciated.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.6 LTS
Can you try removing all of these packages and using the standalone runfile installer from the download page instead?
I never have problems when I use it.
I have uninstalled these packages, uninstalled CUDA 10.0 via apt-get, and installed via the runfile. The runfile installed successfully, but I am encountering the same problem. The only difference is that “dpkg -l | grep nvidia” now returns nothing.
I have since uninstalled the runfile installation and reinstalled 10.0 via the deb package for my distribution, as originally done. I am now facing the same errors as in the original post.
See what driver version is required for that CUDA version (check the requirements section of the installation guide), and install it with apt-get. You will probably have to add a PPA to your repositories to get this driver. Search for how to do that on the distro you are using.
The reason I am suggesting this approach is that, as you and a lot of other people have noticed, one update can break the dependencies, and rolling back is time consuming. I have a Windows, a Linux and a Mac machine, all set up with different cards and CUDA versions; I develop on the machine with the lowest version and compile/run on all of them. I can’t afford to spend time fixing a dev environment broken by a stupid update, so I do it the easiest way: I let the package manager handle the driver only, not the CUDA-related stuff.
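For Ubuntu, the driver-only approach described above might look roughly like the sketch below. The PPA name "ppa:graphics-drivers/ppa" and the package name "nvidia-418" are assumptions (the package name is taken from the dpkg listing earlier in the thread), and "run" and "install_driver_from_ppa" are hypothetical helper names. Setting DRY_RUN prints the commands instead of executing them, so the steps can be reviewed first.

```shell
#!/bin/sh
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Install only the NVIDIA driver from a PPA, leaving CUDA itself out of apt.
#   $1: PPA to add          (assumed default: ppa:graphics-drivers/ppa)
#   $2: driver package name (assumed default: nvidia-418)
# The subshell body (parentheses) keeps the variables local to the function.
install_driver_from_ppa() (
    ppa="${1:-ppa:graphics-drivers/ppa}"
    pkg="${2:-nvidia-418}"
    run sudo add-apt-repository -y "$ppa" &&
    run sudo apt-get update &&
    run sudo apt-get install -y "$pkg"
)
```

Running "DRY_RUN=1 install_driver_from_ppa" only prints the three commands. Dropping DRY_RUN executes them, after which a reboot and a check with nvidia-smi would confirm whether the driver loads before any CUDA toolkit is installed.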
First off, thank you for your help on my problem. Unfortunately, I’m met with the same error when I uninstall and reinstall the drivers. I’ve purged all cuda and nvidia packages from my system several times and reinstalled from both the deb package and the runfile, but to no avail so far. It seems like something might be surviving the uninstalls.
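To see what, if anything, survives a purge on a dpkg-based system, the package database can be filtered for leftovers. This is only a sketch: "leftover_gpu_packages" is a hypothetical helper name, and it only sees packages that dpkg tracks, so files placed by a runfile installation will not show up.

```shell
#!/bin/sh
# Read `dpkg -l` output on stdin and print "<status> <package>" for every
# NVIDIA- or CUDA-related package that dpkg still knows about.
# Status "ii" means installed; "rc" means removed with config files left behind.
# Header lines are skipped because their first field does not start with a
# lowercase letter.
leftover_gpu_packages() {
    awk '$1 ~ /^[a-z][A-Za-z]/ && $2 ~ /nvidia|cuda/ { print $1, $2 }'
}
```

Typical use is "dpkg -l | leftover_gpu_packages". Entries marked "rc" can be cleared with "sudo dpkg --purge" followed by the package name. A driver installed from a runfile normally ships its own "nvidia-uninstall" program instead.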
First, in a console, run “nvcc” and see if it responds. If it does, you still have CUDA stuff installed and need to uninstall it properly.
Then, run “nvidia-settings” and see whether you have a driver installed and what version it is. If you have an NVIDIA driver installed and it covers the CUDA version you intend to use, leave the thing alone. But if it is not installed, download it from here:
Install, reboot and make sure it is working. Don’t install CUDA just yet, get the driver running first and let us know.
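The two checks above can also be done from a console without a GUI. The sketch below assumes the loaded driver exposes its version in /proc/driver/nvidia/version, a path not mentioned in the thread; both function names are hypothetical helpers.

```shell
#!/bin/sh
# Print the version of the currently loaded NVIDIA kernel module.
# Fails (non-zero status) if the version file is missing or unreadable.
#   $1: version file (default: /proc/driver/nvidia/version; overridable for testing)
loaded_driver_version() {
    # The file exists only while the nvidia kernel module is loaded.
    [ -r "${1:-/proc/driver/nvidia/version}" ] || return 1
    # Print the first version-like field (e.g. 418.56) on the NVRM line.
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }' "${1:-/proc/driver/nvidia/version}"
}

# Succeed if a CUDA compiler is still reachable through PATH.
cuda_toolkit_on_path() {
    command -v nvcc >/dev/null 2>&1
}
```

If "loaded_driver_version" fails, no NVIDIA kernel module is loaded, which matches the empty lsmod output reported earlier. If "cuda_toolkit_on_path" succeeds after an uninstall, some CUDA toolkit is still installed somewhere on PATH.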
I encountered the same error message when installing cuda-drivers for CUDA 10 on CentOS 7, even though I had been able to install and use the CUDA 9 drivers successfully.
I tracked it down to the fact that, in the CUDA 10 drivers, none of the RPMs has a dependency on the “kernel-devel” package, whereas in CUDA 9 (and earlier?) versions, the nvidia-kmod RPM had such a dependency built in. So when installing the CUDA 10 drivers, “kernel-devel” was not installed automatically, and so the driver didn’t load. I tried again, first installing the kernel-devel package explicitly, and then the installation of cuda-drivers for version 10 worked fine.
It would be nice if the NVIDIA developers could add the “kernel-devel” dependency to one of the CUDA 10 RPMs, perhaps the “dkms-nvidia” RPM.
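The ordering described above (kernel-devel first, then cuda-drivers) could be scripted roughly as follows. This is a sketch under assumptions: it relies on /lib/modules/<release>/build pointing at the kernel build tree, which dangles when kernel-devel is missing, and "run", "kernel_build_tree_present" and "install_cuda_drivers_centos" are hypothetical helper names. Setting DRY_RUN prints the commands instead of running them.

```shell
#!/bin/sh
# Print each command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Succeed if the build tree for a kernel release is present.
#   $1: kernel release (default: the running kernel)
#   $2: modules root   (default: /lib/modules; overridable for testing)
kernel_build_tree_present() {
    [ -d "${2:-/lib/modules}/${1:-$(uname -r)}/build" ]
}

# Install kernel-devel for the given kernel if its build tree is missing,
# then install cuda-drivers. The subshell body keeps variables local.
install_cuda_drivers_centos() (
    release="${1:-$(uname -r)}"
    if ! kernel_build_tree_present "$release"; then
        run sudo yum install -y "kernel-devel-$release" || exit 1
    fi
    run sudo yum install -y cuda-drivers
)
```

Running "DRY_RUN=1 install_cuda_drivers_centos" shows which packages would be installed for the running kernel. The kernel-devel version has to match "uname -r" exactly, so a pending kernel update may need a reboot first.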