CUDA 10.0 install claims missing driver, but it is installed.

abeers · March 31, 2019, 6:34pm

Hello.

I work with a machine that has run CUDA 9.0 (and previous versions of CUDA, down to 7.0) comfortably without error; the specs are posted below. I recently attempted to install CUDA 10.0, but hit some installation errors. Other support topics suggested purging my system of NVIDIA packages, doing a fresh re-install via this documentation (Installation Guide Linux :: CUDA Toolkit Documentation), and rebooting. After doing that, I am hitting the following error when attempting to use nvidia-smi:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I attempted to start the driver using

sudo modprobe nvidia

but received the following error message:

modprobe: ERROR: could not insert 'nvidia_418': Package not installed

This is confusing to me, as that driver should have just been installed with my fresh CUDA 10.0 installation, which was managed via the deb installer. Using:

dpkg -l | grep nvidia

I get:

ii  nvidia-418                                                  418.56-0ubuntu0~gpu14.04.1                           amd64        NVIDIA binary driver - version 418.56
ii  nvidia-418-dev                                              418.56-0ubuntu0~gpu14.04.1                           amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                             418.40.04-0ubuntu1                                   amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-418                                       418.56-0ubuntu0~gpu14.04.1                           amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                                0.6.2.1                                              amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                             418.56-0ubuntu0~gpu14.04.1                           amd64        Tool for configuring the NVIDIA graphics driver

which shows the driver that I supposedly don’t have installed. Attempting to install this driver via apt-get also states that it is already installed.

Running

lsmod | grep nvidia

returns nothing in this case, which may be why my installation can’t locate my drivers. But I’m not sure how to install them correctly if that is the case…

Does anyone know what the next step is at this point? I have already tried uninstalling and reinstalling multiple times now, and reboot each time. Any help would be appreciated.

Machine specifications:

Distributor ID: Ubuntu
Description:    Ubuntu 14.04.6 LTS
Release:        14.04
Codename:       trusty

saulocpp · April 1, 2019, 11:55am

Can you try removing all of these packages and using the stand-alone executable in the download page?
I never have problems when I use it.

abeers · April 1, 2019, 3:47pm

I have uninstalled these packages, uninstalled CUDA 10.0 via apt-get, and installed via the runfile. The runfile installed successfully, but I am encountering the same problem. The only difference is that “dpkg -l | grep nvidia” now returns nothing.

I have uninstalled that runfile, and reinstalled 10.0 via the deb package for my distribution as originally done. I am now facing the same errors as in the original post.

saulocpp · April 1, 2019, 5:03pm

Check which driver version that CUDA version requires (see the requirements in the installation guide), and install it with apt-get. You will probably have to add a PPA to your repositories to get this driver. Search for how to do that on the distro you are using.
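
On Ubuntu, one possible way to do this is the graphics-drivers PPA. This is only a rough sketch: the PPA name and the nvidia-418 package name are assumptions based on the package listing earlier in this thread, and the helper name install_driver_cmds is made up for illustration. It prints the commands instead of running them, so they can be reviewed first.

```shell
# Sketch only: assumes Ubuntu and the graphics-drivers PPA (an assumption,
# not something prescribed by the CUDA installation guide).
DRIVER_PKG="nvidia-418"   # assumption: package name from the dpkg listing above

# Hypothetical helper: print the install commands for a given driver package.
install_driver_cmds() {
    printf '%s\n' \
        "sudo add-apt-repository ppa:graphics-drivers/ppa" \
        "sudo apt-get update" \
        "sudo apt-get install $1"
}

install_driver_cmds "$DRIVER_PKG"
```

Copy the printed lines into a terminal one at a time once the package name matches the driver version you actually need.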

The reason I am suggesting this approach is that, as you and a lot of other people have noticed, one update can break the dependencies, and rolling back is time-consuming. I have a Windows machine, a Linux machine and a Mac, all set up with different cards and CUDA versions; I develop on the machine with the lowest version and compile/run on all of them. I can’t afford to spend time fixing a dev environment broken by a stupid update, so I do it the easiest way: I let the package manager handle only the driver, not the CUDA-related stuff.

abeers · April 1, 2019, 10:00pm

First off, thank you for your help with my problem. Unfortunately, I get the same error when I uninstall and reinstall the drivers. I’ve purged the *cuda and *nvidia packages from my system several times and re-installed from both the deb package and the runfile, but to no avail so far… It seems like something might be surviving the uninstalls.

saulocpp · April 2, 2019, 7:31am

First, on a console run “nvcc” and see if it responds. If it does, you still have CUDA stuff installed and need to uninstall properly.
Then, run “nvidia-settings” to check whether a driver is installed and which version it is. If an NVIDIA driver is installed and it supports the CUDA version you intend to use, leave it alone. If it is not installed, download it from here:
https://www.nvidia.com/object/unix.html

Install, reboot and make sure it is working. Don’t install CUDA just yet, get the driver running first and let us know.
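
A minimal sketch of what "make sure it is working" could look like from a console. The function name nvidia_module_loaded is made up for this example, and it assumes the loaded module's name starts with "nvidia" (as in the nvidia_418 name from the modprobe error above). It reads /proc/modules, which is the same information lsmod shows.

```shell
# Hypothetical helper: returns 0 if a module whose name starts with "nvidia"
# is listed in the given modules file (defaults to /proc/modules).
nvidia_module_loaded() {
    modules_file="${1:-/proc/modules}"
    grep -q '^nvidia' "$modules_file" 2>/dev/null
}

if nvidia_module_loaded; then
    echo "nvidia kernel module is loaded; next try: nvidia-smi"
else
    echo "nvidia kernel module is not loaded; check: dmesg | grep -i nvidia"
fi
```

If the module is not loaded after a reboot, the kernel log is usually the first place to look before reinstalling anything.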

dknodel · May 24, 2019, 8:14am

Hi,
I encountered the same error message when installing cuda-drivers for CUDA 10 on CentOS 7, even though I had been able to install and use the CUDA 9 drivers successfully.
I tracked it down to the fact that in the CUDA 10 drivers, none of the RPMs has a dependency built in for the “kernel-devel” package, whereas in CUDA 9 (and earlier?) versions, the nvidia-kmod RPM had such a dependency built in. So when installing the CUDA 10 drivers, it didn’t automatically install “kernel-devel”, and so the driver didn’t load. I tried again, first installing the kernel-devel package explicitly, and then the installation of cuda-drivers for version 10 worked fine.
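
For anyone who wants to check this before installing, here is a rough sketch. It assumes the headers for a kernel show up as a directory at /lib/modules/<release>/build, which is normally a symlink that dangles when the headers package is missing. The function name kernel_headers_present is made up for this example, and the Ubuntu package name in the hint is my assumption by analogy, since I only tested on CentOS.

```shell
# Hypothetical helper: returns 0 if kernel build files exist for the given
# release under the given modules root (defaults: running kernel, /lib/modules).
kernel_headers_present() {
    release="${1:-$(uname -r)}"
    modules_root="${2:-/lib/modules}"
    # -d follows symlinks, so a dangling "build" link counts as missing.
    [ -d "$modules_root/$release/build" ]
}

if kernel_headers_present; then
    echo "kernel headers found for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
    echo 'CentOS: sudo yum install "kernel-devel-$(uname -r)"'
    echo 'Ubuntu (assumption): sudo apt-get install "linux-headers-$(uname -r)"'
fi
```

Note that the headers have to match the running kernel exactly, so a kernel update without a reboot can also make this check fail.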

It would be nice if NVIDIA developers could add the “kernel-devel” dependency to one of the CUDA 10 RPMs, perhaps the “dkms-nvidia” package.

Regards,

dave