"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on Ubuntu 17.10

Hi there,

I have Ubuntu 17.10. I am trying to get nvidia-smi to work. But I am getting the error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.

What I tried to fix this:

  • I tried to install gpu drivers: nvidia-384, nvidia-387, nvidia-390, nvidia-396. I checked each, but none worked for me.
  • I installed cuda-9.0 for ubuntu 16.04 via deb package install. Didn't work.
  • I installed cuda-9.0 for ubuntu 16.04 via run installer. Didn't work.
  • I installed cuda-9.0 for ubuntu 17.04 via deb package install. Didn't work.
  • I installed cuda-9.2 for ubuntu 17.10 via deb package install. Didn't work.
  • I installed cuda-9.2 for ubuntu 17.10 via run installer. Didn't work.
  • I set LD_LIBRARY_PATH and PATH accordingly.
  • I rebooted each time. The output of lsmod | grep nvidia always stayed blank.
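
For reference, the PATH and LD_LIBRARY_PATH step can be sketched as below. This is a minimal sketch that assumes the default toolkit location /usr/local/cuda-9.2 shown by the installer later in this thread; CUDA_HOME is only a local shell variable chosen for the example. These variables help the shell and the dynamic linker find the toolkit. They have no influence on whether the nvidia kernel module loads, so they cannot by themselves fix an empty lsmod output.

```shell
# Assumed toolkit location (the installer default for CUDA 9.2).
CUDA_HOME=/usr/local/cuda-9.2

# Prepend the toolkit directories. The ${VAR:+...} form avoids a stray
# trailing colon when the variable was previously empty or unset.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "$PATH"
echo "$LD_LIBRARY_PATH"
```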

Which binaries am I supposed to choose if I have Ubuntu 17.10 and the binaries provided are only for Ubuntu 16.04 and Ubuntu 17.04? I would like to get it working on Ubuntu 17.10 without upgrading or downgrading.

Sometimes the cuda install removed the nvidia driver I previously installed and reinstalled a different nvidia driver. nvidia-smi still didn’t work.
I removed previously installed nvidia drivers via

sudo apt purge nvidia-*

I removed cuda via

sudo rm -r /usr/local/cuda*

I tried

sudo apt purge cuda*

which generated the following output

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package cuda_9.0.176_384.81_linux-run
E: Couldn't find any package by glob 'cuda_9.0.176_384.81_linux-run'
E: Couldn't find any package by regex 'cuda_9.0.176_384.81_linux-run'
E: Unable to locate package cuda_9.2.88_396.26_linux
E: Couldn't find any package by glob 'cuda_9.2.88_396.26_linux'
E: Couldn't find any package by regex 'cuda_9.2.88_396.26_linux'
E: Unable to locate package cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
E: Couldn't find any package by glob 'cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb'
E: Couldn't find any package by regex 'cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb'
E: Unable to locate package cuda-repo-ubuntu1710-9-2-local_9.2.88-1_amd64
E: Couldn't find any package by glob 'cuda-repo-ubuntu1710-9-2-local_9.2.88-1_amd64'
E: Couldn't find any package by regex 'cuda-repo-ubuntu1710-9-2-local_9.2.88-1_amd64'
E: Unable to locate package cuda-repo-ubuntu1710-9-2-local-cublas-update-1_1.0-1_amd64
E: Couldn't find any package by glob 'cuda-repo-ubuntu1710-9-2-local-cublas-update-1_1.0-1_amd64'
E: Couldn't find any package by regex 'cuda-repo-ubuntu1710-9-2-local-cublas-update-1_1.0-1_amd64'

As you can see it is kind of messy now… How can I remove those packages properly?
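
The error list above contains file names rather than package names, which suggests the shell expanded the unquoted cuda* against installer files in the current directory before apt ever saw the pattern. The sketch below reproduces that effect in a throwaway directory. The file name is taken from the output above; the apt and dpkg commands in the comments are suggestions and were not verified on this system.

```shell
# Work in a scratch directory that holds one installer-like file.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch cuda_9.2.88_396.26_linux

# Unquoted: the shell replaces the pattern with matching file names.
unquoted=$(echo cuda*)
# Quoted: the pattern is passed through literally, as apt would need it.
quoted=$(echo 'cuda*')

echo "unquoted -> $unquoted"
echo "quoted   -> $quoted"

# On the real system, the equivalent steps would be (not run here):
#   dpkg -l | grep -i -e cuda -e nvidia    # list what is actually installed
#   sudo apt purge 'cuda*' 'nvidia-*'      # quoted, so apt does the matching
```

Running the purge from a directory without matching files would also avoid the expansion, but quoting the pattern is the more robust habit.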

I already checked https://devtalk.nvidia.com/default/topic/1000340/cuda-setup-and-installation/-quot-nvidia-smi-has-failed-because-it-couldn-t-communicate-with-the-nvidia-driver-quot-ubuntu-16-04/2 and https://devtalk.nvidia.com/default/topic/1000340/-quot-nvidia-smi-has-failed-because-it-couldn-t-communicate-with-the-nvidia-driver-quot-ubuntu-16-04/?offset=35. None of those suggestions worked for me.

Please help. Kind regards,
Thomy800

Drivers installed from a PPA generally don’t have what’s needed for CUDA activity, and don’t include nvidia-smi.

I would start with a clean install of the Linux OS and use the runfile installer, either for the driver itself or for CUDA.

Saying “Didn’t work” is not useful from a troubleshooting perspective.

I would encourage you to read the Linux install guide thoroughly.

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Here is a more complete output.

mycomputer3:~$ sudo sh cuda_9.2.88_396.26_linux
Logging to /tmp/cuda_install_5693.log

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 396.26?
(y)es/(n)o/(q)uit: y

Do you want to install the OpenGL libraries?
(y)es/(n)o/(q)uit [ default is yes ]: y

Do you want to run nvidia-xconfig?
This will update the system X configuration file so that the NVIDIA X driver
is used. The pre-existing X configuration file will be backed up.
This option should not be used on systems that require a custom
X configuration, such as systems with multiple GPU vendors.
(y)es/(n)o/(q)uit [ default is no ]: y

Install the CUDA 9.2 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.2 ]:

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.2 Samples?
(y)es/(n)o/(q)uit: n

Installing the NVIDIA display driver...
Installing the CUDA Toolkit in /usr/local/cuda-9.2 ...

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-9.2
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-9.2/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.2/lib64, or, add /usr/local/cuda-9.2/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.2/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.2/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_5693.log
mycomputer3:~$ echo $LD_LIBRARY_PATH
/usr/local/cuda-9.2/lib64:/usr/local/cuda/extras/CUPTI/lib64/
mycomputer3:~$ /usr/local/cuda
cuda/     cuda-9.2/
mycomputer3:~$ sudo reboot
...
mycomputer3:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

mycomputer3:~$ lsmod | grep nvidia
mycomputer3:~$

You might just need to reboot.

Rebooting did not lead to any change.

What is the output of:

lspci |grep -i nvidia

and:

dmesg |grep NVRM

and

lsmod |grep nv

and

lsmod |grep nouv

?
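
The four checks requested above can be collected in one pass. This is a minimal sketch; run_check is a hypothetical helper name and the section labels are illustrative.

```shell
# run_check LABEL COMMAND...: print a header, then the command output,
# or "(no output)" when the command prints nothing or is unavailable.
run_check() {
    label=$1
    shift
    echo "== $label =="
    out=$("$@" 2>/dev/null) || true
    if [ -n "$out" ]; then
        echo "$out"
    else
        echo "(no output)"
    fi
}

run_check "NVIDIA PCI devices" sh -c 'lspci | grep -i nvidia'
run_check "NVRM kernel messages" sh -c 'dmesg | grep NVRM'
run_check "loaded nv* modules" sh -c 'lsmod | grep nv'
run_check "loaded nouveau module" sh -c 'lsmod | grep nouv'
```

If the GPUs show up under the first header while the other three sections stay empty, that suggests the hardware is detected but neither the nvidia nor the nouveau kernel module is loaded.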

mycomputer3:~$ lspci |grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
06:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
09:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation GM200 High Definition Audio (rev a1)
mycomputer3:~$ dmesg |grep NVRM
mycomputer3:~$ lsmod |grep nv
mycomputer3:~$ lsmod |grep nouv
mycomputer3:~$

What is the output of:

cat /tmp/cuda_install_5693.log

and

cat /var/log/nvidia-installer.log

Since I rebooted and the log file was located in the /tmp directory, it was deleted:

mycomputer3:~$ /tmp/cuda_install_5693.log
-bash: /tmp/cuda_install_5693.log: No such file or directory

Output of /var/log/nvidia-installer.log:

mycomputer3:~$ cat /var/log/nvidia-installer.log
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Fri Jun  1 16:05:18 2018
installer version: 396.26

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --ui=none
    --no-questions
    --accept-license
    --disable-nouveau
    --run-nvidia-xconfig
    --dkms

Using built-in stream user interface
-> Detected 8 CPUs online; setting concurrency level to 8.
-> Installing NVIDIA driver version 396.26.
-> Running distribution scripts
   executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed!  Are you sure you want to continue? (Answer: Continue installation)
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to reenable Nouveau, you will need to delete these files: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
-> Installing both new and classic TLS OpenGL libraries.
-> Installing both new and classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.396.26"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.396.26"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.396.26"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.396.26"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Can't load library libGLdispatch.so.0: libGLdispatch.so.0: cannot open shared object file: No such file or directory
Will install libglvnd libraries.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (396.26):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
-> done.
ERROR: Unable to load the 'nvidia-drm' kernel module.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The last two lines of your output (the two ERROR lines, starting with "Unable to load the 'nvidia-drm' kernel module") may be the problem. I’m not sure why that is.

If you google that error you will find things like this:

https://devtalk.nvidia.com/default/topic/1028367/linux/unable-to-load-the-nvidia-drm-kernel-module-on-ubuntu-16-04/

https://devtalk.nvidia.com/default/topic/1031913/linux/driver-390-42-installation-failed-unable-to-load-the-nvidia-drm-/

You may want to try some of the things suggested there.
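
To pull only the failure lines out of a long installer log, something like the following could be used. It is a sketch: show_installer_problems is a hypothetical helper, and it assumes the log marks problems with ERROR: or WARNING: at the start of a line or with the word "failed", as the log above does.

```shell
# show_installer_problems [LOGFILE]: print numbered lines that start with
# ERROR: or WARNING:, or that mention "failed".
show_installer_problems() {
    logfile=${1:-/var/log/nvidia-installer.log}
    grep -n -E '^(ERROR|WARNING):|failed' "$logfile"
}

# Only run against the real log when it exists and is readable.
if [ -r /var/log/nvidia-installer.log ]; then
    show_installer_problems || echo "no problem lines found"
fi
```

On the log posted above, this pattern would surface the failed pre-install script and the Nouveau warning in addition to the two ERROR lines, which may be worth reading together.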

Hi,

I’ve read those threads you linked to and I just checked a few suggestions.

I do not think gcc is too old, since I have versions 5, 6 and 7 installed.

Neither do I have kernel version 116.

acpi_osi flag:

sudo nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_osi=Linux"
sudo update-grub

Which apparently did not change anything.
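
Whether the flag actually reached the running kernel can be checked after a reboot by reading /proc/cmdline. A minimal sketch, with has_kernel_param as a hypothetical helper name:

```shell
# has_kernel_param PARAM [CMDLINE_FILE]: succeed when PARAM appears as a
# whole word on the kernel command line (default: /proc/cmdline).
has_kernel_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    [ -r "$cmdline_file" ] || return 1
    # One parameter per line, then require an exact, literal line match.
    tr ' ' '\n' < "$cmdline_file" | grep -q -x -F -- "$param"
}

if has_kernel_param 'acpi_osi=Linux'; then
    echo "acpi_osi=Linux is active"
else
    echo "acpi_osi=Linux is not on the running kernel command line"
fi
```

If the parameter is missing after a reboot, the GRUB change did not take effect, so its lack of impact would say nothing about the flag itself.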

However I was running:

sudo startx -- -logverbose 6
sudo nvidia-bug-report.sh

which reported: Running nvidia-bug-report.sh…ls: cannot access ‘/proc/driver/nvidia/./gpus/’: No such file or directory
Is it supposed to do that?
I attached the log anyways. Can you find something unusual?

Thanks for your help,
Thomy800
nvidia-bug-report.log (318 KB)