Nvlinked Titan RTX Chips: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I’m sorry to post a question that’s been asked before, but I’m trying to configure CUDA on an Ubuntu 18.04 machine and am a little stuck.

I tried to install CUDA 10.0 from the bash (.run) installer, and I opted to install the bundled driver as well. Now, however, running nvidia-smi yields:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I attempted to completely uninstall all drivers with the following:

sudo apt-get --purge -y remove 'cuda*'
sudo apt-get --purge -y remove 'nvidia*'

I then tried to install a new driver with sudo apt install nvidia-driver-440, but I continue to get “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver.”

I’m now trying to figure out how to get my nvidia driver to work peacefully with my system and enable nvidia-smi. If that’s not possible, I’d like to install a new driver. Any pointers on how to achieve either of these paths would be hugely helpful! I’m happy to provide any information along the way that might be useful…

Some clues: X seems to be disabled now (see below), but normally there’s a console. Also, inxi -G reports that I’m still using the nouveau driver instead of nvidia.

The machine has two Titan RTX chips attached. Here’s a screenshot of them in action yesterday before I borked the drivers:

[Screenshot: the two Titan RTX GPUs in use]

Here is the content of /var/log/nvidia-installer.log:

nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Apr  9 21:55:41 2020
installer version: 410.48

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    --ui=none
    --no-questions
    --accept-license
    --disable-nouveau
    --no-cc-version-check
    --dkms

Using built-in stream user interface
-> Detected 32 CPUs online; setting concurrency level to 32.
-> Installing NVIDIA driver version 410.48.
-> There appears to already be a driver installed on your system (version: 410.48).  As part of installing this driver (version: 410.48), the existing driver will be uninstalled.  Are you sure you want to continue? (Answer: Continue installation)
-> Running distribution scripts
   executing: '/usr/lib/nvidia/pre-install'...
-> done.
-> The distribution-provided pre-install script failed!  Are you sure you want to continue? (Answer: Continue installation)
WARNING: One or more modprobe configuration files to disable Nouveau are already present at: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.  Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
-> For some distributions, Nouveau can be disabled by adding a file in the modprobe configuration directory.  Would you like nvidia-installer to attempt to create this modprobe file for you? (Answer: Yes)
-> One or more modprobe configuration files to disable Nouveau have been written.  For some distributions, this may be sufficient to disable Nouveau; other distributions may require modification of the initial ramdisk.  Please reboot your system and attempt NVIDIA driver installation again.  Note if you later wish to reenable Nouveau, you will need to delete these files: /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
-> Installing both new and classic TLS OpenGL libraries.
-> Installing both new and classic TLS 32bit OpenGL libraries.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.410.48"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.410.48"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.410.48"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.410.48"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Uninstalling the previous installation with /usr/bin/nvidia-uninstall.
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Checking libGLdispatch dispatch table
   Checking call through libGLdispatch
   All OK
   libGLdispatch is OK
   Checking for libGLX
   libGLX is OK
   Checking for libEGL
   libEGL is OK
   Checking entrypoint library libOpenGL.so.0
   Checking call through libGLdispatch
   Checking call through library libOpenGL.so.0
   All OK
   Entrypoint library libOpenGL.so.0 is OK
   Checking entrypoint library libGL.so.1
   Checking call through libGLdispatch
   Checking call through library libGL.so.1
   All OK
   Entrypoint library libGL.so.1 is OK
   
   Found libglvnd libraries: libGL.so.1 libOpenGL.so.0 libEGL.so.1 libGLX.so.0 libGLdispatch.so.0 
   Missing libglvnd libraries: 
   
   libglvnd appears to be installed.
Will not install libglvnd libraries.
-> Skipping GLVND file: "libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "libGLdispatch.so.0"
-> Skipping GLVND file: "libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
-> Skipping GLVND file: "./32/libOpenGL.so.0"
-> Skipping GLVND file: "libOpenGL.so"
-> Skipping GLVND file: "./32/libGLdispatch.so.0"
-> Skipping GLVND file: "./32/libGLESv2.so.2.1.0"
-> Skipping GLVND file: "libGLESv2.so.2"
-> Skipping GLVND file: "libGLESv2.so"
-> Skipping GLVND file: "./32/libGLESv1_CM.so.1.2.0"
-> Skipping GLVND file: "libGLESv1_CM.so.1"
-> Skipping GLVND file: "libGLESv1_CM.so"
-> Skipping GLVND file: "./32/libGL.so.1.7.0"
-> Skipping GLVND file: "libGL.so.1"
-> Skipping GLVND file: "libGL.so"
-> Skipping GLVND file: "./32/libGLX.so.0"
-> Skipping GLVND file: "libGLX.so"
-> Skipping GLVND file: "./32/libEGL.so.1.1.0"
-> Skipping GLVND file: "libEGL.so.1"
-> Skipping GLVND file: "libEGL.so"
WARNING: Unable to determine the path to install the libglvnd EGL vendor library config files. Check that you have pkg-config and the libglvnd development libraries installed, or specify a path with --glvnd-egl-config-path.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (410.48):
   executing: '/sbin/ldconfig'...
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 410.48 -k 5.3.0-46-generic`: 
Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j32 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=5.3.0-46-generic IGNORE_CC_MISMATCH='1' modules....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 410.48 not found
Error! Bad return status for module build on kernel: 5.3.0-46-generic (x86_64)
Consult /var/lib/dkms/nvidia/410.48/build/make.log for more information.
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The following is displayed when I run sudo startx -- -logverbose 6:

X.Org X Server 1.20.5
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.4.0-170-generic x86_64 Ubuntu
Current Operating System: Linux threadripper 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.3.0-46-generic root=UUID=73811e0a-1a2a-4092-ab1a-72ad097e9fa4 ro quiet splash vt.handoff=1
Build Date: 18 December 2019  08:15:29AM
xorg-server-hwe-18.04 2:1.20.5+git20191008-0ubuntu1~18.04.1 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.34.0
	Before reporting problems, check http://wiki.x.org
	to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
	(++) from command line, (!!) notice, (II) informational,
	(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Fri Apr 10 11:33:28 2020
(==) Using system config directory "/usr/share/X11/xorg.conf.d"

With X running, I ran sudo nvidia-bug-report.sh and got this output.

Here is the result of apt-cache policy:

$ apt-cache policy
Package files:
 100 /var/lib/dpkg/status
     release a=now
 500 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main i386 Packages
     release v=18.04,o=LP-PPA-graphics-drivers,a=bionic,n=bionic,l=Proprietary GPU Drivers,c=main,b=i386
     origin ppa.launchpad.net
 500 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages
     release v=18.04,o=LP-PPA-graphics-drivers,a=bionic,n=bionic,l=Proprietary GPU Drivers,c=main,b=amd64
     origin ppa.launchpad.net
 500 http://dl.google.com/linux/chrome/deb stable/main amd64 Packages
     release v=1.0,o=Google LLC,a=stable,n=stable,l=Google,c=main,b=amd64
     origin dl.google.com
 600 http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
     release o=NVIDIA,l=NVIDIA CUDA,c=
     origin developer.download.nvidia.com
 500 http://security.ubuntu.com/ubuntu bionic-security/multiverse i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=multiverse,b=i386
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=multiverse,b=amd64
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/universe i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=universe,b=i386
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=universe,b=amd64
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/restricted i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=restricted,b=i386
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=restricted,b=amd64
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/main i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=main,b=i386
     origin security.ubuntu.com
 500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-security,n=bionic,l=Ubuntu,c=main,b=amd64
     origin security.ubuntu.com
 100 http://us.archive.ubuntu.com/ubuntu bionic-backports/universe i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-backports,n=bionic,l=Ubuntu,c=universe,b=i386
     origin us.archive.ubuntu.com
 100 http://us.archive.ubuntu.com/ubuntu bionic-backports/universe amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-backports,n=bionic,l=Ubuntu,c=universe,b=amd64
     origin us.archive.ubuntu.com
 100 http://us.archive.ubuntu.com/ubuntu bionic-backports/main i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-backports,n=bionic,l=Ubuntu,c=main,b=i386
     origin us.archive.ubuntu.com
 100 http://us.archive.ubuntu.com/ubuntu bionic-backports/main amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-backports,n=bionic,l=Ubuntu,c=main,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/multiverse i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=multiverse,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=multiverse,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/universe i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=universe,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=universe,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/restricted i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=restricted,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=restricted,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/main i386 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=main,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic-updates,n=bionic,l=Ubuntu,c=main,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/multiverse i386 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=multiverse,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/multiverse amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=multiverse,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/universe i386 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=universe,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=universe,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/restricted i386 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=restricted,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/restricted amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=restricted,b=amd64
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/main i386 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=main,b=i386
     origin us.archive.ubuntu.com
 500 http://us.archive.ubuntu.com/ubuntu bionic/main amd64 Packages
     release v=18.04,o=Ubuntu,a=bionic,n=bionic,l=Ubuntu,c=main,b=amd64
     origin us.archive.ubuntu.com
Pinned packages:
$

output of inxi -G:

Graphics:  Card-1: NVIDIA Device 1e02
           Card-2: NVIDIA Device 1e02
           Display Server: N/A drivers: fbdev,nouveau (unloaded: modesetting,vesa)
           tty size: 128x26 Advanced Data: N/A out of X

output of lsmod | grep nvidia:

i2c_nvidia_gpu         16384  0

output of grep -i "nvidia" /var/log/Xorg.0.log:

[     9.324] (II) NOUVEAU driver for NVIDIA chipset families :
[     9.912] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=8 (/dev/input/event22)
[     9.912] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=7 (/dev/input/event21)
[     9.912] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=3 (/dev/input/event20)
[     9.912] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=8 (/dev/input/event18)
[     9.913] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=9 (/dev/input/event23)
[     9.913] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=9 (/dev/input/event19)
[     9.913] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=3 (/dev/input/event16)
[     9.913] (II) config/udev: Adding input device HDA NVidia HDMI/DP,pcm=7 (/dev/input/event17)

output of dpkg -l | grep nvidia:

ii  libnvidia-cfg1-440:amd64                   440.64.00-0ubuntu1                               amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-440                       440.64.00-0ubuntu1                               all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-390:i386                 390.116-0ubuntu0.18.04.1                         i386         NVIDIA libcompute package
ii  libnvidia-compute-440:amd64                440.64.00-0ubuntu1                               amd64        NVIDIA libcompute package
ii  libnvidia-decode-440:amd64                 440.64.00-0ubuntu1                               amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-440:amd64                 440.64.00-0ubuntu1                               amd64        NVENC Video Encoding runtime library
ii  libnvidia-fbc1-440:amd64                   440.64.00-0ubuntu1                               amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-440:amd64                     440.64.00-0ubuntu1                               amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-440:amd64                   440.64.00-0ubuntu1                               amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-compute-utils-440                   440.64.00-0ubuntu1                               amd64        NVIDIA compute utilities
ii  nvidia-cuda-dev                            9.1.85-3ubuntu1                                  amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                            9.1.85-3ubuntu1                                  all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                            9.1.85-3ubuntu1                                  amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                        9.1.85-3ubuntu1                                  amd64        NVIDIA CUDA development toolkit
ii  nvidia-dkms-440                            440.64.00-0ubuntu1                               amd64        NVIDIA DKMS package
ii  nvidia-driver-440                          440.64.00-0ubuntu1                               amd64        NVIDIA driver metapackage
ii  nvidia-kernel-common-440                   440.64.00-0ubuntu1                               amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-440                   440.64.00-0ubuntu1                               amd64        NVIDIA kernel source package
ii  nvidia-opencl-dev:amd64                    9.1.85-3ubuntu1                                  amd64        NVIDIA OpenCL development files
ii  nvidia-prime                               0.8.8.2                                          all          Tools to enable NVIDIA's Prime
ii  nvidia-profiler                            9.1.85-3ubuntu1                                  amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                            440.64.00-0ubuntu1                               amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-440                           440.64.00-0ubuntu1                               amd64        NVIDIA driver support binaries
ii  nvidia-visual-profiler                     9.1.85-3ubuntu1                                  amd64        NVIDIA Visual Profiler for CUDA and OpenCL
ii  xserver-xorg-video-nvidia-440              440.64.00-0ubuntu1                               amd64        NVIDIA binary Xorg driver

output of lspci:

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
00:19.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
00:19.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller (rev 02)
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller (rev 02)
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge (rev 02)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
03:00.0 Network controller: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter (rev 32)
04:00.0 Network controller: Wilocity Ltd. Wil6200 802.11ad Wireless Network Adapter (rev 02)
05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
08:00.0 USB controller: ASMedia Technology Inc. Device 2142
09:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
0a:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1)
0a:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
0a:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
0a:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
0b:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
0c:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
0c:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 1455
0d:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
0d:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex
40:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit
40:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
40:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
40:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
40:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
40:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
41:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
42:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
43:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1)
43:00.1 Audio device: NVIDIA Corporation Device 10f7 (rev a1)
43:00.2 USB controller: NVIDIA Corporation Device 1ad6 (rev a1)
43:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad7 (rev a1)
44:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
44:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
44:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
45:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 1455
45:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

output of lshw -class display:

  *-display UNCLAIMED
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:0a:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller cap_list
       configuration: latency=0
       resources: iomemory:480-47f iomemory:480-47f memory:ba000000-baffffff memory:4820000000-482fffffff memory:4830000000-4831ffffff ioport:2000(size=128) memory:bb000000-bb07ffff
  *-display UNCLAIMED
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:43:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller cap_list
       configuration: latency=0
       resources: memory:9b000000-9bffffff memory:80000000-8fffffff memory:90000000-91ffffff ioport:4000(size=128) memory:9c000000-9c07ffff

output of find /lib/modules -type f | grep nvidia:

/lib/modules/5.3.0-46-generic/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.3.0-46-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.3.0-46-generic/kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko
/lib/modules/5.3.0-46-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.3.0-46-generic/updates/dkms/nvidia-drm.ko
/lib/modules/5.3.0-46-generic/updates/dkms/nvidia.ko
/lib/modules/5.3.0-46-generic/updates/dkms/nvidia-uvm.ko
/lib/modules/5.3.0-46-generic/updates/dkms/nvidia-modeset.ko
/lib/modules/5.3.0-40-generic/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.3.0-40-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.3.0-40-generic/kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko
/lib/modules/5.3.0-40-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/4.15.0-96-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/4.15.0-96-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko
/lib/modules/5.3.0-45-generic/kernel/drivers/usb/typec/altmodes/typec_nvidia.ko
/lib/modules/5.3.0-45-generic/kernel/drivers/video/fbdev/nvidia/nvidiafb.ko
/lib/modules/5.3.0-45-generic/kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko
/lib/modules/5.3.0-45-generic/kernel/drivers/net/ethernet/nvidia/forcedeth.ko

The output of apt list --installed

The contents of /var/log/apt/history.*

BTW, .run installers can be uninstalled using the --uninstall option.
There seems to be a driver installed, but it doesn’t load. My guess would be a blacklist file. Please run

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

to find a file containing

blacklist nvidia

and remove it,
then run

sudo update-initramfs -u

and reboot.
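
A plain grep for nvidia also matches unrelated modules such as nvidiafb, so a stricter match can help narrow things down. The following is only a sketch: find_nvidia_blacklists is a made-up helper name, and it only looks at blacklist directives, not at other ways of disabling a module.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints file:line:content for every modprobe config
# line that blacklists the nvidia module or one of its companions
# (nvidia-drm, nvidia_uvm, nvidia-modeset, ...), but not nvidiafb.
find_nvidia_blacklists() {
    grep -HnE '^[[:space:]]*blacklist[[:space:]]+nvidia([-_][a-z]+)?[[:space:]]*$' "$@" 2>/dev/null
}

# Usage:
#   find_nvidia_blacklists /etc/modprobe.d/* /lib/modprobe.d/*
# After deleting or commenting out the reported lines:
#   sudo update-initramfs -u
#   sudo reboot
```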

@generix oh my word. I just spent 30 hours trying to get this straightened out and you nailed it in two lines! What a sniper!

Thank you a million!

I have a similar problem. I’m trying to install tensorflow-gpu. If I install Ubuntu 20.04 and tick the box during installation so that the drivers are installed, nvidia-smi works and shows that I have Cuda 10.2. I need Cuda 10.1.

I have tried 3 different options all yielding the same result:
Install 20.04 without drivers
Uninstall Cuda 10.2
Install 18.04

All of these give
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Nvidia is mentioned in blacklist-framebuffer.conf, so I put a # in front of it, ran
sudo update-initramfs -u
and rebooted. I still get the same error when I run nvidia-smi, both in Ubuntu 20.04 and 18.04.

You’re misinterpreting the output. nvidia-smi displays the maximum supported cuda version of the driver, not which cuda-toolkit version you actually have installed.
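
To see the two numbers side by side, something like the following sketch could be used. The helper names driver_max_cuda and toolkit_cuda are made up; they only parse text, and they assume the usual "CUDA Version: X.Y" field in the nvidia-smi header and the "release X.Y," line printed by nvcc --version.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that read command output on stdin.

# Highest CUDA version the installed driver supports
# (the "CUDA Version: X.Y" field in the nvidia-smi header).
driver_max_cuda() {
    sed -nE 's/.*CUDA Version: *([0-9]+\.[0-9]+).*/\1/p' | head -n 1
}

# CUDA toolkit version actually installed
# (the "release X.Y," part of the `nvcc --version` output).
toolkit_cuda() {
    sed -nE 's/.*release ([0-9]+\.[0-9]+),.*/\1/p' | head -n 1
}

# Usage on a machine with both tools on PATH:
#   nvidia-smi | driver_max_cuda
#   nvcc --version | toolkit_cuda
```

If nvcc is not found, no toolkit is on the PATH, even though nvidia-smi still reports a CUDA version.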

Perfect, just double checking that I have understood:

  1. I install Ubuntu 20.04 with drivers (nvidia-smi will work)
  2. I install tensorflow as below, but I skip the “sudo apt-get install --no-install-recommends nvidia-driver-430” as I already have the 440 driver installed

Is this set-up OK?

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver - I will not do this part
sudo apt-get install --no-install-recommends nvidia-driver-430
# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1

# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get install -y --no-install-recommends libnvinfer6=6.0.1-1+cuda10.1 \
    libnvinfer-dev=6.0.1-1+cuda10.1 \
    libnvinfer-plugin6=6.0.1-1+cuda10.1

No. One detail of these instructions is wrong: installing
cuda-10-1
will overwrite the driver with an older, most likely incompatible driver. Instead, install
cuda-toolkit-10-1
which will leave the repo driver intact.
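
If you want to check this before changing anything, apt-get can simulate the install with -s. The sketch below filters the simulation output for driver-related packages; driver_changes is a made-up helper name, and the package name prefixes in the pattern are an assumption about how the driver packages are named on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `apt-get -s install ...` output on stdin and
# prints the planned install (Inst) and remove (Remv) actions that touch
# NVIDIA driver packages. Prints nothing if the driver is left alone.
driver_changes() {
    grep -E '^(Inst|Remv) (nvidia-driver|nvidia-dkms|nvidia-kernel|libnvidia)' || true
}

# Usage (simulation only, nothing gets installed):
#   apt-get -s install --no-install-recommends cuda-10-1 | driver_changes
#   apt-get -s install --no-install-recommends cuda-toolkit-10-1 | driver_changes
```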

Thanks!! Very much appreciated!!!

Just double checking - I will not use this:

instead I will use

Yes.

I managed to install everything as above without any errors.
I ran pip install tensorflow-gpu==1.15
and then tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)
Then I got these errors:

2020-06-19 08:37:36.314925: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-19 08:37:36.316446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
2020-06-19 08:37:36.316759: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcudart.so.10.0’; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.316959: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcublas.so.10.0’; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.317145: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcufft.so.10.0’; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.317324: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcurand.so.10.0’; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.317506: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcusolver.so.10.0’; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.317692: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library ‘libcusparse.so.10.0’; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64
2020-06-19 08:37:36.317740: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-19 08:37:36.317762: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-06-19 08:37:36.317804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-19 08:37:36.317827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-06-19 08:37:36.317846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N

What is the problem?

The tensorflow you’re using wants cuda 10.0, but you have installed 10.1.

I do not get this. I have just followed the official tensorflow instructions

With the exception that I did not install the Nvidia driver and used cuda-toolkit-10-1 instead of cuda-10-1

It really says Ubuntu 18.04 (CUDA 10.1). Super strange!
How can I downgrade to 10.0?

From install instructions:

TensorFlow supports CUDA 10.1 (TensorFlow >= 2.1.0)

To go back to cuda 10.0, you would have to purge anything nvidia and cuda, then start from scratch and find the correct cudnn version.
It’s always annoying that it’s not clearly stated which cuda/cudnn versions the tensorflow binaries are built against. So it’s sometimes better to install tensorflow first, run it, and check the error messages to see which libraries it’s actually looking for, so you know which cuda version to install. Otherwise, build tensorflow from source.
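
As a rough sketch of that approach, the library names and versions can be pulled out of the "Could not load dynamic library" warnings. missing_cuda_libs is a made-up helper name; it assumes the warning format shown in the log earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a TensorFlow log on stdin and prints each
# shared library (with its version suffix) that failed to load.
missing_cuda_libs() {
    grep 'Could not load dynamic library' \
        | grep -oE 'lib[a-z]+\.so\.[0-9]+(\.[0-9]+)*' \
        | sort -u
}

# Usage (TensorFlow writes these warnings to stderr):
#   python3 -c 'import tensorflow as tf; tf.test.is_gpu_available()' 2>&1 | missing_cuda_libs
```

On the log posted above this would list names ending in .so.10.0, which is the hint that this tensorflow build expects cuda 10.0.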

How do I purge everything?

Then I guess that I have to find the 10.0 version of some of these files?

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo dpkg -i cuda-repo-ubuntu1804_10.1.243-1_amd64.deb
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

sudo apt remove '*nvidia*' 'cuda*'

Edit:
You’ll probably also have to remove cudnn and libnvinfer:

sudo apt remove 'libcudnn*' 'libnvinfer*'

This should remove everything (including the driver).
Then reinstall the driver:
sudo apt install nvidia-driver-440

Cuda 10.0:
https://developer.nvidia.com/cuda-10.0-download-archive
The machine-learning repo is the same, but you’ll have to install other cudnn and nvinfer versions
Check the repo contents:
http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/
and install the versions that end with cuda10.0
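
Once the machine-learning repo is configured, the matching versions can also be listed locally instead of browsing the repo page. This is only a sketch: versions_for_cuda is a made-up helper name, and it assumes the usual "package | version | source" layout of apt-cache madison output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `apt-cache madison <package>` output on stdin
# and prints the versions whose name ends in +cuda<suffix>.
#   $1: CUDA version suffix, e.g. 10.0
versions_for_cuda() {
    awk -F'|' -v suffix="+cuda$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $2) }          # trim the version column
        length($2) >= length(suffix) &&
        substr($2, length($2) - length(suffix) + 1) == suffix { print $2 }
    ' | sort -u
}

# Usage:
#   apt-cache madison libcudnn7 | versions_for_cuda 10.0
#   apt-cache madison libnvinfer6 | versions_for_cuda 10.0
```

A version printed here can then be pinned in the install command in the form package=version, as in the instructions quoted earlier.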

How did you remove it, though?

The result of
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*

is
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb

I suppose this is what you are referring to?
Should I remove it? How?

Hi,

I tried this solution but still got the same error. The solution that worked for me was to install a newer version of cuda-toolkit. For some reason, though, the error recurs after a few days, and each time I have to install a version of cuda more recent than the one already installed. Any ideas as to why the nvidia driver suddenly stops functioning?