CUDA 9.1 on Ubuntu 16.04 installed, but deviceQuery fails

nvidia-bug-report.sh output: nvidia-bug-report.log.gz (uploaded to Google Drive)

$ uname -a
Linux roswell 4.15.0-38-generic #41~16.04.1-Ubuntu SMP Wed Oct 10 20:16:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I am following the CUDA installation guide. I need to use CUDA 9.1 because it’s the version used by the tools I ultimately need to work with.

I installed the latest NVIDIA driver:

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  410.73  Sat Oct 20 22:12:33 CDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)

I installed CUDA 9.1 from the net installer.

I built the samples.

deviceQuery failed:

$ bin/x86_64/linux/release/deviceQuery
bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
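For reference, the code 30 printed above corresponds to cudaErrorUnknown in the CUDA 9.x runtime. A quick lookup sketch (error values transcribed from the CUDA 9 headers; verify against your own driver_types.h, since CUDA 10.1 and later renumbered several of these, e.g. cudaErrorUnknown moved from 30 to 999):

```python
# A few cudaError_t values as defined in the CUDA 9.x headers.
CUDA9_ERRORS = {
    0: "cudaSuccess",
    3: "cudaErrorInitializationError",
    30: "cudaErrorUnknown",
    35: "cudaErrorInsufficientDriver",
    38: "cudaErrorNoDevice",
}

def decode(code):
    """Map a CUDA 9.x runtime error code to its enum name."""
    return CUDA9_ERRORS.get(code, "unrecognized code %d" % code)

print(decode(30))  # cudaErrorUnknown
```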

Any troubleshooting help would be most appreciated.

Here’s some output from your bug report log (specifically from dmesg | grep NVRM):

Nov 02 07:58:59 roswell kernel: NVRM: API mismatch: the client has the version 387.26, but
                                NVRM: this kernel module has the version 410.73.  Please
                                NVRM: make sure that this kernel module and all NVIDIA driver
                                NVRM: components have the same version.
Nov 02 07:58:59 roswell kernel: NVRM: API mismatch: the client has the version 387.26, but
                                NVRM: this kernel module has the version 410.73.  Please
                                NVRM: make sure that this kernel module and all NVIDIA driver
                                NVRM: components have the same version.
Nov 02 08:09:08 roswell systemd[1]: Configuration file /lib/systemd/system/nvidia-persistenced.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Nov 02 08:09:09 roswell userdel[29548]: delete user 'nvidia-persistenced'
Nov 02 08:09:09 roswell userdel[29548]: removed group 'nvidia-persistenced' owned by 'nvidia-persistenced'
Nov 02 08:09:09 roswell userdel[29548]: removed shadow group 'nvidia-persistenced' owned by 'nvidia-persistenced'
Nov 02 08:32:24 roswell gnome-session[1637]: (gnome-software:1755): As-WARNING **: failed to rescan: Failed to parse /usr/share/applications/nvidia-settings.desktop.dpkg-new file: cannot process file of type text/plain
Nov 02 08:32:24 roswell gnome-session[1637]: (gnome-software:1755): As-WARNING **: failed to rescan: Failed to parse /usr/share/applications/nvidia-settings.desktop.dpkg-tmp file: cannot process file of type text/plain
Nov 02 08:32:24 roswell gnome-session[1637]: (gnome-software:1755): As-WARNING **: failed to rescan: Failed to parse /usr/share/applications/nvidia-settings.desktop file: cannot process file of type application/x-desktop
Nov 02 08:32:29 roswell groupadd[8035]: group added to /etc/group: name=nvidia-persistenced, GID=131
Nov 02 08:32:29 roswell groupadd[8035]: group added to /etc/gshadow: name=nvidia-persistenced
Nov 02 08:32:29 roswell groupadd[8035]: new group: name=nvidia-persistenced, GID=131
Nov 02 08:32:29 roswell useradd[8039]: new user: name=nvidia-persistenced, UID=124, GID=131, home=/, shell=/sbin/nologin
Nov 02 08:32:29 roswell usermod[8044]: change user 'nvidia-persistenced' password
Nov 02 08:32:29 roswell chage[8049]: changed password expiry for nvidia-persistenced
Nov 02 08:32:29 roswell chfn[8052]: changed user 'nvidia-persistenced' information
Nov 02 08:55:23 roswell kernel: NVRM: API mismatch: the client has the version 410.72, but
                                NVRM: this kernel module has the version 410.73.  Please
                                NVRM: make sure that this kernel module and all NVIDIA driver
                                NVRM: components have the same version.
Nov 02 10:27:44 roswell kernel: NVRM: API mismatch: the client has the version 410.72, but
                                NVRM: this kernel module has the version 410.73.  Please
                                NVRM: make sure that this kernel module and all NVIDIA driver
                                NVRM: components have the same version.
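The mismatch reported in those lines can be extracted mechanically. A small sketch (Python; the log text is inlined from the excerpt above) that pulls the client and kernel-module versions out of NVRM dmesg lines:

```python
import re

# Sample dmesg lines, taken from the log excerpt above.
dmesg = """\
NVRM: API mismatch: the client has the version 387.26, but
NVRM: this kernel module has the version 410.73.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
"""

def find_mismatch(text):
    """Return (client_version, module_version) if NVRM reports a mismatch."""
    m = re.search(
        r"client has the version (\d+\.\d+).*?"
        r"kernel module has the version (\d+\.\d+)",
        text,
        re.S,
    )
    return m.groups() if m else None

print(find_mismatch(dmesg))  # ('387.26', '410.73')
```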

So it looks like you’ve been installing multiple driver versions.

If you want to use CUDA, I recommend installing drivers only from an NVIDIA source, not from PPA archives or any other third-party source. Furthermore, depending on how you installed each of these several drivers, things may be badly tangled. If at any point you mixed a runfile install with a previous package manager install, that is a recipe for breaking things.

I would recommend following the instructions in the linux install guide regarding “handling conflicting installations” to completely clean out all old installs of GPU drivers. Then pick a driver to install, and follow the linux install guide carefully.

I also note that your GPU is driving a display. If this is a laptop, be advised that laptop Linux installs may require extra effort, such as careful use of nvidia-prime.

Thanks for the quick reply.

I thought I did uninstall the first attempt, which was using the local installer. Apparently it left some junk around.

The install guide says to get the latest driver from NVIDIA. Check, got it. It only comes as a .run file, at least through the public website.

The pinned post here in the forum says to use the net installer for CUDA. It is only available as a .deb.

How do I reconcile these two best practices, which seem to be at odds with each other?

Since the net installer for CUDA seemed to include the 410.72 driver, do I really need the 410.73 driver that the .run file installs?

One pinned post refers to CUDA 8. Are you installing CUDA 8?
The pinned post referencing CUDA 9.1 doesn’t say anywhere in it “you must use a network installer”. It says:

Before installing CUDA 9.1, ensure that you have the latest NVIDIA driver R390 installed. The latest NVIDIA R390 driver is available at: www.nvidia.com/drivers

The CUDA network repositories have also been updated with the latest R390 driver packages. For more information about installing driver and CUDA from the network repository, see the Linux Installation Guide at: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

You can use a runfile installation of a driver with a deb-installed CUDA toolkit.

Read the linux install guide. It indicates how to do a toolkit-only install using a package manager method.

The problems arise when you attempt to install a driver via package manager, and then a driver via runfile installer, without doing a full cleanup in-between.

There is no conflict between a driver installed via the runfile installer and a CUDA toolkit (with no driver) installed via the package manager.

Read the linux install guide. In its entirety.

The 410.72 driver should be fine. The 410.73 should also be fine.

Just don’t mix a driver runfile install with a driver package manager install (unless you do a full cleanup in-between).

FYI to wrap this up.

I ran apt remove on all cuda and nvidia packages except for the repo definition.
I ran apt list -i | grep -i cuda and apt list -i | grep -i nvidia to make sure only the repo definition remained.

I then re-ran the net install for the “cuda-9-1” target and rebooted.
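The "only the repo definition remains" check described above can be mechanized. A small sketch (Python; the sample apt list lines and the cuda-repo package-name prefix are assumptions for illustration, not taken from the actual system):

```python
import re

# Illustrative output lines in the format of `apt list --installed`.
sample = """\
cuda-repo-ubuntu1604/now 9.1.85-1 amd64 [installed,local]
nvidia-410/now 410.73-0ubuntu1 amd64 [installed]
"""

def leftover_packages(apt_list_output, keep=("cuda-repo",)):
    """Return installed cuda/nvidia packages other than the repo definition."""
    leftovers = []
    for line in apt_list_output.splitlines():
        name = line.split("/", 1)[0]  # package name precedes the first '/'
        if re.search(r"cuda|nvidia", name, re.I) and not name.startswith(keep):
            leftovers.append(name)
    return leftovers

print(leftover_packages(sample))  # ['nvidia-410']
```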

The deviceQuery sample program now works.