[SOLVED] CUDA 9.0rc and NVIDIA 384.69 but driver version is insufficient for CUDA runtime version

I’m running Ubuntu 16.04.3, with the Nvidia 384.69 drivers installed through Ubuntu’s “Software & Updates” > “Additional Drivers” UI. I also installed bumblebee, primus, mesa and bumblebee-nvidia.

I’ve also set the relevant PATH and library environment variables:

# NVIDIA
export PATH="$PATH:/usr/local/cuda-9.0/bin"
export PATH="$PATH:/usr/lib/nvidia-384/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0/lib64"
export CUDA_HOME=/usr/local/cuda-9.0
export CUDADIR=/usr/local/cuda-9.0
export GLPATH=/usr/lib
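One quick sanity check on this setup (a sketch, not part of the original post): on 64-bit Linux the CUDA 9.0 runtime libraries live under lib64, so the dynamic linker should be able to see libcudart once the paths are right.

```shell
# Report whether the CUDA runtime library is visible to the dynamic linker.
# On a machine without CUDA this simply prints the "NOT found" branch.
if ldconfig -p 2>/dev/null | grep -q libcudart; then
    echo "libcudart visible to the dynamic linker"
else
    echo "libcudart NOT found; check LD_LIBRARY_PATH / ldconfig"
fi
```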
$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.69  Wed Aug 16 19:34:54 PDT 2017
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
$ dpkg -l | grep nvidia
ii  bumblebee-nvidia                                            3.2.1-10                                     amd64        NVIDIA Optimus support using the proprietary NVIDIA driver
ii  nvidia-384                                                  384.69-0ubuntu0~gpu16.04.1                   amd64        NVIDIA binary driver - version 384.69
ii  nvidia-opencl-icd-384                                       384.69-0ubuntu0~gpu16.04.1                   amd64        NVIDIA OpenCL ICD
rc  nvidia-prime                                                0.8.2                                        amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                             384.69-0ubuntu0~gpu16.04.1                   amd64        Tool for configuring the NVIDIA graphics driver

I installed CUDA 9.0rc through the runfile method, skipping the option to install the bundled driver, which is older (384.59). When I build and run the CUDA 9.0 deviceQuery sample, I get this error:

optirun ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL

Without optirun,

./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
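For reference, the two return codes above correspond to these enum names in CUDA 9’s driver_types.h; a tiny lookup sketch:

```shell
# Map the cudaGetDeviceCount return codes seen above to their enum names
# (values as defined in CUDA 9's driver_types.h).
cuda_err_name() {
    case "$1" in
        30) echo "cudaErrorUnknown" ;;
        35) echo "cudaErrorInsufficientDriver" ;;
        *)  echo "error $1 (see driver_types.h)" ;;
    esac
}

cuda_err_name 30   # cudaErrorUnknown
cuda_err_name 35   # cudaErrorInsufficientDriver
```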

nvidia-bug-report.log.gz (60.5 KB)

Message removed. Not related.

CUDA 9 (and 8, and 7) require newer drivers than the 342.01 driver you have. That old GPU is no longer supported by recent CUDA versions; the last CUDA version supporting it was CUDA 6.5. It is expected behavior that CUDA 9 will not work with that notebook.

That is indicated by this error:

cudaErrorInsufficientDriver

and is discussed in many other postings on this forum.

It is unrelated to the Linux issue reported by the OP in this thread.

@poppingtonic, referring to the contents of the log file you attached, it seems you have an Acer Predator laptop with a GTX 1060 GPU. Be advised that laptops that originally ship with Windows in an Optimus configuration can be fairly challenging to set up properly in Linux. Having said that, there are a few more basic issues in your setup.

I generally don’t recommend that people use an NVIDIA driver from a source other than NVIDIA. The NVIDIA driver can be packaged in a variety of ways, with or without certain modules, and excluding some of them can make features like CUDA unusable or incorrectly configured. Looking at the dmesg log contained in the bug report, this appears to be the case here:

/var/log/dmesg:

journalctl -b -0:
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-opencl-icd-375.list: No such file or directory
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-modprobe.list: No such file or directory
Sep 07 23:33:15 aleph0 ureadahead[366]: ureadahead:/var/lib/dpkg/info/nvidia-375.list: No such file or directory

Sep 07 23:33:57 aleph0 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 07 23:33:57 aleph0 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module 384.69 Wed Aug 16 19:34:54 PDT 2017 (using threaded interrupts)
Sep 07 23:33:57 aleph0 systemd[4855]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Sep 07 23:33:57 aleph0 systemd-udevd[4864]: failed to execute '/usr/bin/nvidia-smi' '/usr/bin/nvidia-smi': No such file or directory
Sep 07 23:33:57 aleph0 systemd-udevd[4850]: Process '/usr/bin/nvidia-smi' failed with exit code 2.
Sep 07 23:33:57 aleph0 systemd[4871]: nvidia-persistenced.service: Failed at step EXEC spawning /usr/bin/nvidia-persistenced: No such file or directory
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=203
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Unit entered failed state.
Sep 07 23:33:57 aleph0 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Sep 07 23:33:57 aleph0 bumblebeed[1083]: [ 50.171598] [ERROR][XORG] (EE) Failed to load /usr/lib/nvidia-384/xorg/libglx.so: libnvidia-tls.so.384.69: cannot open shared object file: No such file or directory
[the same nvidia-persistenced and libglx.so failures repeat through 23:43:13]

In short, I would say your driver install is broken. That is also indicated by this basic problem report:

-> CUDA driver version is insufficient for CUDA runtime version

It also appears that you have some components from another driver branch (375).
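One quick way to spot such leftovers (a sketch; the filter simply scans dpkg output for installed “ii” and removed-but-not-purged “rc” NVIDIA entries, which can leave files from an old branch behind):

```shell
# Print NVIDIA-related packages that are installed (ii) or removed but not
# purged (rc); 'rc' entries can leave config files from an old branch around.
nvidia_leftovers() {
    awk '/nvidia/ && ($1 == "ii" || $1 == "rc") { print $1, $2 }'
}

# Demo on sample dpkg output; on a real system: dpkg -l | nvidia_leftovers
printf '%s\n' \
    'ii  nvidia-384        384.69  amd64  NVIDIA binary driver' \
    'rc  nvidia-prime      0.8.2   amd64  Tools to enable Prime' \
    'ii  bash              4.3     amd64  GNU Bourne Again SHell' \
    | nvidia_leftovers
```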

If you start over with a clean OS load, and actually install the driver from NVIDIA, I think you may have better luck.

Get it working with 384.59 first. The difference between that and 384.69 is not that important if you want to get CUDA running.
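One way to check whether the installed kernel module meets a given minimum version (a sketch using sort -V; the 384.59 minimum here is just the driver bundled with the 9.0rc runfile):

```shell
# version_ge A B -> success (exit 0) when version A >= version B
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# On a real system the installed version can be pulled from the kernel module:
#   installed=$(grep -oE '[0-9]+\.[0-9]+' /proc/driver/nvidia/version | head -n1)
version_ge 384.69 384.59 && echo "384.69 is new enough"
version_ge 342.01 384.59 || echo "342.01 is too old"
```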

Here’s a new log, after clearing other driver components. I installed the driver using the runfile, then uninstalled it to try using 384.69.

nvidia-bug-report.log.gz (60.6 KB)
dump.zip (46 KB)

Some good news: I got this to work.

python examples/mnist_cnn.py
Using TensorFlow backend.
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-09-11 01:31:39.863286: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863308: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863331: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863335: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:39.863365: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-11 01:31:40.193552: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-09-11 01:31:40.194152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1060
major: 6 minor: 1 memoryClockRate (GHz) 1.6705
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.86GiB
2017-09-11 01:31:40.194166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-11 01:31:40.194188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-09-11 01:31:40.194218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0)
60000/60000 [==============================] - 9s - loss: 0.3327 - acc: 0.8996 - val_loss: 0.0784 - val_acc: 0.9756
Epoch 2/12
60000/60000 [==============================] - 6s - loss: 0.1113 - acc: 0.9669 - val_loss: 0.0558 - val_acc: 0.9817
Epoch 3/12
60000/60000 [==============================] - 6s - loss: 0.0835 - acc: 0.9751 - val_loss: 0.0428 - val_acc: 0.9856
Epoch 4/12
60000/60000 [==============================] - 6s - loss: 0.0697 - acc: 0.9794 - val_loss: 0.0366 - val_acc: 0.9881
Epoch 5/12
60000/60000 [==============================] - 6s - loss: 0.0609 - acc: 0.9818 - val_loss: 0.0352 - val_acc: 0.9885
Epoch 6/12
60000/60000 [==============================] - 6s - loss: 0.0555 - acc: 0.9835 - val_loss: 0.0323 - val_acc: 0.9897
Epoch 7/12
60000/60000 [==============================] - 6s - loss: 0.0501 - acc: 0.9853 - val_loss: 0.0307 - val_acc: 0.9901
Epoch 8/12
60000/60000 [==============================] - 6s - loss: 0.0444 - acc: 0.9863 - val_loss: 0.0278 - val_acc: 0.9907
Epoch 9/12
60000/60000 [==============================] - 6s - loss: 0.0430 - acc: 0.9868 - val_loss: 0.0306 - val_acc: 0.9900
Epoch 10/12
60000/60000 [==============================] - 6s - loss: 0.0403 - acc: 0.9879 - val_loss: 0.0292 - val_acc: 0.9901
Epoch 11/12
60000/60000 [==============================] - 6s - loss: 0.0372 - acc: 0.9889 - val_loss: 0.0302 - val_acc: 0.9903
Epoch 12/12
60000/60000 [==============================] - 6s - loss: 0.0367 - acc: 0.9888 - val_loss: 0.0270 - val_acc: 0.9910

I resolved this by doing the following:

I stopped assuming that optirun would power on the card and load the relevant kernel modules. Here’s what I did, using bbswitch to manage power for the card.

$ sudo tee /proc/acpi/bbswitch <<< ON
ON
$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
$ sudo modprobe nvidia_384
$ sudo modprobe nvidia_384_uvm
$ lsmod | grep nvidia
nvidia_uvm            684032  0
nvidia              12976128  1 nvidia_uvm

And to turn it off:

$ sudo rmmod nvidia_uvm
$ sudo rmmod nvidia
$ lsmod | grep nvidia
$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
$ sudo tee /proc/acpi/bbswitch <<< OFF
OFF
$ cat /proc/acpi/bbswitch
0000:01:00.0 OFF
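The two sequences above could be wrapped in a single helper (a hypothetical sketch; the module names assume Ubuntu’s nvidia-384 packaging, and the real operations need root):

```shell
# gpu_power on|off -- toggle the discrete GPU via bbswitch and
# (un)load the Ubuntu nvidia-384 kernel modules.
gpu_power() {
    case "$1" in
        on)
            echo ON | sudo tee /proc/acpi/bbswitch
            sudo modprobe nvidia_384
            sudo modprobe nvidia_384_uvm
            ;;
        off)
            sudo rmmod nvidia_uvm nvidia
            echo OFF | sudo tee /proc/acpi/bbswitch
            ;;
        *)
            echo "usage: gpu_power on|off" >&2
            return 1
            ;;
    esac
}
```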

I ran into a similar issue to the original post.
My solution was simply to run the CUDA samples as root.

Thanks @poppingtonic, this workaround gets it going for me. For some reason, mine needs this to turn the card on, but it turns off on its own. In my case a script seems to help:

#!/bin/bash
tee /proc/acpi/bbswitch <<< ON
modprobe nvidia_384
modprobe nvidia_384_uvm
sudo -u <username> optirun "$1"

with <username> replaced accordingly.

I saved this to a file called “withcuda”, making sure the directory is on the PATH. Then

chmod +x withcuda

makes it executable. Now I can just run, e.g.,

sudo withcuda deviceQuery

and it works. So far, anyway.

If you have an Intel CPU with an integrated GPU and need the NVIDIA GPU only for compute, NOT for display rendering, do the following:
uninstall all CUDA drivers
install mesa
press Ctrl+Alt+F1 -> log in to a command shell
type: sudo service lightdm stop
download the CUDA driver 384 runfile
install the driver runfile and choose NO when prompted about OpenGL and the X server
reboot
download and unpack the cuDNN libs: tar -xzvf cudnn-9.0-linux-x64-v7.tgz
copy them to the cuda-9.0 dir:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
make sure you have added the cuda-9.0 paths to PATH and LD_LIBRARY_PATH
run sudo ldconfig to link the CUDA libs
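To confirm which cuDNN version ended up under /usr/local/cuda, the version macros can be read back out of the header. A sketch, demonstrated on a stand-in header so it runs anywhere (the real file would be /usr/local/cuda/include/cudnn.h, and 7.0.5 is just an example value):

```shell
# Stand-in for /usr/local/cuda/include/cudnn.h, so the command can be shown
# without a real install.
cat > /tmp/cudnn.h <<'EOF'
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 0
#define CUDNN_PATCHLEVEL 5
EOF

# Assemble the version string from the three #define lines.
awk '/#define CUDNN_MAJOR/ {ma=$3}
     /#define CUDNN_MINOR/ {mi=$3}
     /#define CUDNN_PATCHLEVEL/ {pl=$3}
     END {print ma "." mi "." pl}' /tmp/cudnn.h
# prints 7.0.5
```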

Hi,
I have a similar problem.
I have a two-GPU configuration: a Vega 64 for displays and a 780 Ti for CUDA.
I managed to install both and checked with deviceQuery that CUDA was working properly.
After a restart, running deviceQuery again gives error 35.

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  390.48  Thu Mar 22 00:42:57 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

Any ideas how to deal with this issue?