Getting cudaRuntimeGetVersion() failed with error #35 for CUDA Version 7.5.18 with 361.42 driver

Hello,

I’m trying to run the DIGITS 4.0 Docker image on an EC2 machine using nvidia-docker.

My EC2 machine has the 361.42 NVIDIA driver up and running, and nvidia-docker connects to it fine. Using the nvidia/cuda image, I was able to verify with nvidia-smi that a GPU is detected and that the driver version is indeed 361.42:

+------------------------------------------------------+                       
| NVIDIA-SMI 361.42     Driver Version: 361.42         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   36C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But when running nvidia/digits, I get the following error in the log:
cudaRuntimeGetVersion() failed with error #35

If I understand correctly, this means my driver version is too old for the CUDA runtime.
But 361.42 is a pretty recent release, isn’t it?
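
As a rough illustration of what error #35 is checking: CUDA reports both the runtime version and the version supported by the driver as integers of the form 1000*major + 10*minor (so 7.5 is 7050), and the error is raised when the driver's number is lower than the runtime's. The sketch below only models that comparison; the helper names are made up for illustration and are not part of any CUDA API, and the driver values are illustrative, not measured on this machine.

```python
# Hypothetical helpers (not part of CUDA) that model the comparison behind
# error 35, "CUDA driver version is insufficient for CUDA runtime version".


def decode_cuda_version(value):
    """Split a CUDA version integer (1000*major + 10*minor) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def driver_is_sufficient(driver_cuda_version, runtime_version):
    """Return True when the driver supports at least the runtime's CUDA version."""
    return driver_cuda_version >= runtime_version


if __name__ == "__main__":
    runtime = 7050  # CUDA 7.5, the runtime shipped in the DIGITS 4 image
    # Illustrative driver-side values only; the real one comes from the driver.
    for driver in (7000, 7050, 8000):
        verdict = "ok" if driver_is_sufficient(driver, runtime) else "error 35"
        print(decode_cuda_version(driver), "vs", decode_cuda_version(runtime), verdict)
```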

DIGITS 4 uses CUDA 7.5.18, according to its /usr/local/cuda/version.txt

Any suggestions?

What OS are you using on that instance?

How did you install the 361.42 driver?

It’s Ubuntu 15.10 (GNU/Linux 4.2.0-42-generic x86_64). This is what I did from the beginning:

sudo apt-get update
sudo apt-get install --no-install-recommends -y gcc make libc-dev
wget -P /tmp http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --silent
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
sudo apt-get install dkms build-essential linux-headers-generic
sudo nano /etc/modprobe.d/blacklist-nouveau.conf

adding the following lines:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

save and quit

$echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$sudo update-initramfs -u

I may have had to run the NVIDIA installer again at this stage (exactly the same two installer commands as before).

And finally
$sudo usermod -aG docker ubuntu
$sudo service nvidia-docker start

I then made sure both the docker and nvidia-docker-plugin services are up:

$service nvidia-docker status
$service docker status

And as mentioned above, the nvidia/cuda image is able to run nvidia-smi, and the GPU and driver version show up as expected…

Also might be related:

Running nvidia-docker build on a Dockerfile based on nvidia/cuda:7.0-cudnn4-devel-ubuntu14.04, which clones the master branch of Caffe and compiles it with cuDNN enabled, fails at the beginning of testing with the following error:

Cuda number of devices: 0
Setting to use device 0
Current device id: 0
Current device name: 
Note: Randomizing tests' orders with a seed of 21847 .
[==========] Running 2081 tests from 277 test cases.
[----------] Global test environment set-up.
[----------] 50 tests from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN      ] NeuronLayerTest/3.TestSigmoidGradient
E0905 10:18:15.161348   263 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
E0905 10:18:15.162796   263 common.cpp:120] Cannot create Curand generator. Curand won't be available.
F0905 10:18:15.162914   263 syncedmem.hpp:18] Check failed: error == cudaSuccess (35 vs. 0)  CUDA driver version is insufficient for CUDA runtime version
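
Since this log reports zero CUDA devices together with error 35, one thing that may be worth checking inside the failing container is whether the driver library can be loaded at all. Below is a minimal sketch, assuming the library is named libcuda.so.1 and that Python is available in the image. `probe_driver_cuda_version` is a hypothetical helper name; `cuDriverGetVersion` is the CUDA driver API call it wraps.

```python
import ctypes


def probe_driver_cuda_version(library_name="libcuda.so.1"):
    """Return the CUDA version integer reported by the driver library, or None.

    None means the library could not be loaded or the query failed.
    The default library name is an assumption about the environment.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        # The driver library is not visible from this environment.
        return None
    version = ctypes.c_int(0)
    # cuDriverGetVersion(int *driverVersion) returns 0 (CUDA_SUCCESS) on success.
    status = libcuda.cuDriverGetVersion(ctypes.byref(version))
    if status != 0:
        return None
    return version.value


if __name__ == "__main__":
    found = probe_driver_cuda_version()
    if found is None:
        print("libcuda.so.1 could not be loaded or queried")
    else:
        print("driver supports CUDA %d.%d" % (found // 1000, (found % 1000) // 10))
```

If this prints a version on the host but reports that the library could not be loaded inside the container, that would suggest the driver files are not reaching the container, rather than the driver being too old.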

But running

nvidia-docker run -d -p 8080:8080 -v /home/ubuntu/data:/data beniz/deepdetect_gpu

does seem to work… It uses nvidia/cuda:7.5-cudnn4-devel as base…

What is the output of:

sudo dmesg | grep NVRM

in the base OS (i.e. not in a Docker container)?

Also, I’m not sure, but installing Docker before completing the driver install steps (e.g. blacklisting nouveau) caught my eye.

[    4.221525] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  361.42  Tue Mar 22 18:10:58 PDT 2016
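
To compare the loaded kernel module against the version nvidia-smi reports, the version number can be pulled out of that NVRM line. This is only a small sketch: the regular expression is an assumption based on the single line shown here, and `kernel_module_version` is a hypothetical helper name.

```python
import re

# Pattern inferred from the one dmesg line quoted in this thread.
NVRM_PATTERN = re.compile(
    r"NVRM: loading NVIDIA UNIX \S+ Kernel Module\s+(\d+(?:\.\d+)+)"
)


def kernel_module_version(dmesg_text):
    """Return the NVIDIA kernel module version found in dmesg output, or None."""
    match = NVRM_PATTERN.search(dmesg_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = (
        "[    4.221525] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  "
        "361.42  Tue Mar 22 18:10:58 PDT 2016"
    )
    print(kernel_module_version(line))  # prints 361.42
```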

Installing Docker before completing the driver install steps did cause a problem with getting the nvidia-docker service to start, which is why I had to start it after the driver installation.

You might have missed it, since I just recently edited my post above with new information:

nvidia-docker run -d -p 8080:8080 -v /home/ubuntu/data:/data beniz/deepdetect_gpu
nvidia-docker exec -ti 3b091aba4bd7 bash -c "export PATH=$PATH:/opt/deepdetect/build/caffe_dd/src/caffe_dd/.build_release/tools && cd /data && caffe train -solver SOLVER.prototxt -weights my-start.caffemodel"

With solver_mode: GPU set in SOLVER.prototxt, this does seem to recognize the GPU and execute on it:

INFO - 21:01:01 - Using GPUs 0
INFO - 21:01:01 - GPU 0: GRID K520

This docker uses nvidia/cuda:7.5-cudnn4-devel as base…

I’m at a loss as to what’s wrong with the DIGITS image, or the one I wrote… I’d really like to make them work, as the deepdetect image lacks Python bindings or a decent interface, so I have to resort to using Caffe from the CLI…