Hello,
I’m trying to run the DIGITS 4.0 Docker image on an EC2 machine using nvidia-docker.
My EC2 machine has the 361.42 NVIDIA driver up and running, and nvidia-docker connects to it fine. Using the nvidia/cuda image I was able to verify with nvidia-smi that a GPU is detected and that the driver version is indeed 361.42:
+------------------------------------------------------+
| NVIDIA-SMI 361.42     Driver Version: 361.42         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   36C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
But when running nvidia/digits, the log shows the following error:
cudaRuntimeGetVersion() failed with error #35
which seems to mean my driver version is too old for the CUDA runtime (if I understand correctly).
But 361.42 is a pretty recent release, isn’t it?
DIGITS 4 uses CUDA 7.5.18, according to its /usr/local/cuda/version.txt.
Any suggestions?
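For what it’s worth, here is a small sketch of the compatibility check I believe error #35 is about. The `driver_ok` helper is a hypothetical convenience of mine, and 352.39 is the Linux driver minimum I believe the CUDA 7.5 release notes state — treat both as assumptions, not something the error message itself reports:

```shell
# Hypothetical check: does the installed driver meet the minimum the
# CUDA runtime requires? (352.39 is the Linux minimum I believe the
# CUDA 7.5 release notes document; adjust if yours say otherwise.)
driver_ok() {
  installed="$1"; required="$2"
  # `sort -V` orders version strings numerically; if the lowest of the
  # two is the required version, the installed driver is new enough.
  [ "$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)" = "$required" ]
}

driver_ok 361.42 352.39 && echo "361.42 should satisfy CUDA 7.5"
```

By that comparison 361.42 is comfortably above the minimum, which is why the error is so confusing to me.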
What OS are you using on that instance?
How did you install the 361.42 driver?
It’s Ubuntu 15.10 (GNU/Linux 4.2.0-42-generic x86_64). This is what I did from the beginning:
$ sudo apt-get update
$ sudo apt-get install --no-install-recommends -y gcc make libc-dev
$ wget -P /tmp http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
$ sudo sh /tmp/NVIDIA-Linux-x86_64-361.42.run --silent
$ wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
$ sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
$ sudo apt-get install dkms build-essential linux-headers-generic
$ sudo nano /etc/modprobe.d/blacklist-nouveau.conf
adding the following lines:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
then save and quit.
$ echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
$ sudo update-initramfs -u
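As an aside, the interactive nano step can be scripted. This is just a sketch that writes the same blacklist to a staging path — the /tmp location is illustrative (for review before copying), the real file belongs in /etc/modprobe.d/:

```shell
# Write the nouveau blacklist non-interactively. The /tmp path is only a
# staging location; copy the file to /etc/modprobe.d/ afterwards.
conf=/tmp/blacklist-nouveau.conf
tee "$conf" >/dev/null <<'EOF'
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
EOF
grep -c '^blacklist' "$conf"   # → 2
```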
I may have had to re-run the NVIDIA installer at this stage (exactly the same two lines as before).
And finally:
$ sudo usermod -aG docker ubuntu
$ sudo service nvidia-docker start
and made sure both the docker and nvidia-docker-plugin services are up:
$ service nvidia-docker status
$ service docker status
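Beyond `service … status`, one way I sometimes probe the plugin is its REST endpoint on port 3476. The `wait_for` retry helper below is a hypothetical convenience, and the /docker/cli path is from my memory of the nvidia-docker 1.0 plugin, so treat both as assumptions:

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
wait_for() {
  tries="$1"; shift
  while [ "$tries" -gt 0 ]; do
    "$@" && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# If the nvidia-docker-plugin is healthy, this should eventually print the
# --device/--volume flags it injects into `docker run`:
# wait_for 10 curl -sf http://localhost:3476/docker/cli
```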
And as mentioned above, the nvidia/cuda container is able to run nvidia-smi, and the GPU and driver versions show up as expected…
Also, this might be related:
trying to nvidia-docker build a Dockerfile based on nvidia/cuda:7.0-cudnn4-devel-ubuntu14.04, which clones the master branch of Caffe and compiles it with cuDNN enabled, fails at the beginning of testing with the following error:
Cuda number of devices: 0
Setting to use device 0
Current device id: 0
Current device name:
Note: Randomizing tests' orders with a seed of 21847 .
[==========] Running 2081 tests from 277 test cases.
[----------] Global test environment set-up.
[----------] 50 tests from NeuronLayerTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] NeuronLayerTest/3.TestSigmoidGradient
E0905 10:18:15.161348 263 common.cpp:113] Cannot create Cublas handle. Cublas won't be available.
E0905 10:18:15.162796 263 common.cpp:120] Cannot create Curand generator. Curand won't be available.
F0905 10:18:15.162914 263 syncedmem.hpp:18] Check failed: error == cudaSuccess (35 vs. 0) CUDA driver version is insufficient for CUDA runtime version
But running
nvidia-docker run -d -p 8080:8080 -v /home/ubuntu/data:/data beniz/deepdetect_gpu
does seem to work… It uses nvidia/cuda:7.5-cudnn4-devel as its base image…
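Notably, the failing build above reports “Cuda number of devices: 0”. Here is a sketch of a guard I would put in front of CUDA-dependent test steps — the `have_gpu` helper, and the assumption that the /dev/nvidia* device nodes are simply absent at image-build time, are both mine, not anything the Caffe build itself does:

```shell
# Hypothetical guard: only run GPU tests when NVIDIA device nodes are
# visible. The directory argument defaults to /dev and exists mainly so
# the helper can be exercised against a scratch directory.
have_gpu() {
  dir="${1:-/dev}"
  ls "$dir"/nvidia[0-9]* >/dev/null 2>&1
}

if have_gpu; then
  echo "GPU device nodes visible: safe to run CUDA tests"
else
  echo "no GPU device nodes (e.g. at image-build time): skipping CUDA tests"
fi
```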
What is the output of:
sudo dmesg | grep NVRM
in the base OS (i.e. not in a docker container)?
Also, I’m not sure, but installing docker before completing the driver install steps (e.g. the nouveau blacklist) is something that caught my eye.
[ 4.221525] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 361.42 Tue Mar 22 18:10:58 PDT 2016
Installing docker before completing the driver install steps did cause a problem getting the nvidia-docker service to start, which is why I had to start it after the driver installation.
You might have missed it, since I only recently edited my post above with new information:
nvidia-docker run -d -p 8080:8080 -v /home/ubuntu/data:/data beniz/deepdetect_gpu
nvidia-docker exec -ti 3b091aba4bd7 bash -c "export PATH=$PATH:/opt/deepdetect/build/caffe_dd/src/caffe_dd/.build_release/tools && cd /data && caffe train -solver SOLVER.prototxt -weights my-start.caffemodel"
with SOLVER.prototxt having solver_mode: GPU in it,
does seem to recognize the GPU and execute on it:
INFO - 21:01:01 - Using GPUs 0
INFO - 21:01:01 - GPU 0: GRID K520
This docker image uses nvidia/cuda:7.5-cudnn4-devel as its base…
I’m at a loss as to what’s wrong with the DIGITS docker image, or with the one I wrote… I’d really like to get them working, since the deepdetect image lacks Python bindings or a decent interface, forcing me to fall back to using Caffe from the CLI…