CUDA drivers insufficient

Hi there, I’m using the latest NVIDIA TensorFlow Docker image, nvcr.io/nvidia/tensorflow:19.02-py3 .
I’m trying to train some models in Python using tensorflow and Keras. When I execute my code via “python some_file.py”, python is able to build the model and train it.

However, if I declare the model in a Jupyter notebook and try to train it, I get the error “InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version”. This is quite odd, as I thought the Python kernel should be pointing to the same Python that was used to execute some_file.py, and since that works, this should as well. Can anyone help me with this please?

A few more details would help here:

  • Which driver do you have installed? cat /proc/driver/nvidia/version
  • What is the output of nvidia-smi ?
  • What is the output of echo $LD_LIBRARY_PATH ?
  • What is the output of echo $_CUDA_COMPAT_STATUS ?

Thanks,
Cliff

Hi Cliff,

It turns out that the NVIDIA drivers are indeed outdated: they are at 384, while the CUDA version in the container is 10.0. Now, this makes me curious: why am I able to run the models using “python some_file.py” then? Also, when we use nvidia-docker, does it handle the GPU drivers as well?

As for the outputs you need, here they are:

root@50105e43ccca:/workspace# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.145 Thu May 17 21:47:37 PDT 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)

root@50105e43ccca:/workspace# nvidia-smi
Sun Mar 17 08:13:57 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS…    Off  | 00000000:08:00.0 Off |                    0 |
| N/A   40C    P0    52W / 300W |    428MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS…    Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   39C    P0    51W / 300W |    428MiB / 32499MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

root@50105e43ccca:/workspace# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/lib/tensorflow

root@50105e43ccca:/workspace# echo $_CUDA_COMPAT_STATUS
CUDA Driver OK

Also, note that I am using this docker instance in a machine that has CUDA 9.0 and GPU driver 384. Could that affect my python execution (though I don’t think so as the docker is supposed to be independent right?)?

Ok great, so this is the CUDA Driver’s compatibility mode ( https://docs.nvidia.com/deploy/cuda-compatibility/index.html ), which is included in that container, working as intended – at least from the command line. DGX OS 4.x does have a later bare metal driver version (410.xx), but you’re apparently still on 3.x (no problem, that’s intended to work).
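To make the mismatch concrete, here is a minimal sketch that parses the kernel-module version out of /proc/driver/nvidia/version and compares it with the driver version the container expects. The function names are hypothetical, and the 410.48 threshold is an assumption taken from the CUDA_DRIVER_VERSION value the 19.02 container reports in its environment, not something queried from an official API.

```python
import re

# Assumption: the driver version the CUDA 10.0 container was built against,
# as reported by its CUDA_DRIVER_VERSION environment variable.
CONTAINER_DRIVER = (410, 48)


def parse_nvrm_version(text):
    """Return the kernel-module version as a tuple of ints, or None if absent."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def needs_compat_driver(host_driver, container_driver=CONTAINER_DRIVER):
    """True when the bare-metal driver is older than the container expects."""
    return host_driver < container_driver


if __name__ == "__main__":
    try:
        with open("/proc/driver/nvidia/version") as handle:
            host = parse_nvrm_version(handle.read())
    except OSError:
        host = None
    if host is None:
        print("No NVIDIA kernel module version found")
    else:
        print("host driver:", host, "compat needed:", needs_compat_driver(host))
```

On the machine in this thread the parsed host version would be older than the container's, which is the case the compat libraries in /usr/local/cuda/compat/lib are meant to cover.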

My hunch is that LD_LIBRARY_PATH isn’t getting picked up by the Jupyter kernel, which in turn disconnects the compat driver. (I might be able to fix this in 19.04; I’ll give it a try. TBH I never thought of that scenario before. :) As a workaround, can you please try starting up Jupyter this way, as in https://stackoverflow.com/questions/37890898/how-to-set-env-variable-in-jupyter-notebook ?

env LD_LIBRARY_PATH=$LD_LIBRARY_PATH jupyter notebook

Just to be sure I understand how to reproduce this: are you running jupyter notebook directly from the docker run command line, or do you run a shell in the container and then start Jupyter from inside that shell? Could you share your exact command line you use in either case?

Sure. Let me share with you the workflow and I’ll attempt the above solution that you have shared.

I keep the docker alive all the time as I want my jupyter environment to be constantly alive (to hold onto variables so that I don’t need to re-build models). The docker is accessed via ‘NV_GPU=1,3 nvidia-docker exec -it GYL_CS6216 /bin/bash’. I exit the docker each time using CTRL+P, CTRL+Q.

For the first-time setup, the container was run using: NV_GPU=1,3 nvidia-docker run -it GYL_CS6216 /bin/bash -v /home/user/project:/home/workspace/project , with port forwarding of 8080 -> 8910. This is only roughly the docker run command as I remember it, since I can’t seem to find it in my bash history at the moment.

So, I am running a docker, I enter the docker with bash, cd to the correct directory, screen, run a jupyter notebook, detach screen, then access jupyter through port 8910. In the case of running large models, I run it using python command-line, exit from the docker (CTRL+P, CTRL+Q), and come back later on to view the results (I pipe results into a text file). Now, I’d like to experiment with other models and I get thrown the error as shared when I try to execute from inside the notebook.

I’ve tried the above solution by adding “%env LD_LIBRARY_PATH=$LD_LIBRARY_PATH jupyter notebook” in the jupyter notebook but it seems that I still face the same issue.

Is it possible for me to update my GPU drivers in the docker, but not in the base system? Meaning that my machine will still have 384 while the docker is updated to 410?

I tried installing CUDA 9.0 but it seems like Jupyter still looks for cuda 10.0, despite me moving the cuda-10-0 folder to another location. Not really sure how to get out of this haha!

Is it possible for me to update my GPU drivers in the docker, but not in the base system?

So actually this is more or less what the compatibility package included in the container is trying to provide for you. Sorry for the delay replying; I haven’t yet figured out exactly how to reproduce the scenario you described (thanks for the details though). Ultimately what we need to do is get the compatibility driver libcuda.so that’s in the LD_LIBRARY_PATH (specifically, it’s in /usr/local/cuda/compat/lib ) in the container also to be seen by your Jupyter kernel. I’m thinking that maybe we can update /etc/ld.so.conf.d/* to accomplish this but that approach has other complications. Can you show what the environment is from the perspective of the Jupyter kernel? Something like print(os.environ) should help shed some light.
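A narrower alternative to dumping all of os.environ is sketched below: it reports only the variables discussed in this thread and lists which of them are unset in the kernel. The helper names and the choice of keys are assumptions made for illustration; run it in a notebook cell and, for comparison, from a python started in the container shell.

```python
import os

# Variables relevant to locating the CUDA driver libraries in this thread.
KEYS = ("LD_LIBRARY_PATH", "_CUDA_COMPAT_PATH", "_CUDA_COMPAT_STATUS", "CUDA_VERSION")


def cuda_env_report(environ=None):
    """Map each key of interest to its value, or None when it is unset."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) for key in KEYS}


def missing_keys(report):
    """List the keys of interest that are unset, in the order of KEYS."""
    return [key for key in KEYS if report.get(key) is None]


report = cuda_env_report()
for key in KEYS:
    print("{} = {!r}".format(key, report[key]))
print("unset:", missing_keys(report))
```

If LD_LIBRARY_PATH shows up as unset in the notebook but not in the shell, that would support the hunch that the kernel is not inheriting it.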

Hi Cliff,

So what I’ve done so far is pretty interesting, haha!

I’ve uninstalled CUDA 10.0, installed CUDA 9.0, and pointed the cuda folder -> cuda-9.0. I’ve also uninstalled TensorFlow 1.13 and installed TensorFlow 1.12.

Thus far, TensorFlow can run “hello world”; however, it can’t run the convolution functions (it complains that cuDNN is probably not initialized).

So I’ve downloaded cuDNN 7.5.0 and stored it in the cuda-9.0 folder. But TensorFlow still seems to complain about the cudnn.h file.

As for the environment variables, I’m going to post what I see now, from a CUDA9.0, tensorflow1.12 perspective.

{‘BASH_ENV’: ‘/etc/bash.bashrc’,
‘BAZELRC’: ‘/root/.bazelrc’,
‘CLICOLOR’: ‘1’,
‘CUBLAS_VERSION’: ‘10.0.130’,
‘CUDA_CACHE_DISABLE’: ‘1’,
‘CUDA_DRIVER_VERSION’: ‘410.48’,
‘CUDA_TOOLKIT_PATH’: ‘/usr/local/cuda’,
‘CUDA_VERSION’: ‘10.0.130’,
‘CUDNN_INSTALL_PATH’: ‘/usr/lib/x86_64-linux-gnu’,
‘CUDNN_VERSION’: ‘7.4.2.24’,
‘ENV’: ‘/etc/shinit’,
‘GIT_PAGER’: ‘cat’,
‘HOME’: ‘/root’,
‘HOROVOD_GPU_ALLREDUCE’: ‘NCCL’,
‘HOROVOD_NCCL_INCLUDE’: ‘/usr/include’,
‘HOROVOD_NCCL_LIB’: ‘/usr/lib/x86_64-linux-gnu’,
‘HOROVOD_NCCL_LINK’: ‘SHARED’,
‘HOROVOD_WITHOUT_PYTORCH’: ‘1’,
‘HOSTNAME’: ‘50105e43ccca’,
‘JPY_PARENT_PID’: ‘1222’,
‘LC_ALL’: ‘C.UTF-8’,
‘LESSCLOSE’: ‘/usr/bin/lesspipe %s %s’,
‘LESSOPEN’: ‘| /usr/bin/lesspipe %s’,
‘LIBRARY_PATH’: ‘/usr/local/cuda/lib64/stubs:’,
‘LS_COLORS’: ‘rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.jpg=01;35:.jpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:’,
‘MOFED_VERSION’: ‘4.4-1.0.0’,
‘MPLBACKEND’: ‘module://ipykernel.pylab.backend_inline’,
‘NCCL_HDR_PATH’: ‘/usr/include’,
‘NCCL_INSTALL_PATH’: ‘/usr/lib/x86_64-linux-gnu’,
‘NCCL_VERSION’: ‘2.3.7’,
‘NVIDIA_BUILD_ID’: ‘5618942’,
‘NVIDIA_DRIVER_CAPABILITIES’: ‘compute,utility,video’,
‘NVIDIA_REQUIRE_CUDA’: ‘cuda>=9.0’,
‘NVIDIA_TENSORFLOW_VERSION’: ‘19.02’,
‘NVIDIA_VISIBLE_DEVICES’: ‘all’,
‘OLDPWD’: ‘/home/workspace/cs6216_project’,
‘OMPI_MCA_btl_vader_single_copy_mechanism’: ‘none’,
‘OPENMPI_VERSION’: ‘3.1.3’,
‘PAGER’: ‘cat’,
‘PATH’: ‘/usr/local/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin’,
‘PWD’: ‘/home/workspace’,
‘PYTHONIOENCODING’: ‘utf-8’,
‘SHELL’: ‘bash’,
‘SHLVL’: ‘2’,
‘STY’: ‘81.pts-1.50105e43ccca’,
‘TENSORFLOW_VERSION’: ‘v1.13.0-rc0’,
‘TERM’: ‘xterm-color’,
‘TERMCAP’: ‘SC|screen|VT 100/ANSI X3.64 virtual terminal:\\n\t:DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\\n\t:cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\\n\t:do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\\n\t:le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\\n\t:li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\\n\t:cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\\n\t:im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\\n\t:ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\\n\t:ti=\E[?1049h:te=\E[?1049l:us=\E[4m:ue=\E[24m:so=\E[3m:\\n\t:se=\E[23m:mb=\E[5m:md=\E[1m:mh=\E[2m:mr=\E[7m:\\n\t:me=\E[m:ms:\\n\t:Co#8:pa#64:AF=\E[3%dm:AB=\E[4%dm:op=\E[39;49m:AX:\\n\t:vb=\Eg:G0:as=\E(0:ae=\E(B:\\n\t:ac=\140\140aaffggjjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~…–++,hhII00:\\n\t:po=\E[5i:pf=\E[4i:Km=\E[M:k0=\E[10~:k1=\EOP:k2=\EOQ:\\n\t:k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:\\n\t:k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:F2=\E[24~:\\n\t:F3=\E[1;2P:F4=\E[1;2Q:F5=\E[1;2R:F6=\E[1;2S:\\n\t:F7=\E[15;2~:F8=\E[17;2~:F9=\E[18;2~:FA=\E[19;2~:kb=\x7f:\\n\t:K2=\EOE:kB=\E[Z:kF=\E[1;2B:kR=\E[1;2A:*4=\E[3;2~:\\n\t:*7=\E[1;2F:#2=\E[1;2H:#3=\E[2;2~:#4=\E[1;2D:%c=\E[6;2~:\\n\t:%e=\E[5;2~:%i=\E[1;2C:kh=\E[1~:@1=\E[1~:kH=\E[4~:\\n\t:@7=\E[4~:kN=\E[6~:kP=\E[5~:kI=\E[2~:kD=\E[3~:ku=\EOA:\\n\t:kd=\EOB:kr=\EOC:kl=\EOD:km:’,
‘TF_ADJUST_HUE_FUSED’: ‘1’,
‘TF_ADJUST_SATURATION_FUSED’: ‘1’,
‘TF_AUTOTUNE_THRESHOLD’: ‘2’,
‘TF_ENABLE_WINOGRAD_NONFUSED’: ‘1’,
‘TRT_VERSION’: ‘5.0.2.6’,
‘WINDOW’: ‘0’,
‘_’: ‘/usr/local/bin/jupyter’,
‘_CUDA_COMPAT_PATH’: ‘/usr/local/cuda/compat’,
‘_CUDA_COMPAT_STATUS’: ‘CUDA Driver OK’}

Funny thing is, despite all the uninstall/reinstalling, jupyter’s environment variables still point to cuda 10.0. Though I am quite sure that since I removed cuda10.0 and put cuda9.0, as well as established the symlink from cuda -> cuda9.0, I should be on cuda9.0 right now.

So wait, is all this uninstalling and reinstalling happening inside the 19.02 container? That’s going to be a bit unpredictable. So for now I’m going to assume we’re talking about a TF 19.02 container that’s more or less the same as what we delivered, and that the experiments above referred to the bare metal. (?)

Interestingly the environment you showed there has everything I’d expect except for LD_LIBRARY_PATH. So where that one got off to, we’ll need to figure out. But if you can set it inside your notebook with os.environ (before importing TF or other things that touch the GPU) to the same thing as it said from the shell in your container listed above, then maybe we’ll be getting somewhere.

Well, I’ve just installed cuDNN via the Dev and Runtime deb files and everything seems to work. Haha. I guess I just fully downgraded to CUDA 9.0. But is there a solution for this either way, where we can use the compatibility version given that the base system’s GPU drivers are outdated?

Yes it’s all happening inside the 19.02 container. I’m not sure if that’s a good solution but at the moment, that’s what I’m doing.

I still have a copy of my 19.02 container from before I uninstalled anything (I committed a copy), so I can always revert to that. I think the main issue here is still that Jupyter can’t seem to find the compatibility drivers for the older GPU drivers.

Yes it’s all happening inside the 19.02 container. I’m not sure if that’s a good solution but at the moment, that’s what I’m doing.

Please do revert back – the only way we can really support those containers is if they are the software stack as we’ve delivered them. (That’s the whole reason for the containerization, actually.) Plus, the upstream TensorFlow build and the one that’s in our container are not the same – the one in the container is somewhat customized by NVIDIA.

Meanwhile I at least better understand now why you were seeing this error with both the original NVIDIA-built TensorFlow (a 1.13rc built on CUDA 10.0) that came with 19.02 as well as with the upstream Google-built TensorFlow 1.12 (built on CUDA 9.0): in the container running under nvidia-docker1, which is what you have on DGX OS 3.x, you need LD_LIBRARY_PATH set for not only the compat driver but also the bare-metal driver.

So getting that variable set back properly and visible to the python that runs your jupyter notebook is really key. But from what I’ve found so far, it seems that by the time the notebook is started, it’s already too late to change that variable – it is what it is. So ld.so.conf.d might be our best option if indeed after you get started all back up again it still has vanished. Give it a try with a clean-slate start back from your committed image and let’s see where that lands us.
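One way to see which libcuda a given Python process actually ended up with is to look at its memory map. The sketch below is an assumption-laden illustration, not part of the container: it relies on the Linux /proc/self/maps file, and the function names are hypothetical. Since TensorFlow most likely loads libcuda lazily, it probably needs to run after the first attempt to use the GPU.

```python
import os


def mapped_libraries(maps_text, needle="libcuda"):
    """Return sorted, unique paths from a /proc/<pid>/maps dump whose file name contains needle."""
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)  # the pathname is the optional sixth field
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if needle in os.path.basename(path):
            paths.add(path)
    return sorted(paths)


def libraries_in_this_process(needle="libcuda"):
    """Inspect the current process (Linux only); empty list if /proc is unavailable."""
    try:
        with open("/proc/self/maps") as handle:
            return mapped_libraries(handle.read(), needle)
    except OSError:
        return []


print(libraries_in_this_process())
```

A path under /usr/local/cuda/compat/lib would suggest the compat driver was picked up; a path elsewhere, or an empty list, would suggest it was not.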

Hi Cliff, could you then guide me on what to do for the ld.so.conf.d? I’m going to start a new instance based on my saved docker image now.

Before you change anything, let’s see if the jupyter you start up from the shell inside the container this time behaves any differently than the one you had before. (And if not, then this time you’ll be able to tell me exactly what the sequence of events was between when you started the container and when you executed jupyter [and with what command line flags you executed it], which might help me reproduce the problem.)

Assuming it does show the same behavior, though, then what I’m thinking is something like this, from the shell inside the container. Note you’ll need to exit Jupyter, do the following, and then start Jupyter back up again after:

echo /usr/local/cuda/compat/lib > /etc/ld.so.conf.d/00-cuda.conf
ldconfig
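To check that the change registered, the loader cache can be listed with ldconfig -p and filtered for libcuda. The sketch below does that from Python; the function names are hypothetical and the parsing assumes the usual "name (flags) => path" line layout of ldconfig -p output.

```python
import subprocess


def cached_libraries(ldconfig_output, name="libcuda.so"):
    """Return (soname, path) pairs from `ldconfig -p` output whose soname starts with name."""
    entries = []
    for line in ldconfig_output.splitlines():
        if "=>" not in line:
            continue  # skip the header line
        left, path = line.split("=>", 1)
        parts = left.split()
        if parts and parts[0].startswith(name):
            entries.append((parts[0], path.strip()))
    return entries


def query_ldconfig():
    """Run `ldconfig -p`; return an empty string if the command is unavailable."""
    try:
        result = subprocess.run(["ldconfig", "-p"], stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL, universal_newlines=True)
    except OSError:
        return ""
    return result.stdout


for soname, path in cached_libraries(query_ldconfig()):
    print(soname, "->", path)
```

If the new 00-cuda.conf entry took effect, a libcuda entry under /usr/local/cuda/compat/lib should appear in the listing.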

It seems to work now! I can run my code as per the cuda 9.0 version. Are there any other tests to check if this is stable?

as per the cuda 9.0 version

… Meaning the original 10.0-based build also now works, right?

Did you have to do the ld.so.conf.d change, or did Jupyter work right away from the clean image?

Yes, for the original 10.0-based build. I tried it after the ld.so.conf.d change. I didn’t test it with Jupyter right away from the clean image. I’ll test it again later once I’m back at my desk. Will keep you updated on it.