CUDA drivers insufficient

So I started another container from the last saved image (the 19.02 image edited with Jupyter Notebook, before I downgraded to CUDA 9.0). I did not use the two commands you shared, but somehow TensorFlow works in the Jupyter notebook.

Not really sure what’s going on here. Did any of my other edits affect this container?

I could try a completely clean run where I start a container from the original 19.02 image, install Jupyter Notebook and the like, and then run my code.

Alright, so I started yet another container from the image, and it seems like it can now run straight from the notebook. Really puzzling why it’s behaving like this. Does the ldconfig affect the DGX system-wide? I used the following workflow to access my work.

  1. NV_GPU=0,2 nvidia-docker run -it --rm -v /home/e0146498/cs6216_project/:/home/workspace/ --name CUDA_TFLOW_TEST_CONTAINER -p 8910:8080 nvcr.io/nvidia/tensorflow:19.02-py3

  2. pip install keras

  3. jupyter notebook --ip=0.0.0.0 --port=8080 --no-browser --allow-root

  4. Access notebook via <IP_ADDRESS>:8910

  5. Run Keras code (see the quick check below)
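
For step 5, a minimal first-cell check like this (assuming the TF 1.x API that the 19.02 container ships) confirms whether the runtime and driver actually load before running the real code:

  # Quick sanity check for step 5: does the TF runtime load and see a GPU?
  import tensorflow as tf
  print(tf.VERSION)                    # TF 1.x in the 19.02 container
  print(tf.test.is_gpu_available())    # True only if libcuda.so.1 could be loaded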

Previously, I would get stopped at step 5, where Jupyter would complain that my CUDA drivers are insufficient. That prompted me to ask here in this thread, since I could still run my Keras code from the command line with "python code.py".
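
For reference, this is the kind of comparison one could run in both the notebook and a plain Python shell to see whether the kernel’s environment differs from the shell’s (standard-library calls only, nothing container-specific assumed):

  # Compare what the Jupyter kernel sees vs. what "python code.py" sees
  import ctypes
  import os
  print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))
  try:
      ctypes.CDLL("libcuda.so.1")   # same dlopen lookup TensorFlow relies on
      print("libcuda.so.1 loads")
  except OSError as exc:
      print("libcuda.so.1 failed to load:", exc)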

I used the following workflow to access my work.

Okay great, that’s basically the same as what I’d tried.

I started yet another container from the image, and it seems like it can now run straight from the notebook. Really puzzling why it’s behaving like this. Does the ldconfig affect the DGX system-wide?

So that’s what I’d seen when attempting to reproduce this as well. I’m not sure what happened in your previous Jupyter session to blow away the LD_LIBRARY_PATH, but I’m glad that restarting has worked for you. I’ll try to add some defensive guards for this in 19.04 anyway just to avoid future problems for people.
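
Until then, a crude guard you could put at the very top of a notebook is to pre-load the compat driver by absolute path before importing TensorFlow, so the import no longer depends on LD_LIBRARY_PATH staying intact. This is only a sketch, and the /usr/local/cuda/compat location is an assumption about where the container keeps the compatibility library, so adjust the glob if yours lives elsewhere:

  # Pre-load libcuda.so.1 by absolute path so a lost LD_LIBRARY_PATH
  # no longer breaks "import tensorflow" (compat path below is an assumption).
  import ctypes
  import glob
  hits = glob.glob("/usr/local/cuda/compat/**/libcuda.so.1", recursive=True)
  if hits:
      ctypes.CDLL(hits[0], mode=ctypes.RTLD_GLOBAL)  # already-loaded soname satisfies the dependency
  import tensorflow as tf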

Does the ldconfig affect the DGX system-wide?

No, those kinds of user-space changes inside the container are isolated to the container.

Thanks for the feedback, and glad you got it working.

Best,
Cliff

Hi Cliff,

Do you have any idea why it works now, though? I’m not really sure why the compatibility drivers now work when I start a new container from the 19.02 image. Haha. I’m just asking because I need the container to be stable and I don’t want it to break halfway through. Really curious case though!

Well, it was supposed to work from the beginning, so having it get back to working after restarting the container is in some sense reassuring. :) As to why it had stopped working, I’m afraid I don’t have enough insight into how the broken Jupyter instance got into that state. If you do happen to run into the issue again, please do let me know.

Alright, understood! By the way, are there any guides/diagrams showing which services/programs/drivers are shared with containers and which aren’t? To my understanding, the GPU drivers are shared with the base system, right? Meaning that if we update them in the container, the base system’s GPU drivers also get updated?

Well, we might have called it too early. I’ve now created a new permanent container (without --rm in the command I shared earlier), and I’m hitting an issue when importing Keras/TensorFlow. Could this be affected by someone else using the GPUs I want to use? Either way, I get the following error in Jupyter.

Using TensorFlow backend.

ImportError                               Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     57
---> 58 from tensorflow.python.pywrap_tensorflow_internal import *
     59 from tensorflow.python.pywrap_tensorflow_internal import __version__

/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py in <module>()
     27     return _mod
---> 28 _pywrap_tensorflow_internal = swig_import_helper()
     29 del swig_import_helper

/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py in swig_import_helper()
     23     try:
---> 24       _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
     25     finally:

/usr/lib/python3.5/imp.py in load_module(name, file, filename, details)
    241     else:
--> 242         return load_dynamic(name, filename, file)
    243 elif type_ == PKG_DIRECTORY:

/usr/lib/python3.5/imp.py in load_dynamic(name, path, file)
    341         name=name, loader=loader, origin=path)
--> 342     return _load(spec)
    343

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 from Code.datasets import load_mnist, load_fashion_mnist, load_cifar10
      2 from time import time
      3 import numpy as np
      4 import keras.backend as K
      5 from keras.engine.topology import Layer, InputSpec

/home/workspace/cs6216_project/Code/datasets.py in <module>()
      1 import numpy as np
----> 2 from keras.datasets import mnist, fashion_mnist, cifar10
      3
      4 def load_mnist():
      5     # the data, shuffled and split between train and test sets

/usr/local/lib/python3.5/dist-packages/keras/__init__.py in <module>()
      1 from __future__ import absolute_import
      2
----> 3 from . import utils
      4 from . import activations
      5 from . import applications

/usr/local/lib/python3.5/dist-packages/keras/utils/__init__.py in <module>()
      4 from . import data_utils
      5 from . import io_utils
----> 6 from . import conv_utils
      7
      8 # Globally-importable utils.

/usr/local/lib/python3.5/dist-packages/keras/utils/conv_utils.py in <module>()
      7 from six.moves import range
      8 import numpy as np
----> 9 from .. import backend as K
     10
     11

/usr/local/lib/python3.5/dist-packages/keras/backend/__init__.py in <module>()
     87 elif _BACKEND == 'tensorflow':
     88     sys.stderr.write('Using TensorFlow backend.\n')
---> 89     from .tensorflow_backend import *
     90 else:
     91     # Try and load external backend.

/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py in <module>()
      3 from __future__ import print_function
      4
----> 5 import tensorflow as tf
      6 from tensorflow.python.framework import ops as tf_ops
      7 from tensorflow.python.training import moving_averages

/usr/local/lib/python3.5/dist-packages/tensorflow/__init__.py in <module>()
     22
     23 # pylint: disable=g-bad-import-order
---> 24 from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
     25
     26 from tensorflow._api.v1 import app

/usr/local/lib/python3.5/dist-packages/tensorflow/python/__init__.py in <module>()
     47 import numpy as np
     48
---> 49 from tensorflow.python import pywrap_tensorflow
     50
     51 # Protocol buffers

/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     72 for some common reasons and solutions.  Include the entire stack trace
     73 above this error message when asking for help.""" % traceback.format_exc()
---> 74   raise ImportError(msg)
     75
     76 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.

By the way, are there any guides/diagrams showing which services/programs/drivers are shared with containers and which aren’t?

In short, it’s only kernel-mode things that are usually shared. As a special exception, with nvidia-docker the user-mode portions of our driver are also mounted into the container from the bare metal, since the versions of the user- and kernel-mode portions (usually) have to match. The exception is the compatibility mode we’re talking through here, where they don’t have to match exactly.
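
If you want to see that wiring from inside a container, a quick inspection along these lines does the trick (the library paths below are typical mount points and may differ between driver versions and distros):

  # Show the host driver version and the user-mode driver files mounted into the container
  import glob
  import subprocess
  print(subprocess.check_output(
      ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]).decode().strip())
  for path in glob.glob("/usr/lib/x86_64-linux-gnu/libcuda.so*") + \
              glob.glob("/usr/local/cuda/compat/**/libcuda*", recursive=True):
      print(path)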

  File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.

Can you show me what command you used to start the container that time? It sounds like maybe you forgot nvidia-docker run? But this is a different symptom than we saw before.
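
One quick way to tell the two situations apart from inside the container (no GPU runtime mounted at all, versus the loader simply not finding libcuda) is a check like this; it’s only a diagnostic sketch:

  # Distinguish "GPUs not mounted into the container" from "loader can't find libcuda"
  import glob
  import os
  print("GPU device nodes:", glob.glob("/dev/nvidia*"))          # empty if started without nvidia-docker
  print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))   # should include the CUDA/compat directories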

I ran this command to start a new container:

NV_GPU=0,2 nvidia-docker run -it -v /home/<user_id>/project/:/home/workspace/ --name GYL_CUDA10 -p 8910:8080 nvcr.io/nvidia/tensorflow:19.02-py3

From my understanding, my colleague did some other pip installs, but using pip install --user (this was done on the base system, as he does not use Docker). So I’m not sure if that changed anything.

So I confirmed with him that he installed a BERT pre-training package for PyTorch. But that was done in a virtualenv, so technically it shouldn’t affect anything.

Just by way of update:

We do have a few people internally who are seeing the existing LD_LIBRARY_PATH method of wiring up the compat library fail sporadically, so I’ll be following up with them to try to understand how that’s possible. I strongly suspect it’s the same basic thing you were seeing here.

Meanwhile, for our 19.04 release, we’ve added some additional bits that allow falling back to a pre-warmed ld.so.cache entry for the compat library in case LD_LIBRARY_PATH has gotten unset. If you still see this after 19.04 is released in a few weeks, please reopen this thread.
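
Once that fallback is in, a check along these lines should still find the library even with LD_LIBRARY_PATH stripped; it’s only a sketch that asks ldconfig’s cache whether libcuda is registered:

  # The fallback relies on ld.so.cache knowing about the compat library,
  # so this should print a libcuda entry even if LD_LIBRARY_PATH is empty.
  import ctypes.util
  print(ctypes.util.find_library("cuda"))   # consults ldconfig's cache; None means libcuda isn't registered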

Thanks!