Can't get cuda:10.0 docker container to run with tensorflow-gpu

Hi, I am using a standard Dockerfile from NVIDIA, so my Dockerfile starts like this:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04

which should give me CUDA 10.0 in the image. I also have a line in the Dockerfile to install the specific Keras and tensorflow-gpu versions I want:

RUN pip install keras==2.2.4 tensorflow-gpu==1.11.0

However, after the container is built and I try to run it, I get this error:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/root/miniconda3/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/root/miniconda3/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/root/miniconda3/lib/python3.6/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/app/inference/common_run.py", line 1, in <module>
    from app import app
  File "/app/inference/app/__init__.py", line 7, in <module>
    from app.inference_classification import inference_classification
  File "/app/inference/app/inference_classification.py", line 6, in <module>
    import keras
  File "/root/miniconda3/lib/python3.6/site-packages/keras/__init__.py", line 3, in <module>
    from . import utils
  File "/root/miniconda3/lib/python3.6/site-packages/keras/utils/__init__.py", line 6, in <module>
    from . import conv_utils
  File "/root/miniconda3/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9, in <module>
    from .. import backend as K
  File "/root/miniconda3/lib/python3.6/site-packages/keras/backend/__init__.py", line 89, in <module>
    from .tensorflow_backend import *
  File "/root/miniconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5, in <module>
    import tensorflow as tf
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/__init__.py", line 22, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/root/miniconda3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/root/miniconda3/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/root/miniconda3/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

The main part of the error, I assume, is the "ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory" line. I'm not sure why it's looking for CUDA 9.0 when the container is clearly built from the cuda:10.0 base image. If I run nvidia-smi on the machine that the container is supposed to run on, it looks fine:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000E7D7:00:00.0 Off |                  Off |
| N/A   46C    P0    52W / 149W |      0MiB / 12206MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

so I’m not sure what’s going wrong here. Any help would be much appreciated!

Note - on a separate non-Docker GPU box where the GPU works, I have keras==2.2.4, tensorflow-gpu==1.11.0, and CUDA 10.0, which is why I think that combination should work here as well.

The tensorflow-gpu==1.11.0 package was built against CUDA 9.0 and requires the system to provide the appropriate libs.
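One way to confirm that is to run ldd on TensorFlow's compiled extension inside the container (the path below is taken from your traceback; adjust it if your layout or the filename differs):

ldd /root/miniconda3/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so | grep libcublas

In your CUDA 10.0 image I would expect that to report libcublas.so.9.0 as "not found".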

For the non-Docker system, is it possible that CUDA 9.0 is installed alongside 10.0? On that system, check your TensorFlow output for lines like "Successfully opened libcublas.so.X". If X is 9.0, then it is finding CUDA 9 libs somewhere.
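A quick way to see which libcublas versions that box can find is to ask the dynamic linker cache, and to check whether LD_LIBRARY_PATH is pulling in one of the extra CUDA installs:

ldconfig -p | grep libcublas
echo $LD_LIBRARY_PATH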

Thank you for your response! You are right: on my non-Docker system, it appears there are actually three versions of CUDA installed:

(base) VRC\robert.harris@devaitrn01:/usr/local$ ls
cuda cuda-10.0 cuda-10.1 cuda-9.0 etc games include lib man openssl sbin share src

This must have happened when I was trying many different things to set up the GPU initially. It works fine now, so I hadn't touched the setup in a while.
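For what it's worth, to see which install the bare cuda directory points at (assuming it is the usual symlink the CUDA installer creates), I can run:

readlink -f /usr/local/cuda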

If tensorflow-gpu==1.11.0 requires CUDA 9 to run properly, is there any way to install CUDA 9 via a Dockerfile command after the initial image is pulled from NVIDIA (via FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04)?
I ask because the nvidia/cuda:10.0 part is at the start of our reference container, from which dozens of other containers are branched. If I could just add CUDA 9 via a separate command in my one test container for now, as in the sketch below, that would be ideal. If not, should I simply change the beginning of my reference container to FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04?
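Something like this is what I had in mind, though the package names are just my guess at how NVIDIA splits the CUDA 9 apt packages, so I don't know whether it actually works:

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cublas-9-0 \
        cuda-cudart-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
    && rm -rf /var/lib/apt/lists/*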

Thanks much again,
Robert

It sounds like in your case changing the FROM line to nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 would be the cleanest solution. You could also update TF instead: 1.13 and later were built against CUDA 10.0, so a newer tensorflow-gpu would match your existing base image.
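As a minimal sketch of the two options (the version pins are based on TensorFlow's published CUDA compatibility notes, so double-check them against your code; the comment lines stand in for your existing miniconda/pip setup steps):

# Option 1: match the base image to tensorflow-gpu 1.11
FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# ... your existing setup steps ...
RUN pip install keras==2.2.4 tensorflow-gpu==1.11.0

# Option 2: keep CUDA 10.0 and move to a TF built against it
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# ... your existing setup steps ...
RUN pip install keras==2.2.4 tensorflow-gpu==1.13.1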