Dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory;

Hi,

I am trying to create my own Docker container for Tensorflow on the GPU.

My base is:

FROM nvidia/cuda:10.1-base-ubuntu18.04
LABEL authors="Lukas Heumos" \
      description="Docker image containing all requirements for running machine learning on CUDA enabled GPUs"

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    && rm -rf /var/lib/apt/lists/*

# Create a working directory and set it as default
RUN mkdir /app
RUN chmod 777 /app
WORKDIR /app

# Create a non-root user and switch to it
RUN adduser --disabled-password --gecos '' --shell /bin/bash user
RUN echo "user ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-user
USER user

# All users can use /home/user as their home directory
ENV HOME=/home/user
RUN chmod 777 /home/user

# Install Miniconda
RUN curl -so ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p ~/miniconda \
    && rm ~/miniconda.sh
ENV PATH=/home/user/miniconda/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

# Update Conda first
RUN conda update conda

And my tensorflow container is:

FROM mlflowcore/base:1.0.0

# Install the conda environment
COPY tensorflow_environment.yml .
RUN conda env create -f tensorflow_environment.yml && conda clean -a

# Activate the environment
RUN echo "source activate tensorflow-2.1-cuda-10.1" > ~/.bashrc
ENV PATH /opt/conda/envs/env/bin:$PATH

# Dump the details of the installed packages to a file for posterity
RUN conda env export --name tensorflow-2.1-cuda-10.1 > tensorflow-2.1-cuda-10.1.yml

with the environment.yml:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    # - defaults::tensorflow=2.1.0 -> distribute.MirroredStrategy API changed in 2.2, see "Custom training with tf.distribute.Strategy" (TensorFlow Core)
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2

However, when trying to run stuff on the GPU I get:

2020-04-02 09:39:54.522822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-02 09:39:54.570821: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-02 09:39:54.572960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-02 09:39:54.573491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-02 09:39:54.577211: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-02 09:39:54.578817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-02 09:39:54.579250: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2020-04-02 09:39:54.579268: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Skipping registering GPU devices...

Why does it not find that file? Where is it?

Help would be highly appreciated.
Thank you very much!
Best

The files do exist, but inside the conda installation:

/home/user/miniconda/envs/tensorflow-2.1-cuda-10.1/lib/libcudnn.so.7
/home/user/miniconda/pkgs/cudnn-7.6.5-cuda10.1_0/lib/libcudnn.so.7
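
The log above prints the LD_LIBRARY_PATH the loader searched, and neither of the two conda directories is on it. Below is a minimal Python sketch of that check. The helper name `find_shared_library` is hypothetical, and it only looks at the directories it is given, so it does not reproduce the full dynamic-loader search (ldconfig cache, rpath).

```python
import os


def find_shared_library(name, search_path):
    """Return the directories in a colon-separated search path that contain `name`.

    This only mimics the LD_LIBRARY_PATH part of the loader's lookup; it ignores
    the ldconfig cache and any rpath baked into the calling binary.
    """
    hits = []
    for directory in search_path.split(":"):
        if directory and os.path.exists(os.path.join(directory, name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Inside the container, LD_LIBRARY_PATH is the value printed in the log.
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    print("searched:", ld_path.split(":") if ld_path else [])
    print("found in:", find_shared_library("libcudnn.so.7", ld_path))
```

If the result is empty while the file sits under the conda environment's lib directory, adding that directory to LD_LIBRARY_PATH is one thing that might be worth trying.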

Hi,

Please refer to the link below, in case it helps:

Thanks

Thank you. So I added:

ENV LD_LIBRARY_PATH "$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
ENV LD_LIBRARY_PATH "$LD_LIBRARY_PATH:/usr/local/cuda-10.1/compat"

And now get:

2020-04-03 08:50:37.180018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-03 08:50:37.181533: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2020-04-03 08:50:37.181612: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 64683ab6f51a
2020-04-03 08:50:37.181642: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 64683ab6f51a
2020-04-03 08:50:37.181816: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
2020-04-03 08:50:37.181877: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.64.0
2020-04-03 08:50:37.181898: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.64.0 does not match DSO version 418.87.1 -- cannot find working devices in this configuration

So there seems to be a mismatch between the libcuda version loaded inside the container (418.87.1) and the kernel driver version (440.64.0)? How can I resolve this in a Docker container?
I’m not quite sure how to proceed here, or where to download/update to the correct versions.

Most of the solutions I found so far also suggest rebooting, but that is not really possible in a Docker container, right?
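
The two versions being compared can be pulled straight out of the diagnostic lines TensorFlow printed. This is a rough sketch, assuming the log wording stays as shown in this thread; the function names are made up for illustration.

```python
import re

# Patterns for the two diagnostic lines TensorFlow printed in the log.
LIBCUDA_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)*)")
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)*)")


def driver_versions(log_text):
    """Return (libcuda_version, kernel_version); a missing entry is None."""
    libcuda = LIBCUDA_RE.search(log_text)
    kernel = KERNEL_RE.search(log_text)
    return (
        libcuda.group(1) if libcuda else None,
        kernel.group(1) if kernel else None,
    )


def versions_match(log_text):
    """True only when both versions are present and identical."""
    libcuda, kernel = driver_versions(log_text)
    return libcuda is not None and libcuda == kernel


if __name__ == "__main__":
    sample = (
        "cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1\n"
        "cuda_diagnostics.cc:204] kernel reported version is: 440.64.0\n"
    )
    print(driver_versions(sample))
    print(versions_match(sample))
```

Running it against the container's stderr after each Dockerfile change makes it quick to see whether the libcuda picked up by TensorFlow has changed at all.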

Edit: I also tried apt-get install cuda-compat-10.2, but this doesn’t seem to upgrade libcuda either.

Edit2: Updating
FROM nvidia/cuda:10.1-base-ubuntu18.04
to
FROM nvidia/cuda:10.2-base-ubuntu18.04
doesn’t seem to help. I also changed the cudatoolkit version to 10.2, but without success.

@SunilJB

I am sorry for pinging you directly, but do you have any pointers?
I think that this is actually a solvable problem now, since it’s just a version mismatch (that I do not quite know how to resolve).

Thank you very much!

Hi,

Please refer to the support matrix below and make sure the Linux kernel version is compatible with your CUDA version:

Also refer to the topic below, in case it helps:

Thanks

I’ve been trying various combinations now and none seems to work.
The official nvidia/cuda:10.1-base-ubuntu18.04 container together with cudatoolkit=10.1 should have the corresponding versions, no?

Which combinations should work?

I’ve already looked at your table several times, but none of the combinations seem to work for me.

Hi,

Could you please check whether the driver version installed on the host system satisfies the support matrix?

Thanks

All right, I think I solved it.

This is some really weird shit:

Here is the environment.yml that WORKS:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    # We need to install tensorflow-gpu, since else we get:
    # 2020-04-10 13:08:01.736473: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
    # 2020-04-10 13:08:01.736534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.82.0
    # 2020-04-10 13:08:01.736556: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.82.0 does not match DSO version 418.87.1 -- cannot find working devices in this configuration
    - defaults::tensorflow-gpu=2.1.0
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2

Environment that does NOT work:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2
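
Comparing the two dependency lists entry by entry shows that they differ in exactly one conda package. A small Python sketch, with the lists typed in by hand from the two files above (the helper name `only_in_first` is made up for illustration):

```python
# Conda dependencies copied from the two environment files above
# (the pip section is identical in both and is left out).
working = [
    "defaults::cudatoolkit=10.1",
    "defaults::tensorflow-gpu=2.1.0",
    "conda-forge::graphviz=2.40.1",
    "conda-forge::python-graphviz=0.13.2",
    "pip",
]
failing = [
    "defaults::cudatoolkit=10.1",
    "conda-forge::graphviz=2.40.1",
    "conda-forge::python-graphviz=0.13.2",
    "pip",
]


def only_in_first(first, second):
    """Return entries of `first` that are absent from `second`, keeping order."""
    return [dep for dep in first if dep not in second]


if __name__ == "__main__":
    print("only in working:", only_in_first(working, failing))
    print("only in failing:", only_in_first(failing, working))
```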

So apparently I need to install tensorflow-gpu, which is actually deprecated in favor of the standalone tensorflow package, which supposedly includes GPU support.
Is this a TensorFlow issue, or is it an issue in your container?

Cheers
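
For anyone reproducing this, a quick way to tell the working environment from the failing one is to ask TensorFlow which GPUs it sees. This is a minimal sketch. The wrapper `visible_gpus` is a hypothetical helper, while `tf.config.list_physical_devices` is the TensorFlow call (available from 2.1 on).

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # An empty list means TensorFlow loaded but registered no GPU devices.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

Running it inside the container (started with GPU access) after each change to the environment file gives a yes/no answer without having to read through the full startup log.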