Dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory;

lukas.heumos · April 2, 2020, 10:09am

Hi,

I am trying to create my own Docker container for Tensorflow on the GPU.

My base is:

FROM nvidia/cuda:10.1-base-ubuntu18.04
LABEL authors="Lukas Heumos" \
      description="Docker image containing all requirements for running machine learning on CUDA enabled GPUs"

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    wget \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    && rm -rf /var/lib/apt/lists/*

# Create a working directory and set it as default
RUN mkdir /app
RUN chmod 777 /app
WORKDIR /app

# Create a non-root user and switch to it
RUN adduser --disabled-password --gecos '' --shell /bin/bash user
RUN echo "user ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/90-user
USER user

# All users can use /home/user as their home directory
ENV HOME=/home/user
RUN chmod 777 /home/user

# Install Miniconda
RUN curl -so ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh \
    && chmod +x ~/miniconda.sh \
    && ~/miniconda.sh -b -p ~/miniconda \
    && rm ~/miniconda.sh
ENV PATH=/home/user/miniconda/bin:$PATH
ENV CONDA_AUTO_UPDATE_CONDA=false

# Update Conda first
RUN conda update conda

And my tensorflow container is:

FROM mlflowcore/base:1.0.0

# Install the conda environment
COPY tensorflow_environment.yml .
RUN conda env create -f tensorflow_environment.yml && conda clean -a

# Activate the environment
RUN echo "source activate tensorflow-2.1-cuda-10.1" > ~/.bashrc
ENV PATH /opt/conda/envs/env/bin:$PATH

# Dump the details of the installed packages to a file for posterity
RUN conda env export --name tensorflow-2.1-cuda-10.1 > tensorflow-2.1-cuda-10.1.yml

with the environment.yml:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    #- defaults::tensorflow=2.1.0 -> distribute.MirroredStrategy API changed in 2.2 -> Custom training with tf.distribute.Strategy | TensorFlow Core
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2

However, when trying to run stuff on the GPU I get:

2020-04-02 09:39:54.522822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-02 09:39:54.570821: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-02 09:39:54.572960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-02 09:39:54.573491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-02 09:39:54.577211: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-02 09:39:54.578817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-02 09:39:54.579250: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
2020-04-02 09:39:54.579268: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Skipping registering GPU devices...
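
The warning above comes from a plain dlopen call, so the failure can be reproduced without TensorFlow. A minimal sketch, assuming Python is available in the container; `can_dlopen` is a hypothetical helper name, and `ctypes.CDLL` goes through the same system loader (LD_LIBRARY_PATH, the ld.so cache, default directories):

```python
import ctypes


def can_dlopen(lib_name):
    """Return True if the dynamic loader can open lib_name, False otherwise."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the loader cannot find or load the shared object.
        return False


if __name__ == "__main__":
    # Library names taken from the log above.
    for name in ("libcudart.so.10.1", "libcudnn.so.7"):
        print(name, "loadable" if can_dlopen(name) else "NOT loadable")
```

If `libcudnn.so.7` is reported as not loadable here as well, the problem lies in the loader search path rather than in TensorFlow itself.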

Why does it not find that file? Where is it?

Help would be highly appreciated.
Thank you very much!
Best

lukas.heumos · April 2, 2020, 10:26am

They do exist somewhere:

/home/user/miniconda/envs/tensorflow-2.1-cuda-10.1/lib/libcudnn.so.7
/home/user/miniconda/pkgs/cudnn-7.6.5-cuda10.1_0/lib/libcudnn.so.7
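
Neither of these directories appears in the LD_LIBRARY_PATH printed in the warning. A minimal sketch for checking which search-path entries actually contain a given library; `find_on_path` is a hypothetical helper, and it only inspects LD_LIBRARY_PATH (the real loader also consults the ld.so cache and default directories, which are ignored here):

```python
import os


def find_on_path(lib_name, search_path):
    """Return the directories of a colon-separated search path that contain lib_name."""
    hits = []
    for directory in search_path.split(":"):
        # Skip empty entries produced by leading, trailing or doubled colons.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    found = find_on_path("libcudnn.so.7", path)
    print(found if found else "libcudnn.so.7 is not in any LD_LIBRARY_PATH directory")
```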

SunilJB · April 3, 2020, 5:56am

Hi,

Please refer to below link, in case it helps:

Thanks

lukas.heumos · April 3, 2020, 8:51am

Thank you. So I added:

ENV LD_LIBRARY_PATH "$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
ENV LD_LIBRARY_PATH "$LD_LIBRARY_PATH:/usr/local/cuda-10.1/compat"

And now get:

2020-04-03 08:50:37.180018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-03 08:50:37.181533: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2020-04-03 08:50:37.181612: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 64683ab6f51a
2020-04-03 08:50:37.181642: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 64683ab6f51a
2020-04-03 08:50:37.181816: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
2020-04-03 08:50:37.181877: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.64.0
2020-04-03 08:50:37.181898: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.64.0 does not match DSO version 418.87.1 -- cannot find working devices in this configuration
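
The diagnostic lines above report two numbers: the version of the user-space libcuda that was loaded and the version of the kernel driver module. The error is raised when they differ. A small sketch that extracts both from such a log so they can be compared; `driver_versions` is a hypothetical helper, and the regular expression assumes the message wording shown above:

```python
import re


def driver_versions(log_text):
    """Extract (libcuda_version, kernel_version) from TensorFlow's CUDA diagnostics."""
    def grab(label):
        match = re.search(label + r" reported version is: (\d+(?:\.\d+)*)", log_text)
        return match.group(1) if match else None

    return grab("libcuda"), grab("kernel")


if __name__ == "__main__":
    # Shortened copies of the two diagnostic lines from the log above.
    sample = (
        "libcuda reported version is: 418.87.1\n"
        "kernel reported version is: 440.64.0\n"
    )
    user_space, kernel = driver_versions(sample)
    if user_space == kernel:
        print("versions match:", kernel)
    else:
        print("mismatch: libcuda", user_space, "vs kernel", kernel)
```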

lukas.heumos · April 3, 2020, 8:55am

So there seems to be a mismatch between the CUDA and kernel versions? How can I resolve this in a Docker container?
I am not quite sure how to proceed here, or where to download or update to the correct versions.

Most of the solutions I have found so far also suggest rebooting, but this is not really possible in a Docker container, right?

Edit: apt-get install cuda-compat-10.2
This also doesn’t seem to upgrade libcuda?

Edit2: Updating
FROM nvidia/cuda:10.1-base-ubuntu18.04
to
FROM nvidia/cuda:10.2-base-ubuntu18.04
doesn’t seem to help. I also changed the cudatoolkit version to 10.2, but with no success?

lukas.heumos · April 10, 2020, 8:12am

@SunilJB

I am sorry for pinging you directly, but do you have any pointers?
I think that this is actually a solvable problem now, since it’s just a version mismatch (that I do not quite know how to resolve).

Thank you very much!

SunilJB · April 10, 2020, 9:17am

Hi,

Please refer to the support matrix below and make sure the Linux kernel version is compatible with your CUDA version:

Also refer to the topic below, in case it helps:

Thanks

lukas.heumos · April 10, 2020, 10:16am

I’ve been trying various combinations now and none seems to work.
The official nvidia/cuda:10.1-base-ubuntu18.04 container together with cudatoolkit=10.1 should have matching versions, no?

Which combinations should work?

I have already looked at your table several times, but none of the combinations seems to work for me.

SunilJB · April 10, 2020, 1:07pm

Hi,

Could you please check whether the driver version installed on the host system satisfies the support matrix?

Thanks

lukas.heumos · April 10, 2020, 1:12pm

All right, I think I solved it.

This is some really weird shit:

This environment.yml WORKS:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    # We need to install tensorflow-gpu, since else we get:
    # 2020-04-10 13:08:01.736473: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
    # 2020-04-10 13:08:01.736534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.82.0
    # 2020-04-10 13:08:01.736556: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.82.0 does not match DSO version 418.87.1 -- cannot find working devices in this configuration
    - defaults::tensorflow-gpu=2.1.0
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2

Environment that does NOT work:

name: tensorflow-2.1-cuda-10.1
channels:
    - conda-forge
    - defaults
dependencies:
    - defaults::cudatoolkit=10.1
    - conda-forge::graphviz=2.40.1
    - conda-forge::python-graphviz=0.13.2
    - pip
    - pip:
      - tensorflow==2.2.0rc2
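
Either environment can be checked the same way from inside the container. A minimal sketch, assuming TensorFlow 2.1 or newer (where `tf.config.list_physical_devices` is available); `visible_gpus` is a hypothetical helper name:

```python
def visible_gpus():
    """Return the GPU devices TensorFlow registered, or [] if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return list(tf.config.list_physical_devices("GPU"))


if __name__ == "__main__":
    gpus = visible_gpus()
    print("GPUs registered by TensorFlow:", gpus)
```

An empty list means no GPU was registered (or TensorFlow could not be imported), which matches the "Skipping registering GPU devices" case in the first log.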

So apparently I need to install tensorflow-gpu, even though it is deprecated in favor of the standalone tensorflow package, which supposedly contains GPU support.
Is this a tensorflow issue or is that an issue in your container?

Cheers
