Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

Hello

I am operating the GPU servers in our team: we have 5 servers with 4 GPUs each
and a few workstations with individual GPUs. The GPUs in use are the following:

   GeForce GTX 1050 Ti with Max-Q Design
   GeForce GTX 1050 Ti
   GeForce RTX 2080 Ti
   GeForce GTX 1080 Ti
   TITAN Xp COLLECTORS EDITION

We use TensorFlow 1.12, 1.13, 1.14, 2.0, and 2.1, all installed via Anaconda.
Until I installed TF 2.1, everything went fine. With TF 2.1, most of our training software does not work: it throws the cuDNN internal error directly when the CUDA libraries are loaded. I stress that exactly the same code works fine on TF 2.0! A minimal example that fails
is given below.

More precisely, the installation I tried is tensorflow-gpu=2.1 with cudatoolkit=10.1 from the Anaconda main repos, but installing tensorflow-gpu via pip gives exactly the same result. I can reproduce this under Ubuntu 18.04 and Debian 9.12 with the cards

   GeForce GTX 1050 Ti with Max-Q Design
   GeForce GTX 1050 Ti
   GeForce RTX 2080 Ti

but on the two other cards available in our team

  GeForce GTX 1080 Ti
  TITAN Xp COLLECTORS EDITION

the very same code runs fine on installations containing the very same TF 2.1/CUDA versions.

Following the discussion in https://github.com/tensorflow/tensorflow/issues/24496,
I discovered a workaround that consists in allowing memory growth (see the code below).
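For completeness, the same workaround can, as far as I know, also be enabled without touching the code, via an environment variable (assumption on my part: TF 2.x reads TF_FORCE_GPU_ALLOW_GROWTH when it initializes the GPU allocator):

```python
import os

# Assumption: TF 2.x honors TF_FORCE_GPU_ALLOW_GROWTH at initialization,
# equivalent to calling set_memory_growth(True) on every GPU.
# It must be set before TensorFlow is imported.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # imported afterwards, as in the minimal script below
```

Setting the variable in the shell before launching the script should work equally well.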

Interestingly, one of the participants in that bug report managed to work around the problem by installing the latest driver:

Installing the latest driver (445.87) for my RTX 2080 solved this issue for me.

Unfortunately, the latest Linux driver is not version 445.87, and after installing the latest driver available for my machine I did not see any change.

My minimal example is below. Interestingly, the problem is not specific to conv2d: I can change the order of the three operations, and it is always the third one that fails. Allowing memory growth by adding the command-line option -a makes the script finish without problems on our TF 2.1 installation.

import sys
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus and len(sys.argv) > 1 and sys.argv[1].startswith("-a"):
    print("allowing growth")
    growth = True
else:
    print("nogrowth")
    growth = False

try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, growth)
    # query logical devices once, after all GPUs have been configured
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
    print(e)
    
tf.matmul(tf.zeros((2, 2, 2)), tf.zeros((2, 2, 2)))
tf.signal.stft(tf.zeros(3000, dtype=tf.float32), 512, 128)
tf.nn.conv2d(tf.zeros((2, 20, 20, 20), dtype=tf.float32),
             filters=tf.zeros((2, 2, 20, 20), dtype=tf.float32),
             strides=(1, 1, 1, 1), padding="VALID")
print("done")

The last lines of the log are as follows:

2020-03-06 17:06:48.920491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-06 17:06:49.029343: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-06 17:06:49.473013: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-06 17:06:49.474368: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
nogrowth
1 Physical GPUs, 1 Logical GPUs
Traceback (most recent call last):
  File "./run_cuda_con2d_last.py", line 24, in <module>
    strides=(1,1,1,1), padding="VALID")
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1914, in conv2d_v2
    name=name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2011, in conv2d
    name=name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 937, in conv2d
    _ops.raise_from_not_ok_status(e, name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

I don't see a way to provide full log files here. You can find them, for various changes in the order of invocation of the operations, here

Any help would be much appreciated
Axel

Just to prevent anybody from believing that my installation is broken:

I just repeated the test with the official tensorflow/tensorflow:2.1.0-gpu-py3 Docker image.
It shows exactly the same problem.
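For reference, this is roughly how the test can be run in that image (assumptions on my part: Docker >= 19.03 with the NVIDIA container toolkit, so that --gpus is available, and the minimal script sitting in the current directory):

```shell
# Assumption: Docker >= 19.03 with nvidia-container-toolkit installed,
# so that "--gpus all" exposes the host GPUs inside the container.
docker run --gpus all --rm -v "$PWD":/work -w /work \
    tensorflow/tensorflow:2.1.0-gpu-py3 \
    python run_cuda_con2d_last.py
```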

Thanks
Axel