I am operating a few GPU servers in our team: 5 servers with 4 GPUs each, plus a few workstations with a single GPU. The GPUs in use are the following:
- GeForce GTX 1050 Ti with Max-Q Design
- GeForce GTX 1050 Ti
- GeForce RTX 2080 Ti
- GeForce GTX 1080 Ti
- TITAN Xp COLLECTORS EDITION
We use TensorFlow 1.12, 1.13, 1.14, 2.0, and 2.1, all installed via Anaconda.
Until I installed TF2.1 everything went fine. With TF2.1, most of our training software no longer works: it throws a cuDNN internal error right when the CUDA libraries are loaded. I underline that exactly the same code works fine on TF2.0! I have a minimal example that fails (see below).
More precisely, the installation I tried is tensorflow-gpu=2.1 with cudatoolkit=10.1 from the Anaconda main repos, but I also tried installing tensorflow-gpu via pip, with exactly the same result. I can reproduce this under Ubuntu 18.04 and Debian 9.12 with the cards
- GeForce GTX 1050 Ti with Max-Q Design
- GeForce GTX 1050 Ti
- GeForce RTX 2080 Ti
but on the two other cards available in our team
- GeForce GTX 1080 Ti
- TITAN Xp COLLECTORS EDITION
the very same code runs fine on installations containing the very same TF2.1/CUDA versions.
Following the discussion in https://github.com/tensorflow/tensorflow/issues/24496, I discovered a workaround that consists of allowing memory growth (see the code below).
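As a side note: the same memory-growth workaround can also be enabled without touching any code, via an environment variable that TensorFlow 2.x honors (this is documented TensorFlow behavior, not something from the issue thread; the script name below is a placeholder):

```shell
# Enable GPU memory growth for all visible GPUs without modifying the code.
# TF_FORCE_GPU_ALLOW_GROWTH is read by the TensorFlow 2.x runtime at startup.
export TF_FORCE_GPU_ALLOW_GROWTH=true
python train.py  # "train.py" stands in for the actual training script
```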
Interestingly, one of the participants in that bug report managed to work around the problem by installing the latest driver:

> Installing the latest driver (445.87) for my RTX 2080 solved this issue for me.

Unfortunately, the latest driver for Linux is not version 445.87, and after installing the latest driver available for my machine I could not see any change.
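For reference, this is how I compare driver versions across our machines (a standard nvidia-smi query, nothing specific to this bug):

```shell
# Print the GPU model and the installed driver version, one CSV row per GPU.
nvidia-smi --query-gpu=name,driver_version --format=csv
```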
My minimal example is below. Interestingly, the problem is not specific to conv2d: I can change the order of the three operations, and it is always the third one that fails. Allowing growth by adding the command line option -a makes the script finish without problems on our TF2.1 installation.
```python
import sys

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
# Note: this checks the first argument, not the argv list itself.
if gpus and len(sys.argv) > 1 and sys.argv[1].startswith("-a"):
    print("allowing growth")
    growth = True
else:
    print("nogrowth")
    growth = False

try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, growth)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
    print(e)

# Three independent GPU ops; whichever runs third is the one that fails.
tf.matmul(tf.zeros((2, 2, 2)), tf.zeros((2, 2, 2)))
tf.signal.stft(tf.zeros(3000, dtype=tf.float32), 512, 128)
tf.nn.conv2d(tf.zeros((2, 20, 20, 20), dtype=tf.float32),
             filters=tf.zeros((2, 2, 20, 20), dtype=tf.float32),
             strides=(1, 1, 1, 1), padding="VALID")
print("done")
```
The last lines of the log are as follows:
```
2020-03-06 17:06:48.920491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-06 17:06:49.029343: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-06 17:06:49.473013: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-06 17:06:49.474368: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
nogrowth
1 Physical GPUs, 1 Logical GPUs
Traceback (most recent call last):
  File "./run_cuda_con2d_last.py", line 24, in <module>
    strides=(1,1,1,1), padding="VALID")
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1914, in conv2d_v2
    name=name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2011, in conv2d
    name=name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 937, in conv2d
    _ops.raise_from_not_ok_status(e, name)
  File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]
```
I don't see a means to provide full log files here. You can find them, for various changes in the order of invocation of the operations, here
Any help would be much appreciated.