I am operating the GPU servers for our team: five servers with 4 GPUs each, plus a few workstations with a single GPU. The GPU models in use are the following:
GeForce GTX 1050 Ti with Max-Q Design
GeForce GTX 1050 Ti
GeForce RTX 2080 Ti
GeForce GTX 1080 Ti
TITAN Xp COLLECTORS EDITION
We use TensorFlow 1.12, 1.13, 1.14, 2.0 and 2.1, all installed via Anaconda.
Until I installed TF 2.1 everything went fine. With TF 2.1, most of our training software no longer works: it throws a cuDNN internal error right when the CUDA libraries are loaded. I underline that exactly the same code works fine on TF 2.0! A minimal example that fails is given below.
More precisely, the installation I tried is tensorflow-gpu=2.1 with cudatoolkit=10.1 from the Anaconda main repos, but I also tried installing tensorflow-gpu via pip, with exactly the same result. I can reproduce this under Ubuntu 18.04 and Debian 9.12 with the following cards:
GeForce GTX 1050 Ti with Max-Q Design
GeForce GTX 1050 Ti
GeForce RTX 2080 Ti
but on the two other cards available in our team
GeForce GTX 1080 Ti
TITAN Xp COLLECTORS EDITION
the very same code runs fine on installations containing the very same TF 2.1/CUDA versions.
Interestingly, one of the people in the bug report thread managed to work around the problem by installing the latest driver:
Installing the latest driver (445.87) for my RTX 2080 solved this issue for me.
Unfortunately, the latest driver available for Linux is not version 445.87, and after installing the latest driver available for my machine I could not see any change.
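For reference, this is roughly how I check what each environment actually sees (a quick sketch; it assumes nvidia-smi is on the PATH and uses tf.test.is_built_with_cuda(), which as far as I know is available in TF 2.1):

import subprocess

import tensorflow as tf

# Print the TF version and whether this build was compiled against CUDA.
print(tf.__version__, tf.test.is_built_with_cuda())

# Query the installed NVIDIA driver version per GPU (assumes nvidia-smi is on the PATH).
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"]
).decode())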
My minimal example is below. Interestingly, the problem is not specific to conv2d: I can change the order of these three operations and it is always the third one that fails. Allowing memory growth, by adding the command line option -a, makes the script finish without problems on our TF 2.1 installation (see also the note after the script).
import sys

import tensorflow as tf

# Enable memory growth only when the script is called with -a.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus and len(sys.argv) > 1 and sys.argv[1].startswith("-a"):
    print("allowing growth")
    growth = True
else:
    print("nogrowth")
    growth = False

try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, growth)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
    # Memory growth must be set before the GPUs have been initialized.
    print(e)

# Three unrelated GPU ops; whichever runs third is the one that fails.
tf.matmul(tf.zeros((2, 2, 2)), tf.zeros((2, 2, 2)))
tf.signal.stft(tf.zeros(3000, dtype=tf.float32), 512, 128)
tf.nn.conv2d(tf.zeros((2, 20, 20, 20), dtype=tf.float32),
             filters=tf.zeros((2, 2, 20, 20), dtype=tf.float32),
             strides=(1, 1, 1, 1), padding="VALID")
print("done")
The last lines of the log are as follows:
2020-03-06 17:06:48.920491: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-06 17:06:49.029343: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-06 17:06:49.473013: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-06 17:06:49.474368: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
nogrowth
1 Physical GPUs, 1 Logical GPUs
Traceback (most recent call last):
File "./run_cuda_con2d_last.py", line 24, in <module>
strides=(1,1,1,1), padding="VALID")
File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 1914, in conv2d_v2
name=name)
File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2011, in conv2d
name=name)
File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 937, in conv2d
_ops.raise_from_not_ok_status(e, name)
File "/data/anasynth/anaconda3/envs/tf2.1/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]
I don't see a means to provide full log files here. You can find them, for various changes in the order of invocation of the different operations, here
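In case it helps with diagnosing this, another variant of the workaround that I have seen suggested (and have not yet tried on our setup) is to cap the GPU memory explicitly with a virtual device instead of enabling growth. A rough sketch using the TF 2.1 experimental API:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to a fixed memory budget (here 1024 MB) on the first GPU.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    except RuntimeError as e:
        # Virtual devices must be configured before the GPUs are initialized.
        print(e)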