Slow tensorflow-gpu execution on A100

Hello,
Running a Python script that uses tensorflow-gpu on an A100 GPU takes more than 20 minutes, while the same script in the same environment takes less than 1 minute on a Quadro RTX 5000 or GeForce RTX 208.

The long wait occurs between these two log lines, which are about 20 minutes apart:

2021-04-05 12:34:13.781180: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7

2021-04-05 12:54:17.180852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
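For reference, the gap between the two log lines quoted above can be computed directly from their timestamps. This is only a small sketch: the helper names `parse_timestamp` and `gap_seconds` are made up for illustration, and it assumes every log line starts with a 26-character `YYYY-MM-DD HH:MM:SS.ffffff` timestamp, as these two do.

```python
from datetime import datetime

# The two log lines from the post, copied verbatim.
LOG_LINES = [
    "2021-04-05 12:34:13.781180: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7",
    "2021-04-05 12:54:17.180852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10",
]


def parse_timestamp(line):
    """Return the leading timestamp of a log line as a datetime.

    Assumption: the timestamp occupies the first 26 characters of the line.
    """
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def gap_seconds(first, second):
    """Seconds elapsed between two log lines (hypothetical helper)."""
    return (parse_timestamp(second) - parse_timestamp(first)).total_seconds()


if __name__ == "__main__":
    gap = gap_seconds(*LOG_LINES)
    print(f"gap: {gap:.1f} s ({gap / 60:.1f} min)")
```

The same helper could be pointed at any pair of consecutive lines in the full log to find where the time goes.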

What could be the reason for such a long wait?

Thank you.
Inga

Environment

TensorRT Version:
GPU Type: A100-PCIE-40GB
Nvidia Driver Version: 450.51.06
CUDA Version: 11.0
CUDNN Version: 7.6.5
Operating System + Version: Red Hat Enterprise Linux release 8.3
Python Version (if applicable): 3.7.9
TensorFlow Version (if applicable): 2.1.0
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): N/A

Relevant Files

Python code:

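import tensorflow as tf
import time
import timeit

# Start of the total run time reported at the end of the script.
# (tic_total was not defined in the posted snippet; its position here is assumed.)
tic_total = time.time()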

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print(
        '\n\nThis error most likely means that this notebook is not '
        'configured to use a GPU. Change this in Notebook Settings via the '
        'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
    raise SystemError('GPU device not found')

def cpu():
    with tf.device('/cpu:0'):
        random_image_cpu = tf.random.normal((100, 100, 100, 3))
        net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
        return tf.math.reduce_sum(net_cpu)

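def gpu():
    print(' with tf.device(/GPU:0) time start')
    # tic/toc were not defined in the posted snippet; they are assumed
    # to bracket entry into the GPU device scope.
    tic = time.time()
    with tf.device('/GPU:0'):
        toc = time.time()
        print(' with tf.device(/GPU:0): time run = ', (toc - tic) / 60)
        random_image_gpu = tf.random.normal((100, 100, 100, 3))
        net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
        return tf.math.reduce_sum(net_gpu)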

# We run each op once to warm up; see: https://stackoverflow.com/a/45067900

print('cpu() time start')
cpu()

print('gpu() time start')
gpu()

# Run the op several times.

print('cpu_time = timeit.timeit time start')

cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")

print('CPU (s):')
print(cpu_time)

print('GPU (s):')
print(gpu_time)

print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))
toc_total = time.time()
print(' total time run = ', (toc_total - tic_total) / 60)
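To narrow down which stage takes the 20 minutes, the manual tic/toc bookkeeping in the script above could be replaced by a small timing helper. The following is only a sketch using the standard library: `stage_timer` and its `results` argument are hypothetical names, not part of the original script or of TensorFlow, and the `time.sleep` call stands in for a real stage such as the first `gpu()` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def stage_timer(label, results=None):
    """Time the enclosed block and print the elapsed seconds.

    If a dict is passed as `results`, the elapsed time is also stored
    under `label` so the stages can be compared afterwards.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


if __name__ == "__main__":
    timings = {}
    with stage_timer("warm-up", timings):
        time.sleep(0.01)  # stand-in for the first gpu() call
    # Name of the slowest stage recorded so far.
    print(max(timings, key=timings.get))
```

Wrapping `import tensorflow`, the device check, and the first `gpu()` call in separate `stage_timer` blocks would show whether the delay comes from library loading or from the first GPU operation.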