Slow tensorflow-gpu execution on A100 GPU


Running a Python script that uses tensorflow-gpu on an A100 GPU takes more than 20 minutes, while running the same script in the same environment on a Quadro RTX 5000 or a GeForce RTX 2080 takes less than 1 minute.

The long waiting time falls between these two log lines (note the 20-minute gap in the timestamps):

2021-04-05 12:34:13.781180: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library

2021-04-05 12:54:17.180852: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
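One possibility worth ruling out (an assumption, not something the logs alone confirm): the TensorFlow 2.1.0 / CUDA 11.0 binaries predate the A100's compute capability 8.0, so the CUDA driver may be JIT-compiling PTX for every kernel at startup. The driver caches JIT results, so a sketch of a mitigation is to make sure the compute cache is enabled and large enough in the shell that launches the script:

```shell
# CUDA driver JIT-cache settings (documented CUDA environment variables);
# set these in the shell that will launch the Python script.
export CUDA_CACHE_DISABLE=0            # 0 = keep the compute cache enabled
export CUDA_CACHE_MAXSIZE=2147483647   # raise the cache size limit (bytes)
```

If JIT compilation is the cause, the first run will still be slow, but subsequent runs should reuse the cached binaries and start quickly.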

What could be the reason for such a long waiting time?

Thank you.


TensorRT Version: 2.1.0
GPU Type: A100-PCIE-40GB
Nvidia Driver Version: 450.51.06
CUDA Version: 11.0
CUDNN Version: 7.6.5
Operating System + Version: Red Hat Enterprise Linux release 8.3
Python Version (if applicable): 3.7.9
TensorFlow Version (if applicable): 2.1.0
PyTorch Version (if applicable): N/A
Baremetal or Container (if container which image + tag): N/A

Relevant Files

python code:

import tensorflow as tf
import time
import timeit

tic_total = time.time()

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print('\n\nThis error most likely means that this notebook is not '
          'configured to use a GPU. Change this in Notebook Settings via the '
          'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
    raise SystemError('GPU device not found')

def cpu():
    with tf.device('/cpu:0'):
        random_image_cpu = tf.random.normal((100, 100, 100, 3))
        net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
        return tf.math.reduce_sum(net_cpu)

def gpu():
    with tf.device('/GPU:0'):
        random_image_gpu = tf.random.normal((100, 100, 100, 3))
        net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
        return tf.math.reduce_sum(net_gpu)

# We run each op once to warm up; see:
print('cpu() time start')
cpu()
print('gpu() time start')
gpu()

# Run the op several times.
print('cpu_time = timeit.timeit time start')
cpu_time = timeit.timeit('cpu()', number=10, setup='from __main__ import cpu')
gpu_time = timeit.timeit('gpu()', number=10, setup='from __main__ import gpu')

print('CPU (s):', cpu_time)
print('GPU (s):', gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time / gpu_time)))
toc_total = time.time()
print('total time run =', (toc_total - tic_total) / 60)
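To narrow down where the 20 minutes go, the script above can be reduced to a check of whether the cost is paid once (at the first GPU op) or on every op. This is a minimal sketch, assuming the same TF 2.1.0 environment; `time_call` and `check_warmup_cost` are helper names introduced here, not part of the original script:

```python
import time

def time_call(fn):
    """Return (result, elapsed seconds) for a single call to fn."""
    t0 = time.time()
    result = fn()
    return result, time.time() - t0

def check_warmup_cost():
    """Run on the A100 host: time the first GPU op against a second one."""
    import tensorflow as tf  # assumes the TF 2.1.0 environment described above
    # The first GPU op pays any one-time cost (CUDA library loading, PTX JIT).
    _, first = time_call(lambda: tf.random.normal((10, 10)).numpy())
    # A second identical op is fast if that cost was initialization only.
    _, second = time_call(lambda: tf.random.normal((10, 10)).numpy())
    print('first op: %.1f s, second op: %.3f s' % (first, second))
```

If `check_warmup_cost()` reports a very slow first op but a fast second op, the problem is one-time initialization (consistent with the gap between the two "opened dynamic library" log lines) rather than slow kernel execution.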

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi @paster,

This doesn’t look like a TensorRT-related issue. We request you to post your concern on a TensorFlow-related forum to get better help.

Thank you.