cuDNN TF32 performs no better than FP32 on RTX 3090

Description

I wanted to compare the performance of convolutions in TF32 versus FP32 on an RTX 3090, and I found that TF32 performs no better than FP32. Why is that?

Environment

TensorRT Version:
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 455.38
CUDA Version: 11.1
CUDNN Version: 8.0.5
Operating System + Version: CentOS Linux release 7.4.1708 (Core)
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): 2.4.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

import tensorflow as tf
import numpy as np

# False disables TF32, so the convolution runs in plain FP32;
# set this to True (the default on Ampere) to enable TF32.
tf.config.experimental.enable_tensor_float_32_execution(False)

# 1x5x5x1 input (NHWC) and a 2x2x1x2 kernel (HWIO).
x_in = np.array([[
    [[2], [1], [2], [0], [1]],
    [[1], [3], [2], [2], [3]],
    [[1], [1], [3], [3], [0]],
    [[2], [2], [0], [1], [1]],
    [[0], [0], [3], [1], [2]],
]])
kernel_in = np.array([
    [[[2, 0.1]], [[3, 0.2]]],
    [[[0, 0.3]], [[1, 0.4]]],
])
x = tf.constant(x_in, dtype=tf.float32)
kernel = tf.constant(kernel_in, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')

Steps To Reproduce

Save the code to a file “test_conv.py” and run “nsys nvprof python3 test_conv.py” in a terminal; the profile shows the time taken by each kernel.


Hi @824023604,
is this really the problem size?

If so, these tensors are quite small: far too small to fully occupy even one SM. At that size the kernel time is dominated by launch overhead rather than math throughput, so TF32's tensor-core speedup cannot show up.

Thanks!
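The occupancy point can be made concrete with some quick arithmetic (plain Python; the 1536-threads-per-SM and 82-SM figures are GA102/RTX 3090 specs assumed here, not stated in this thread):

```python
# How much work does this convolution actually produce?
in_h, in_w = 5, 5        # input spatial size (from x_in above)
k_h, k_w = 2, 2          # kernel spatial size
out_channels = 2         # last dimension of kernel_in

# 'VALID' padding, stride 1
out_h = in_h - k_h + 1   # 4
out_w = in_w - k_w + 1   # 4
out_elems = out_h * out_w * out_channels  # 32 output values in total

threads_per_sm = 1536    # max resident threads per Ampere SM (assumed)
num_sms = 82             # SM count on an RTX 3090 (assumed)

print(out_elems)                                    # 32
print(out_elems / (threads_per_sm * num_sms))       # tiny fraction of the GPU
```

Even at one thread per output element, 32 threads fill about 2% of a single SM, which is why the TF32 and FP32 timings look identical at this size.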