cuDNN TF32 performs no better than FP32 on RTX 3090

I want to compare the performance of convolutions with TF32 and FP32 on an RTX 3090, but I find that TF32 is no better than FP32. Why?

Environment

TensorRT Version :
GPU Type : GeForce RTX 3090
Nvidia Driver Version : 455.38
CUDA Version : 11.1
CUDNN Version : 8.0.5
Operating System + Version : CentOS Linux release 7.4.1708 (Core)
Python Version (if applicable) : 3.6.8
TensorFlow Version (if applicable) : 2.4.0
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :

Relevant Files

import tensorflow as tf
import numpy as np
# TF32 is enabled by default in TF 2.4; False forces full FP32. Use True for the TF32 run.
tf.config.experimental.enable_tensor_float_32_execution(False)
x_in = np.array([[
  [[2], [1], [2], [0], [1]],
  [[1], [3], [2], [2], [3]],
  [[1], [1], [3], [3], [0]],
  [[2], [2], [0], [1], [1]],
  [[0], [0], [3], [1], [2]], ]])
kernel_in = np.array([
 [ [[2, 0.1]], [[3, 0.2]] ],
 [ [[0, 0.3]],[[1, 0.4]] ], ])
# Input is NHWC with shape [1, 5, 5, 1]; kernel is HWIO with shape [2, 2, 1, 2].
x = tf.constant(x_in, dtype=tf.float32)
kernel = tf.constant(kernel_in, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')

Steps To Reproduce

Save the code to a file “test_conv.py” and run “nsys nvprof python3 test_conv.py” in a terminal to see the time of every kernel.

Hi @824023604,
Is this really the problem size?

If so, the tensors are quite small and cannot fully occupy even one SM.
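
To put a rough number on "quite small", here is a minimal sketch that counts the multiply-accumulates in the posted convolution. The helper name conv2d_valid_macs is made up for illustration; the shapes are taken from the script above (input [1, 5, 5, 1], kernel [2, 2, 1, 2], stride 1, 'VALID' padding).

```python
def conv2d_valid_macs(n, h, w, c, r, s, k, stride=1):
    """Multiply-accumulate count for an NHWC conv2d with 'VALID' padding.

    n, h, w, c: input batch, height, width, channels
    r, s, k:    filter height, filter width, output channels
    (Hypothetical helper for illustration, not part of TensorFlow or cuDNN.)
    """
    p = (h - r) // stride + 1  # output height
    q = (w - s) // stride + 1  # output width
    # Each of the n*p*q*k outputs needs c*r*s multiply-accumulates.
    return n * p * q * k * c * r * s


# Shapes from the posted script: x is [1, 5, 5, 1], kernel is [2, 2, 1, 2].
macs = conv2d_valid_macs(n=1, h=5, w=5, c=1, r=2, s=2, k=2)
print(macs)  # 128
```

At 128 multiply-accumulates, launch overhead will most likely dominate the kernel time, so the numeric format has little room to make a difference.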

Thanks!

I changed the shape of x_in from [1, 5, 5, 1] to [10, 5000, 5000, 1], but the result is the same.

Hi @824023604,
That’s still a single-channel image.
Please refer to the doc below to understand why the size you are running is not giving you the results you expect.

Thanks!

Sorry, I can't see the doc you mentioned. Could you please provide a link?
Thanks!

Hi @824023604 ,
Apologies.
Please find the doc here:
Convolutional Layers User Guide :: NVIDIA Deep Learning Performance Documentation
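
As a rough illustration of the single-channel point: the guide describes a forward convolution as an implicit GEMM with M = N*P*Q, N = K and K = C*R*S, where N is the batch size, P and Q the output height and width, C and K the input and output channel counts, and R and S the filter height and width. The sketch below applies that mapping to the enlarged input from this thread. The helper name implicit_gemm_dims is hypothetical, and it assumes the kernel is still [2, 2, 1, 2] with stride 1 and 'VALID' padding.

```python
def implicit_gemm_dims(n, h, w, c, r, s, k, stride=1):
    """GEMM dimensions of a forward conv2d ('VALID' padding) viewed as an implicit GEMM.

    Hypothetical helper for illustration, not a cuDNN or TensorFlow API.
    """
    p = (h - r) // stride + 1  # output height
    q = (w - s) // stride + 1  # output width
    return {
        "M": n * p * q,   # one GEMM row per output pixel
        "N": k,           # one GEMM column per output channel
        "K": c * r * s,   # reduction over input channels and the filter window
    }


# Enlarged input from the thread: [10, 5000, 5000, 1], kernel assumed unchanged.
dims = implicit_gemm_dims(n=10, h=5000, w=5000, c=1, r=2, s=2, k=2)
print(dims)  # {'M': 249900010, 'N': 2, 'K': 4}
```

Growing the image only grows M. With one input channel and two output channels, N and K stay at 2 and 4, which leaves very little arithmetic per output element for Tensor Cores to accelerate, so this is a plausible reason TF32 and FP32 time the same here. Calling the same helper with larger channel counts (for example c=64, k=64) shows how N and K scale with C and K.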

Thanks!