cuDNN TF32 performs no better than FP32 on RTX 3090

Description

I wanted to compare the performance of convolutions in TF32 versus FP32 on an RTX 3090, and I found that TF32 performs no better than FP32. Why is that?

Environment

TensorRT Version:
GPU Type: GeForce RTX 3090
Nvidia Driver Version: 455.38
CUDA Version: 11.1
CUDNN Version: 8.0.5
Operating System + Version: CentOS Linux release 7.4.1708 (Core)
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): 2.4.0
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

import tensorflow as tf
import numpy as np

# False disables TF32, so the convolution runs in plain FP32;
# set this to True (the default on Ampere) to enable TF32.
tf.config.experimental.enable_tensor_float_32_execution(False)

# 1x5x5x1 input (NHWC) and a 2x2x1x2 kernel (HWIO).
x_in = np.array([[
    [[2], [1], [2], [0], [1]],
    [[1], [3], [2], [2], [3]],
    [[1], [1], [3], [3], [0]],
    [[2], [2], [0], [1], [1]],
    [[0], [0], [3], [1], [2]],
]])
kernel_in = np.array([
    [[[2, 0.1]], [[3, 0.2]]],
    [[[0, 0.3]], [[1, 0.4]]],
])
x = tf.constant(x_in, dtype=tf.float32)
kernel = tf.constant(kernel_in, dtype=tf.float32)
out = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID')

Steps To Reproduce

Save the code to a file “test_conv.py” and run “nsys nvprof python3 test_conv.py” in a terminal; the profile shows the time taken by each kernel.


Hi @824023604,
is this really the problem size?

If so, these tensors are quite small: far too small to fully occupy even one SM. At that size the kernel time is dominated by launch overhead rather than math throughput, so TF32's tensor-core speedup cannot show up.

Thanks!
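The occupancy point can be made concrete with some quick arithmetic (plain Python; the 1536-threads-per-SM and 82-SM figures are GA102/RTX 3090 specs assumed here, not stated in this thread):

```python
# How much work does this convolution actually produce?
in_h, in_w = 5, 5        # input spatial size (from x_in above)
k_h, k_w = 2, 2          # kernel spatial size
out_channels = 2         # last dimension of kernel_in

# 'VALID' padding, stride 1
out_h = in_h - k_h + 1   # 4
out_w = in_w - k_w + 1   # 4
out_elems = out_h * out_w * out_channels  # 32 output values in total

threads_per_sm = 1536    # max resident threads per Ampere SM (assumed)
num_sms = 82             # SM count on an RTX 3090 (assumed)

print(out_elems)                                    # 32
print(out_elems / (threads_per_sm * num_sms))       # tiny fraction of the GPU
```

Even at one thread per output element, 32 threads fill about 2% of a single SM, which is why the TF32 and FP32 timings look identical at this size.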