Performance on A100 SXM4 40GB: TF32 vs FP32

I run a matmul on two 10240×10240 matrices.
As far as I know, enabling TF32 mode on an A100 should give a large speedup.
Based on the report, it should be:
FP32: 0.14s
TF32: 0.018s
But I get:
FP32: 0.619
TF32: 1.785

CUDA version 11.7
PyTorch 1.13

What is the problem with this code?
Since the FP32 run happens before the TF32 run, the data should already be resident in GPU memory. But the TF32 performance is still worse.

import torch
import time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

a_full = torch.randn(10240, 10240, dtype=torch.float, device='cuda')
b_full = torch.randn(10240, 10240, dtype=torch.float, device='cuda')
ab_full = a_full @ b_full

a = a_full.float()
b = b_full.float()

# Do matmul with TF32 disabled.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
start = time.time()
ab_fp32 = a @ b # takes 0.14s on GA100
end = time.time()
print("FP32:",(end - start)*1000)

# Do matmul at TF32 mode.
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True
start = time.time()
ab_tf32 = a @ b # takes 0.018s on GA100
end = time.time()
print("TF32:",(end - start)*1000)

Update: I just found how to fix it here: Link
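For reference, the usual explanation for timings like these is that CUDA kernels launch asynchronously, so `time.time()` around `a @ b` measures only the launch (or whatever pending work happens to block), not the matmul itself. A hedged sketch of a corrected benchmark, assuming the fix is to call `torch.cuda.synchronize()` around the timed region and to exclude a warm-up run (the helper name `timed_matmul` and the CPU fallback size are my own additions):

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use the original 10240×10240 size on GPU; a smaller size keeps the
# sketch runnable on CPU-only machines.
n = 10240 if device.type == "cuda" else 512
a = torch.randn(n, n, dtype=torch.float, device=device)
b = torch.randn(n, n, dtype=torch.float, device=device)

def timed_matmul(a, b):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain pending work before starting the clock
    start = time.time()
    out = a @ b
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the matmul kernel to actually finish
    return out, time.time() - start

_ = timed_matmul(a, b)  # warm-up run: pays one-time cuBLAS init cost, not timed

torch.backends.cuda.matmul.allow_tf32 = False
_, t_fp32 = timed_matmul(a, b)
print(f"FP32: {t_fp32 * 1000:.1f} ms")

torch.backends.cuda.matmul.allow_tf32 = True
_, t_tf32 = timed_matmul(a, b)
print(f"TF32: {t_tf32 * 1000:.1f} ms")
```

With synchronization in place, the TF32 run should come out faster than FP32 on an A100; `torch.cuda.Event` timing would be an alternative to wall-clock timing here.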