GPU utilization for using tf32 vs fp32

dede0008 · April 1, 2024, 4:59am

Hi.
I profiled an nn.Linear(1408, 1408) layer in Nsight Compute.
(input shape: (256, 6, 1408), output shape: (256, 6, 1408))
With fp32, the first profile showed 73.32% SM throughput and 63.44% FMA pipe utilization, which suggests the compute units are well utilized.
When I enabled tf32 for the same layer (by setting torch.backends.cuda.matmul.allow_tf32 = True and torch.backends.cudnn.allow_tf32 = True), SM throughput dropped to 48.22%. The Compute Workload Analysis section also reports a tensor pipe utilization of only 48.34%. Meanwhile, gpu__time_duration.sum was reduced by 35.51%.
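
As a rough sanity check on the figures above, the layer can be viewed as a single GEMM. The sketch below is only back-of-the-envelope arithmetic: the helper names are hypothetical, the bias add is ignored, and it assumes "reduced by 35.51%" means the tf32 kernel's duration is 35.51% shorter than the fp32 kernel's.

```python
# Back-of-the-envelope numbers for the profiled layer (bias ignored).
# Helper names are illustrative, not part of PyTorch or Nsight Compute.

def linear_gemm_flops(batch_dims, in_features, out_features):
    """Return (M, FLOPs) for a linear layer viewed as an (M x K) @ (K x N) GEMM."""
    m = 1
    for d in batch_dims:
        m *= d  # all leading dimensions are flattened into the GEMM's M
    # One multiply and one add per (m, k, n) triple.
    return m, 2 * m * in_features * out_features


def speedup_from_reduction(reduction_pct):
    """Convert 'duration reduced by X%' into a speedup factor (assumed meaning)."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


m, flops = linear_gemm_flops((256, 6), 1408, 1408)
tf32_speedup = speedup_from_reduction(35.51)

print(f"GEMM: M={m}, K=1408, N=1408, about {flops / 1e9:.2f} GFLOP per call")
print(f"tf32 speedup over fp32: about {tf32_speedup:.2f}x")
```

Under that reading, the tf32 run is roughly 1.55x faster than the fp32 run, even though the reported pipe utilization percentage is lower.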

I know that SM throughput and pipe utilization can increase with larger batch sizes, but the tensor pipe utilization never rises above 50%.
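
To make the batch-size expectation concrete, here is a minimal sketch of how the ideal arithmetic intensity (FLOPs per byte moved) of this GEMM grows with batch size. It assumes each operand is read once and the output written once, 4 bytes per element (tf32 inputs are still stored as 32-bit values in memory), and that only the first input dimension changes while the 6 stays fixed. The function name and the batch sizes other than 256 are hypothetical. It illustrates why more work per byte is expected at larger batches; it does not by itself explain the 50% ceiling.

```python
# Ideal arithmetic intensity of an (M x K) @ (K x N) GEMM.
# Assumes perfect reuse: A and B are each read once, C is written once.

def ideal_arithmetic_intensity(m, k, n, bytes_per_element=4):
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


K = N = 1408          # in_features and out_features of the layer
intensities = {}
for batch in (64, 256, 1024):   # 256 is the profiled case; the others are illustrative
    m = batch * 6               # leading dims (batch, 6) flatten into M
    intensities[batch] = ideal_arithmetic_intensity(m, K, N)
    print(f"batch={batch:5d}  M={m:5d}  intensity={intensities[batch]:.1f} FLOP/byte")
```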

I cannot understand why this is happening.

Why is the tensor pipe utilization for tf32 so low?
Why does the tensor pipe utilization not go up with larger batch sizes?
(Similar behavior occurred with nn.Linear(1152, 1152), input shape: (64, 16, 1152), output shape: (64, 16, 1152).)

I attached screenshots of the ncu-rep for both tf32 and fp32 (the blue bar is tf32 and the green bar is fp32). I would really appreciate any insights on this.
tf32 vs fp32.pdf (2.1 MB)
