GPU utilization with tf32 vs. fp32

I profiled an nn.Linear(1408, 1408) layer in Nsight Compute.
(input shape: (256, 6, 1408), output shape: (256, 6, 1408))
With fp32, the first profile showed 73.32% SM throughput and 63.44% FMA pipe utilization (which seems to use the compute units well…).
But when I ran the same layer with tf32 (by adding torch.backends.cuda.matmul.allow_tf32 = True and
torch.backends.cudnn.allow_tf32 = True), SM throughput dropped to 48.22%. The Compute Workload Analysis section also says tensor pipe utilization is only 48.34%. Meanwhile, gpu__time_duration.sum was reduced by 35.51%.
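
The setup described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact script that was profiled: `run_linear` and `speedup_from_reduction` are hypothetical helper names, a CUDA device is assumed, and the warm-up call is an addition for cleaner profiling. The second helper only converts the reported 35.51% duration reduction into a speedup factor.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional runtime reduction (0.3551 for 35.51%) to a speedup factor."""
    return 1.0 / (1.0 - reduction)


def run_linear(allow_tf32: bool, batch_shape=(256, 6), features=1408):
    """Run one forward pass of nn.Linear with TF32 enabled or disabled."""
    import torch  # imported lazily so the helper above works without PyTorch

    # Flags quoted in the post. The cuDNN flag governs convolutions and is
    # kept here only to mirror the original setup.
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    torch.backends.cudnn.allow_tf32 = allow_tf32

    layer = torch.nn.Linear(features, features).cuda()
    x = torch.randn(*batch_shape, features, device="cuda")
    with torch.no_grad():
        layer(x)                  # warm-up call (assumption: not part of the original run)
        torch.cuda.synchronize()
        y = layer(x)              # the launch to inspect in the profiler
        torch.cuda.synchronize()
    return tuple(y.shape)
```

With a CUDA device available, `run_linear(False)` and `run_linear(True)` both return `(256, 6, 1408)`. `speedup_from_reduction(0.3551)` comes out at roughly 1.55, so the tf32 run is about 1.55x faster even though its reported utilization percentages are lower.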

I know that SM throughput and pipe utilization can increase with larger batch sizes, but the tensor pipe utilization never goes above 50%.

I cannot understand why this is happening.

  1. Why is the tensor pipe utilization for tf32 so low?
  2. Why does the tensor pipe utilization not go up with higher batch sizes?
    (Something similar happened when I used nn.Linear(1152, 1152) with input shape (64, 16, 1152) and output shape (64, 16, 1152).)
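
For reference, the two configurations can be reduced to plain matrix-multiply sizes. The sketch below assumes the leading input dimensions are flattened into a single M dimension, which is how the math of a linear layer works out (the kernel the library actually launches may be organized differently), and it ignores the bias add. `gemm_dims` and `gemm_flops` are hypothetical helper names.

```python
from math import prod


def gemm_dims(input_shape, out_features):
    """Return (M, N, K) for a linear layer applied to `input_shape`."""
    *leading, in_features = input_shape
    m = prod(leading)  # all leading dims collapse into the GEMM row count
    return m, out_features, in_features


def gemm_flops(m, n, k):
    """Multiply-add count of an (M x K) @ (K x N) product, 2 FLOPs per MAC."""
    return 2 * m * n * k


for shape, out in [((256, 6, 1408), 1408), ((64, 16, 1152), 1152)]:
    m, n, k = gemm_dims(shape, out)
    print(shape, "->", (m, n, k), f"{gemm_flops(m, n, k) / 1e9:.2f} GFLOP")
# (256, 6, 1408) -> (1536, 1408, 1408) 6.09 GFLOP
# (64, 16, 1152) -> (1024, 1152, 1152) 2.72 GFLOP
```

Under this view, raising the batch size only increases M; N and K stay fixed by the layer's weight shape.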

I have attached screenshots of the ncu-rep for both tf32 and fp32 (the blue bars are tf32 and the green bars are fp32). I would really appreciate any insights on this.
tf32 vs fp32.pdf (2.1 MB)