Hi.
I profiled an nn.Linear(1408, 1408) layer in Nsight Compute.
(input shape: (256, 6, 1408), output shape: (256, 6, 1408))
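(For reference, this runs as a GEMM with M = 256 · 6 = 1536 and N = K = 1408, so roughly 2 · M · N · K ≈ 6.1 GFLOP per call, if I computed it right.)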
I used fp32 for the first profiling run, and it gave 73.32% SM throughput and 63.44% FMA pipe utilization (which seems to make good use of the compute units).
But when I used tf32 for the same layer (after setting torch.backends.cuda.matmul.allow_tf32 = True and torch.backends.cudnn.allow_tf32 = True), SM throughput dropped to 48.22%, and the Compute Workload Analysis section says tensor pipe utilization is only 48.34%. Meanwhile, gpu__time_duration.sum was reduced by 35.51%.
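In case it helps, here is a minimal sketch of what I profiled (the warm-up count and the ncu invocation in the comment are approximations, not exactly what I ran):

```python
import torch
import torch.nn as nn

# profiled with something like:  ncu --set full -o report python repro.py
torch.backends.cuda.matmul.allow_tf32 = True   # set both to False for the fp32 run
torch.backends.cudnn.allow_tf32 = True

layer = nn.Linear(1408, 1408).cuda()
x = torch.randn(256, 6, 1408, device="cuda")   # (256, 6, 1408) -> (256, 6, 1408)

for _ in range(10):        # warm-up so the profiled kernel is steady-state
    layer(x)
torch.cuda.synchronize()

layer(x)                   # the launch I looked at in the report
torch.cuda.synchronize()
```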
I know that SM throughput and pipe utilization can improve with a larger batch size, but the tensor pipe utilization never goes above 50%.
I cannot understand why this is happening.
- Why is the tensor pipe utilization for tf32 so low?
- Why does the tensor pipe utilization not go up with higher batch sizes?
(Similar behavior occurred even with nn.Linear(1152, 1152), input shape (64, 16, 1152), output shape (64, 16, 1152).) A rough sketch of the batch-size sweep I ran is below.
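The sweep looked roughly like this (the batch sizes here are just examples, not the exact values I tried):

```python
import torch
import torch.nn as nn

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

layer = nn.Linear(1408, 1408).cuda()
for batch in (256, 512, 1024, 2048):           # example sizes
    x = torch.randn(batch, 6, 1408, device="cuda")
    for _ in range(10):                        # warm-up per size
        layer(x)
    torch.cuda.synchronize()
    layer(x)                                   # kernel profiled at this batch size
    torch.cuda.synchronize()
```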
I attached screenshots of the ncu-rep reports for both tf32 and fp32 (blue bars are tf32, green bars are fp32). I would really appreciate any insights on this.
tf32 vs fp32.pdf (2.1 MB)