Hi NVIDIA team,
I believe there is an error in the reported TF32 Tensor TFLOPS for the GA102 family:
https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf
In particular, from page 25 we see that each Tensor Core in a GA10x SM can execute:
128 FP16 FMA ops per clock (dense) = 64 FP32 FMA ops per clock
When computing the Peak TF32 Tensor TFLOPS for the RTX 3090, the numbers (pages 44-45) work out using the formula
ops per Tensor Core per clock * Tensor Cores * GPU Boost Clock (MHz), i.e.
64 * 328 * 1695 MHz ≈ 35.6 TFLOPS.
However, the same formula does not work for the A40 (pages 15-16) of the same family:
64 * 336 * 1740 MHz ≈ 37.4 TFLOPS, which is half of the reported 74.8 TFLOPS.
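For reference, here is a minimal Python sketch of the computation I am doing (the function and parameter names are my own; the spec values are taken from the whitepaper pages cited above):

```python
# Minimal sketch of the peak Tensor TFLOPS computation described above.
# Spec values (ops/clock, Tensor Core count, boost clock) come from the
# GA102 whitepaper pages cited above.

def peak_tensor_tflops(ops_per_tc_per_clock: int,
                       tensor_cores: int,
                       boost_clock_mhz: int) -> float:
    """Peak TFLOPS = ops per Tensor Core per clock * cores * clock."""
    # MHz -> Hz (x 1e6), then ops/s -> tera-ops/s (/ 1e12)
    return ops_per_tc_per_clock * tensor_cores * boost_clock_mhz * 1e6 / 1e12

print(peak_tensor_tflops(64, 328, 1695))  # RTX 3090: ~35.6, matches the whitepaper
print(peak_tensor_tflops(64, 336, 1740))  # A40: ~37.4, half of the reported 74.8
```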
Is there something I'm missing, or are the reported numbers wrong?
Thank you,
Antonis