TF32 TFLOPs of GeForce RTX 3090 vs A40

Hi NVIDIA team,

I believe there is an error in the TF32 Tensor TFLOPS computation for the GA102 family:

In particular, page 25 states that each GA10x SM can execute:
128 FP16 FMA ops per clock (dense) = 64 FP32 FMA ops per clock

When computing the Peak TF32 Tensor TFLOPS for the 3090, the numbers (pages 44-45) work out using the formula
FMA ops per clock * Tensor Cores * GPU Boost Clock (MHz), i.e.
64 * 328 * 1695 = 35.6 TFLOPS

However, the numbers don't work for the A40 (pages 15-16), which belongs to the same family:
64 * 336 * 1740 = 37.4 TFLOPS, which is half of the reported 74.8 TFLOPS.
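For reference, here is a small Python sketch of the arithmetic above; the helper name `peak_tflops` is my own, not from any NVIDIA document:

```python
def peak_tflops(fma_ops_per_clock, tensor_cores, boost_clock_mhz):
    """Peak tensor throughput: per-clock FMA ops x tensor cores x clock.

    Clock is given in MHz, so multiply by 1e6 to get Hz, then divide
    by 1e12 to express the result in TFLOPS.
    """
    return fma_ops_per_clock * tensor_cores * boost_clock_mhz * 1e6 / 1e12

# GeForce RTX 3090: 328 tensor cores at 1695 MHz boost clock
rtx_3090 = peak_tflops(64, 328, 1695)  # ~35.6, matches the whitepaper

# A40: 336 tensor cores at 1740 MHz boost clock
a40 = peak_tflops(64, 336, 1740)       # ~37.4, half of the reported 74.8

print(f"RTX 3090: {rtx_3090:.1f} TFLOPS")
print(f"A40:      {a40:.1f} TFLOPS")
```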

Is there something I'm missing, or are the reported numbers wrong?

Thank you,

There is nothing wrong with the numbers. The Tensor Core units in those two GPUs do not necessarily behave in precisely the same way. As far as I know, a detailed description of the differences is unpublished.

Thank you for the quick reply!