Double precision tensor core performance on A100

The CUDA C++ Programming Guide lists the double-precision matrix shape for the Ampere tensor cores as 8x8x4.

The Ampere architecture whitepaper gives the peak (boost) clock as 1410 MHz.

Since there are 432 tensor cores per A100, I would expect the double-precision FLOP/s to be:

8x8x4 MACs per MMA x 2 floating-point ops (multiply and add) x 432 tensor cores x 1410 MHz
~ 312 TFLOP/s
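For reference, a minimal sketch of the arithmetic above (this reproduces the assumption in the question, i.e. one 8x8x4 MMA per tensor core per clock, which the answer shows cannot be right):

```python
# Assumption from the question: each of the 432 tensor cores retires
# one full 8x8x4 double-precision MMA every clock at 1410 MHz.
mma_macs = 8 * 8 * 4       # multiply-accumulates in one 8x8x4 MMA
flops_per_mac = 2          # one multiply + one add
tensor_cores = 432
clock_hz = 1410e6

tflops = mma_macs * flops_per_mac * tensor_cores * clock_hz / 1e12
print(f"{tflops:.0f} TFLOP/s")  # ~312, the number reached in the question
```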

But the whitepaper says this should be 19.5 TFLOP/s.
Also, surprisingly enough, 312 TFLOP/s is listed as the performance of the FP16 data type.

Where am I wrong in my calculation?

Thanks in advance.

If we used your logic to calculate FP16 throughput chip-wide, we would get 16x16x16 x 2 x 432 x 1410 MHz, which yields a number much higher than the actual spec throughput (312 TFLOP/s).
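To make the contradiction concrete, a quick sketch of that same one-MMA-per-clock assumption applied to the 16x16x16 FP16 shape:

```python
# Applying the question's assumption (one full MMA per tensor core per
# clock) to the 16x16x16 FP16 shape gives an absurdly large number.
mma_macs = 16 * 16 * 16    # multiply-accumulates in one 16x16x16 MMA
flops_per_mac = 2
tensor_cores = 432
clock_hz = 1410e6

tflops = mma_macs * flops_per_mac * tensor_cores * clock_hz / 1e12
print(f"{tflops:.0f} TFLOP/s")  # ~4990, vs. the spec's 312 TFLOP/s
```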

The assumption that a single TC unit has a throughput of 1 op/clk for each and every combination of TC ops listed in the programming guide cannot be correct.

The A100 whitepaper also says that the throughput of a single TC unit is 256 FMA ops/cycle for FP16, and this number is consistent with the stated chip-wide throughput.

So I think you should take 19.5 TFLOP/s as the stated chip-wide throughput (for DP TC ops), and if you need a per-unit, per-clock figure, do the division. If I do this arithmetic, I reach the conclusion that the A100 TC unit has a throughput of 16 DP FMA ops per unit per cycle:

19,500,000 MFLOP/s / (1410 MHz x 2 FLOP per FMA x 432 units) = 16.0
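Working backwards from the stated chip-wide numbers, a sketch of both divisions (the 19.5 TFLOP/s FP64 and 312 TFLOP/s FP16 figures are from the whitepaper; everything else is derived):

```python
# Derive per-tensor-core, per-clock FMA throughput from the chip-wide
# spec numbers in the A100 whitepaper.
tensor_cores = 432
clock_hz = 1410e6
flops_per_fma = 2          # one multiply + one add

def fma_per_tc_per_clk(chip_flops):
    return chip_flops / (clock_hz * flops_per_fma * tensor_cores)

print(fma_per_tc_per_clk(19.5e12))   # FP64: ~16 FMA/unit/clk
print(fma_per_tc_per_clk(312e12))    # FP16: ~256 FMA/unit/clk
```

The FP16 result (~256) matches the whitepaper's stated per-unit FP16 throughput, which supports doing the same division for FP64.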

Unlike some other functional units in the GPU, the TC units evidently cannot retire one of the listed MMA ops per cycle for every supported shape and data type.