If we applied your logic to calculate the chip-wide FP16 throughput, we would get 16x16x16x2x432x1410 MFLOP/s (with the clock in MHz),
which works out to roughly 4,990 TFLOP/s, far higher than the actual spec throughput (312 TFLOP/s).
The assumption that a single TC unit has a throughput of 1 op/clk for each and every combination of TC ops listed in the programming guide cannot be correct.
The A100 whitepaper also says that the throughput of a single TC unit is 256 FMA ops/cycle for FP16, and this number is consistent with the stated chip-wide throughput.
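To make the comparison concrete, here is a small Python sketch of the arithmetic above. The numbers (432 TC units, 1410 MHz, 2 FLOPs per FMA, 256 FMA/clk) are the ones quoted in this thread; the function and variable names are just mine for illustration.

```python
# Figures quoted in this thread; names are illustrative only.
TC_UNITS = 432       # tensor core units chip-wide
CLOCK_MHZ = 1410     # clock, taken to be in MHz
FLOPS_PER_FMA = 2    # one multiply plus one add

def chip_tflops(fma_per_unit_per_clk):
    """Chip-wide throughput in TFLOP/s for a given per-unit FMA rate."""
    mflops = fma_per_unit_per_clk * FLOPS_PER_FMA * TC_UNITS * CLOCK_MHZ
    return mflops / 1e6

# Assumption under test: one full 16x16x16 op per unit per clock.
naive_fp16 = chip_tflops(16 * 16 * 16)
# Whitepaper figure: 256 FP16 FMA ops per unit per clock.
whitepaper_fp16 = chip_tflops(256)

print(f"1 op/clk assumption: {naive_fp16:.1f} TFLOP/s")      # 4989.9
print(f"256 FMA/clk:         {whitepaper_fp16:.1f} TFLOP/s")  # 311.9
```

The second figure lands on the ~312 TFLOP/s spec number; the first overshoots it by a factor of 16.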
So I think you should take 19.5 TFLOP/s as the stated chip-wide throughput (for DP TC ops), and if you need a per-unit, per-clock throughput, do the division. If I do this arithmetic, I reach the conclusion that the A100 TC unit has a throughput of 16 DP FMA ops per unit per cycle:
19,500,000 MFLOP/s / (1410 MHz x 2 FLOPs per FMA x 432 TC units) = 16.0
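The same division can be written as a short Python sketch, in case you want to repeat it for other data types. The helper name and its default arguments are my own; the inputs are the spec numbers mentioned in this thread.

```python
def fma_per_unit_per_clk(chip_tflops, tc_units=432, clock_mhz=1410,
                         flops_per_fma=2):
    """Back out the per-unit, per-clock FMA rate from a chip-wide TFLOP/s spec."""
    chip_mflops = chip_tflops * 1e6  # TFLOP/s -> MFLOP/s, to match a clock in MHz
    return chip_mflops / (clock_mhz * flops_per_fma * tc_units)

dp_rate = fma_per_unit_per_clk(19.5)    # DP TC ops
fp16_rate = fma_per_unit_per_clk(312)   # FP16 TC ops

print(f"DP:   {dp_rate:.1f} FMA/unit/clk")    # 16.0
print(f"FP16: {fp16_rate:.1f} FMA/unit/clk")  # 256.1
```

The small excess over 16 and 256 comes from the spec figures (19.5 and 312) being rounded.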
Unlike some other functional units in the GPU, the TC units do not necessarily seem to be able to schedule 1 op per cycle on average.