The Volta whitepaper states explicitly that each Tensor Core (TC) unit in Volta delivers 64 FMA ops per clock, which equals 128 FLOPs/clk. Viewed per SM, the SM as a whole (with 8 TC units) is capable of 1024 FLOPs/clk. This lines up with the stated numbers for V100 FP16 TC throughput, which range from approximately 112 to 130 TFLOP/s depending on SKU/variant. Let's convince ourselves of that. Considering the V100 PCIE with 80 SMs, this would be
80 x 1024 = 81920 FLOPs/clk
Dividing the stated 112 TFLOP/s performance of V100 PCIE by that number:
112,000,000 MFLOP/s / 81,920 FLOP/clk ≈ 1367 Mclk/s = 1367 MHz
which is a clock rate in line with the stated boost clock of V100.
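The same back-of-the-envelope check can be written as a few lines of Python. This is only a sketch of the arithmetic above; the function name `implied_clock_mhz` and its parameter names are made up for illustration and do not come from any NVIDIA tool or document.

```python
def implied_clock_mhz(peak_tflops, sm_count, flops_per_clk_per_sm):
    """Clock (in MHz) needed to reach a stated peak, given per-SM FLOPs/clk.

    Hypothetical helper: it just divides MFLOP/s by FLOP/clk.
    """
    mflops_per_s = peak_tflops * 1_000_000          # TFLOP/s -> MFLOP/s
    flops_per_clk = sm_count * flops_per_clk_per_sm  # whole-GPU FLOPs per clock
    return mflops_per_s / flops_per_clk              # Mclk/s, i.e. MHz


# V100 PCIE: 112 TFLOP/s stated, 80 SMs, 1024 FLOPs/clk per SM
v100_pcie_mhz = implied_clock_mhz(112, 80, 1024)
print(f"V100 PCIE implied TC clock: {v100_pcie_mhz:.0f} MHz")  # prints 1367 MHz
```

Swapping in a different stated peak (for example a variant at the upper end of the 112 to 130 TFLOP/s range) gives the clock that variant would need with the same SM count.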
Moving on to Ampere A100, the whitepaper states that the A100 TC unit delivers 256 FMA ops/clk. At the SM level (four 3rd-gen TC units per SM) this translates to 1024 FMA ops/clk, or 2048 FLOPs/clk. That is a doubling of FP16 (non-sparse) TC throughput when comparing a Volta SM to an Ampere SM, clock-for-clock. Likewise, we can confirm the stated 312 TFLOP/s number for A100 with 108 SMs in a similar fashion:
108 x 2048 = 221,184 FLOP/clk
312,000,000 MFLOP/s / 221,184 FLOP/clk ≈ 1410.6 Mclk/s ≈ 1410 MHz
which is again in line with the stated/published boost clock for the A100 GPU.
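To make the per-SM step explicit, here is a small Python sketch that derives FLOPs/clk per SM from the TC count and the FMA rate per TC, then repeats the A100 check. The helper name `sm_flops_per_clk` is hypothetical; the inputs are the whitepaper figures quoted above, and one FMA is counted as 2 FLOPs.

```python
def sm_flops_per_clk(tc_units_per_sm, fma_per_tc_per_clk):
    """Per-SM tensor FLOPs per clock; each FMA counts as 2 FLOPs (multiply + add)."""
    return tc_units_per_sm * fma_per_tc_per_clk * 2


volta_sm = sm_flops_per_clk(8, 64)     # Volta: 8 TC units x 64 FMA/clk
ampere_sm = sm_flops_per_clk(4, 256)   # Ampere: 4 TC units x 256 FMA/clk

# Whole-GPU figure for A100 (108 SMs) and the clock implied by 312 TFLOP/s
a100_flops_per_clk = 108 * ampere_sm
a100_mhz = 312_000_000 / a100_flops_per_clk  # MFLOP/s divided by FLOP/clk

print(volta_sm, ampere_sm, ampere_sm // volta_sm)  # prints 1024 2048 2
print(a100_flops_per_clk)                          # prints 221184
print(f"{a100_mhz:.1f} MHz")
```

The implied clock comes out a fraction of a MHz above 1410 because the 312 TFLOP/s figure is itself rounded.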
Moving on to Hopper H100, the whitepaper simply states that the per SM throughput is again doubled compared to Ampere. So we are now at 4096 FLOP/clk per SM.
The H100 PCIE has 114 SMs, so we get, per GPU:
114 x 4096 = 466,944 FLOP/clk
The stated perf is 756 TFLOP/s, so
756,000,000 MFLOP/s / 466,944 FLOP/clk ≈ 1619 Mclk/s ≈ 1620 MHz
The H100 PCIE board specification lists a max boost frequency of 1755 MHz.
However, as pointed out below, table 3 in the H100 whitepaper indicates that the max boost clock for TC usage on H100 PCIE is 1620 MHz. So this calculation lines up with that stated TC boost frequency rather than with the board-level figure.
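Running the calculation in the forward direction shows why the two H100 PCIE clock figures matter. The Python sketch below (the function name `peak_tflops` is hypothetical, and the per-SM rate is just the "doubled compared to Ampere" assumption from above) computes the peak that each clock would imply.

```python
def peak_tflops(sm_count, flops_per_clk_per_sm, clock_mhz):
    """Peak dense FP16 TC throughput in TFLOP/s for a given clock in MHz."""
    mflops_per_s = sm_count * flops_per_clk_per_sm * clock_mhz  # FLOP/clk x Mclk/s
    return mflops_per_s / 1_000_000                             # MFLOP/s -> TFLOP/s


h100_sm = 2 * 2048  # assumed: per-SM rate doubled relative to Ampere's 2048 FLOPs/clk

at_tc_boost = peak_tflops(114, h100_sm, 1620)     # TC max boost clock from table 3
at_board_boost = peak_tflops(114, h100_sm, 1755)  # board-spec max boost clock

print(f"at 1620 MHz: {at_tc_boost:.1f} TFLOP/s")     # prints 756.4 TFLOP/s
print(f"at 1755 MHz: {at_board_boost:.1f} TFLOP/s")  # prints 819.5 TFLOP/s
```

Only the 1620 MHz clock reproduces the stated 756 TFLOP/s; the 1755 MHz board figure would imply roughly 819 TFLOP/s.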