How to calculate the Tensor Core FP16 performance of H100?

The H100 whitepaper claims a Tensor Core FP16 with FP32 accumulate performance of 756 TFLOPS for the PCIe version.

I have read all the whitepapers for data center GPUs since Volta. The performance of Tensor Core FP16 with FP32 accumulate has always been four times vanilla FP16, as there are always four times as many Tensor Core ops available. If that held here, the performance for the H100 PCIe should be 409.76 TFLOPS, yet the whitepaper claims 756.

What is going on here? Is it because of the TMA that was added? Is there a formula I can use to calculate the 756 TFLOPS, or is it an empirical measurement?

Interestingly, Tensor Core FP64 performs as expected, at 2x the vanilla FP64 performance, unlike TF32, FP16, INT8, and FP8.

The Volta whitepaper explicitly indicates that each TC unit in Volta delivers 64 FMA ops per clock (equal to 128 FLOPs/clk). Viewed at the SM level, the SM as a whole (with 8 TC units) is capable of 1024 FLOPs/clk. This lines up with the stated V100 FP16 TC throughput numbers, which range from approximately 112 to 130 TFLOP/s depending on SKU/variant. Let's convince ourselves of that. For the V100 PCIe with 80 SMs, this would be

80 x 1024 = 81920 FLOPs/clk

Dividing the stated 112 TFLOP/s performance of the V100 PCIe by that number:

112,000,000 MFLOP/s / 81920 FLOP/clk = 1367 Mclk/s = 1367MHz

Which is a clock rate that is in line with the stated boost clock of V100.
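The per-SM and per-GPU arithmetic above can be checked with a few lines of Python (a sketch using the SM count and the rated 112 TFLOP/s figure quoted above):

```python
# Per-SM FP16 TC throughput on Volta: 8 TC units * 64 FMA/clk * 2 FLOP/FMA
flop_per_clk_per_sm = 8 * 64 * 2          # = 1024
sms = 80                                  # V100 PCIe SM count
gpu_flop_per_clk = sms * flop_per_clk_per_sm
print(gpu_flop_per_clk)                   # 81920

# Dividing the rated 112 TFLOP/s (in MFLOP/s) by FLOP/clk gives Mclk/s = MHz
implied_mhz = 112e6 / gpu_flop_per_clk
print(round(implied_mhz))                 # 1367
```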

Moving on to Ampere A100, the whitepaper states that the A100 TC unit delivers 256 FMA ops/clk, and considered at the SM level (four 3rd gen TC units/SM) this translates to 1024 FMA ops/clk, or 2048 FLOPs/clk, a doubling of the TC throughput for FP16 (non-sparsity) when comparing a Volta SM to an Ampere SM, clock-for-clock. Likewise we can confirm the stated 312 TFLOP/s number for A100 with 108 SMs in a similar fashion:

108 x 2048 = 221,184 FLOP/clk

and

312,000,000 MFLOP/s / 221,184 FLOP/clk = 1410M clk/s = 1410MHz

which is again in line with the stated/published boost clock for the A100 GPU.

Moving on to Hopper H100, the whitepaper simply states that the per SM throughput is again doubled compared to Ampere. So we are now at 4096 FLOP/clk per SM.

The H100 PCIe has 114 SMs, so we get, per GPU:

114 x 4096 = 466,944 FLOP/clk

The stated perf is 756 TFLOP/s, so

756,000,000 MFLOP/s / 466,944 FLOP/clk = 1620M clk/s = 1620MHz

The H100 PCIe board specification lists a max boost frequency of 1755MHz.

But, as pointed out below, Table 3 in the H100 whitepaper indicates that the max boost clock for TC usage on the H100 PCIe is 1620MHz. So this calculation lines up with the stated boost frequency.
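The same division can be wrapped in one small function and run across all three generations (SM counts and per-SM throughputs as quoted in this post; each result should land near the corresponding part's TC boost clock):

```python
def implied_clock_mhz(rated_tflops, sms, flop_per_clk_per_sm):
    """Clock rate (MHz) implied by a rated TFLOP/s figure."""
    # TFLOP/s expressed in MFLOP/s, divided by FLOP/clk, yields Mclk/s = MHz
    return rated_tflops * 1e6 / (sms * flop_per_clk_per_sm)

# (GPU, rated FP16 TC TFLOP/s, SMs, FLOP/clk per SM) from the posts above
for name, tflops, sms, fpc in [("V100 PCIe", 112, 80, 1024),
                               ("A100",      312, 108, 2048),
                               ("H100 PCIe", 756, 114, 4096)]:
    print(f"{name}: ~{implied_clock_mhz(tflops, sms, fpc):.0f} MHz")
```

The implied clocks (roughly 1367, 1411, and 1619 MHz) match the published boost clocks to within rounding.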

Thanks for your detailed reply. I think my mistake was not doubling the TC performance.

Other than that, the discrepancy can be attributed to different boost frequencies: the CUDA cores plus the FP64 Tensor Cores use 1755MHz, while the Tensor Cores other than FP64 use 1620MHz.

CUDA core ops per SM per clock, and theoretical performance based on the 1755MHz boost:
FP64: 64 => 64 * 114 * 1.755 * 2 / 1000 = 25.60896 TFLOPS
FP32: 128 => 128 * 114 * 1.755 * 2 / 1000 = 51.21792 TFLOPS
FP16: 256 => 256 * 114 * 1.755 * 2 / 1000 = 102.43584 TFLOPS

Tensor Core ops per SM per clock, and theoretical performance based on the 1620MHz boost (FP64 TC computed at 1755MHz):
FP64: 128 => 128 * 114 * 1.755 * 2 / 1000 = 51.21792 TFLOPS
TF32: 1024 => 1024 * 114 * 1.62 * 2 / 1000 = 378.22464 TFLOPS
FP16: 2048 => 2048 * 114 * 1.62 * 2 / 1000 = 756.44928 TFLOPS
FP8: 4096 => 4096 * 114 * 1.62 * 2 / 1000 = 1512.89856 TFLOPS
INT8: 4096 => 4096 * 114 * 1.62 * 2 / 1000 = 1512.89856 TOPS
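The whole table can be reproduced with one short Python helper (114 SMs, 2 FLOPs per FMA; the FP64 TC row uses the 1755MHz clock as in the arithmetic above, the other TC rows use 1620MHz):

```python
SMS = 114  # H100 PCIe

def peak_tflops(ops_per_sm_per_clk, ghz):
    # ops/clk per SM * SMs * clock (GHz) * 2 FLOP/FMA, scaled to TFLOP/s
    return ops_per_sm_per_clk * SMS * ghz * 2 / 1000

print(round(peak_tflops(256, 1.755), 5))   # CUDA-core FP16: 102.43584
print(round(peak_tflops(2048, 1.62), 5))   # TC FP16:        756.44928
print(round(peak_tflops(4096, 1.62), 5))   # TC FP8:         1512.89856
```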

These numbers finally agree with the published ones. I think all the discrepancies are due to the reduction of the boost frequency from 1755MHz to 1620MHz; probably someone forgot to update the CUDA core and FP64 TC performance with the lower clock.

Oops… I made another mistake. Table 3 of the H100 whitepaper explicitly states that the boost clock for all TCs and the FP64 CUDA cores is 1620MHz, whereas the rest run at 1755MHz. So the math in my previous post is all correct.

Different boost clocks for different core types seem to be a feature unique to the Hopper architecture. That's why the numbers seem off only for H100.