How to calculate the Tensor Core FP16 performance of H100?

The H100 whitepaper claims a Tensor Core FP16 with FP32 accumulate performance of 756 TFLOPS for the PCIe version.

I have read all the whitepapers for data center GPUs since Volta. The performance of Tensor Core FP16 with FP32 accumulate has always been four times vanilla FP16, as there are always four times as many Tensor Core ops available. If that held here, the performance for the H100 PCIe should be 409.76 TFLOPS, yet the whitepaper claims 756.

What is going on here? Is it because of the TMA that was added? Is there a formula I can use to calculate the 756 TFLOPS, or is it an empirical measurement?

Interestingly, Tensor Core FP64 performs as expected, at 2x the vanilla FP64 performance, unlike TF32, FP16, INT8, and FP8.

The Volta whitepaper explicitly indicates that each TC unit in Volta delivers 64 FMA ops per clock (equal to 128 FLOPs/clk). Viewed at the SM level, the SM as a whole (with 8 TC units) is capable of 1024 FLOPs/clk. This lines up with the stated V100 FP16 TC throughput numbers, which range from approximately 112 to 130 TFLOP/s depending on SKU/variant. Let's convince ourselves of that. For the V100 PCIe with 80 SMs, this would be

80 x 1024 = 81920 FLOPs/clk

Dividing the stated 112 TFLOP/s performance of the V100 PCIe by that number:

112,000,000 MFLOP/s / 81920 FLOP/clk = 1367 Mclk/s = 1367MHz

Which is a clock rate that is in line with the stated boost clock of V100.
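The per-SM and per-GPU arithmetic above can be checked with a few lines of Python (a sketch using the SM count and the rated 112 TFLOP/s figure quoted above):

```python
# Per-SM FP16 TC throughput on Volta: 8 TC units * 64 FMA/clk * 2 FLOP/FMA
flop_per_clk_per_sm = 8 * 64 * 2          # = 1024
sms = 80                                  # V100 PCIe SM count
gpu_flop_per_clk = sms * flop_per_clk_per_sm
print(gpu_flop_per_clk)                   # 81920

# Dividing the rated 112 TFLOP/s (in MFLOP/s) by FLOP/clk gives Mclk/s = MHz
implied_mhz = 112e6 / gpu_flop_per_clk
print(round(implied_mhz))                 # 1367
```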

Moving on to Ampere A100, the whitepaper states that the A100 TC unit delivers 256 FMA ops/clk, and considered at the SM level (four 3rd gen TC units/SM) this translates to 1024 FMA ops/clk, or 2048 FLOPs/clk, a doubling of the TC throughput for FP16 (non-sparsity) when comparing a Volta SM to an Ampere SM, clock-for-clock. Likewise we can confirm the stated 312 TFLOP/s number for A100 with 108 SMs in a similar fashion:

108 x 2048 = 221,184 FLOP/clk

and

312,000,000 MFLOP/s / 221,184 FLOP/clk = 1410M clk/s = 1410MHz

which is again in line with the stated/published boost clock for the A100 GPU.

Moving on to Hopper H100, the whitepaper simply states that the per SM throughput is again doubled compared to Ampere. So we are now at 4096 FLOP/clk per SM.

The H100 PCIe has 114 SMs, so we get, per GPU:

114 x 4096 = 466,944 FLOP/clk

The stated perf is 756 TFLOP/s, so

756,000,000 MFLOP/s / 466,944 FLOP/clk = 1620M clk/s = 1620MHz

The H100 PCIe board specification lists a max boost frequency of 1755MHz.

But, as pointed out below, Table 3 in the H100 whitepaper indicates that the max boost clock for TC usage on the H100 PCIe is 1620MHz. So this calculation lines up with the stated boost frequency.
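The same division can be wrapped in one small function and run across all three generations (SM counts and per-SM throughputs as quoted in this post; each result should land near the corresponding part's TC boost clock):

```python
def implied_clock_mhz(rated_tflops, sms, flop_per_clk_per_sm):
    """Clock rate (MHz) implied by a rated TFLOP/s figure."""
    # TFLOP/s expressed in MFLOP/s, divided by FLOP/clk, yields Mclk/s = MHz
    return rated_tflops * 1e6 / (sms * flop_per_clk_per_sm)

# (GPU, rated FP16 TC TFLOP/s, SMs, FLOP/clk per SM) from the posts above
for name, tflops, sms, fpc in [("V100 PCIe", 112, 80, 1024),
                               ("A100",      312, 108, 2048),
                               ("H100 PCIe", 756, 114, 4096)]:
    print(f"{name}: ~{implied_clock_mhz(tflops, sms, fpc):.0f} MHz")
```

The implied clocks (roughly 1367, 1411, and 1619 MHz) match the published boost clocks to within rounding.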

Thanks for your detailed reply. I think my mistake was not doubling the TC performance.

Other than that, the discrepancy can be attributed to different boost frequencies: the CUDA cores plus the FP64 Tensor Cores use 1755MHz, while the Tensor Cores other than FP64 use 1620MHz.

CUDA core ops per SM per clock, and theoretical performance based on the 1755MHz boost:
FP64: 64 => 64 * 114 * 1.755 * 2 / 1000 = 25.60896 TFLOPS
FP32: 128 => 128 * 114 * 1.755 * 2 / 1000 = 51.21792 TFLOPS
FP16: 256 => 256 * 114 * 1.755 * 2 / 1000 = 102.43584 TFLOPS

Tensor Core ops per SM per clock, and theoretical performance based on the 1620MHz boost (FP64 TC computed at 1755MHz):
FP64: 128 => 128 * 114 * 1.755 * 2 / 1000 = 51.21792 TFLOPS
TF32: 1024 => 1024 * 114 * 1.62 * 2 / 1000 = 378.22464 TFLOPS
FP16: 2048 => 2048 * 114 * 1.62 * 2 / 1000 = 756.44928 TFLOPS
FP8: 4096 => 4096 * 114 * 1.62 * 2 / 1000 = 1512.89856 TFLOPS
INT8: 4096 => 4096 * 114 * 1.62 * 2 / 1000 = 1512.89856 TOPS
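The whole table can be reproduced with one short Python helper (114 SMs, 2 FLOPs per FMA; the FP64 TC row uses the 1755MHz clock as in the arithmetic above, the other TC rows use 1620MHz):

```python
SMS = 114  # H100 PCIe

def peak_tflops(ops_per_sm_per_clk, ghz):
    # ops/clk per SM * SMs * clock (GHz) * 2 FLOP/FMA, scaled to TFLOP/s
    return ops_per_sm_per_clk * SMS * ghz * 2 / 1000

print(round(peak_tflops(256, 1.755), 5))   # CUDA-core FP16: 102.43584
print(round(peak_tflops(2048, 1.62), 5))   # TC FP16:        756.44928
print(round(peak_tflops(4096, 1.62), 5))   # TC FP8:         1512.89856
```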

These numbers finally agree with the published ones. I think all the discrepancies are due to the reduction of the boost frequency from 1755MHz to 1620MHz; probably someone forgot to update the CUDA core and FP64 TC performance with the lower clock.

Oops… I made another mistake. Table 3 of the H100 whitepaper explicitly states that the boost clock for all TCs and the FP64 CUDA cores is 1620MHz, whereas the rest run at 1755MHz. So the math in my previous post is all correct.

Different boost clocks for different core types seem to be a feature unique to the Hopper architecture. That's why the numbers seem off only for H100.