Theoretical TFLOPS for FP16, BF16 and TF32 for tensor and non-tensor

Wondering how the theoretical TFLOPS numbers are calculated for the lower precisions. In Table 3 of the blog post on the H100 ( https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ ), copied below, TFLOPS for various precisions are listed. For the A100, BF16 (non-tensor) is double FP32 (non-tensor). That makes sense, since 2 BF16 ops execute in place of 1 FP32 op. However, FP16 (non-tensor) is a further 2x higher - what is the reason for that? Also, TF32 (tensor) is 8x FP32 (non-tensor), and BF16 (tensor) is 8x BF16 (non-tensor).

|GPU Features|NVIDIA A100|NVIDIA H100 SXM5¹|NVIDIA H100 PCIe|
|---|---|---|---|
|Peak FP16 Tensor TFLOPS with FP16 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak FP16 Tensor TFLOPS with FP32 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak BF16 Tensor TFLOPS with FP32 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak TF32 Tensor TFLOPS|156/312²|500/1000²|400/800²|
|Peak FP64 Tensor TFLOPS|19.5|60|48|
|Peak INT8 Tensor TOPS|624/1248²|2000/4000²|1600/3200²|
|Peak FP16 TFLOPS (non-Tensor)|78|120|96|
|Peak BF16 TFLOPS (non-Tensor)|39|120|96|
|Peak FP32 TFLOPS (non-Tensor)|19.5|60|48|

¹ Preliminary H100 specifications, subject to change.
² Effective TFLOPS/TOPS using the Sparsity feature.

Have you had a chance to review the H100 whitepaper linked at the bottom of the blog?

I guess that is the only question you are asking.

The A100 device has a special FP16 (non-tensor) capability for certain use cases. In those cases, the FP16 (non-tensor) throughput can be 4x the FP32 throughput. That is pretty much all I can say. It’s documented (mentioned) in the A100 whitepaper (e.g. table 1).

It’s not well documented. I don’t have further information. You’re welcome to file a bug for a doc improvement, if you wish.
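As a sanity check against the published A100 SXM configuration (108 SMs, 64 FP32 CUDA cores per SM, ~1410 MHz boost clock): 108 × 64 × 2 FLOPs per FMA × 1.41 GHz ≈ 19.5 TFLOPS for FP32; the special FP16 path at 4x that is ≈ 78 TFLOPS, and BF16 at 2x is ≈ 39 TFLOPS, which is consistent with the table above.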

@Robert_Crovella, thanks for the FP16 explanation. How is the 8x for tensor math arrived at? Is it because 4 elements of C = Ax+B can be computed in 1 clock cycle?

Is there any simple code available to verify these numbers?

8x for tensor math (compared to non-tensor math) is simply a function of the design of the SM, and the ratio of tensor compute units to non-tensor compute units, coupled with the throughput of each.
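To put rough numbers on that ratio for A100 (using the per-SM rates in the A100 whitepaper): the four tensor cores in each SM together deliver 1024 FP16/BF16 FMA per clock and 512 TF32 FMA per clock, while the CUDA cores deliver 64 FP32 FMA per clock and 128 BF16 FMA per clock. That gives 512/64 = 8x for TF32 tensor vs FP32 non-tensor and 1024/128 = 8x for BF16 tensor vs BF16 non-tensor, and the absolute figure follows as 108 SMs × 1024 FMA/clock × 2 FLOPs per FMA × 1.41 GHz ≈ 312 TFLOPS.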

I don’t have a repository of codes to point you to for verification. For tensorcore (TC) ops/math, if I needed to construct a verification of TF32, BF16, FP16, or INT8, I would use the cublas GEMM functions to do that.
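For example, a minimal sketch along the lines below (the matrix size, iteration count, and FP16-in/FP16-accumulate compute type are illustrative assumptions, not a calibrated benchmark) should get reasonably close to the tensor-core numbers:

```cpp
// Sketch: measure achieved FP16 tensor-core TFLOPS with a large cuBLAS GEMM.
// Assumes CUDA 11+ (cublasComputeType_t) and enough device memory for three N x N FP16 matrices.
// Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int N = 8192;                     // large enough to approach peak throughput
    const int iters = 20;
    size_t bytes = (size_t)N * N * sizeof(__half);

    __half *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);                // contents don't matter for a throughput test
    cudaMemset(B, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // warm-up call (also lets cuBLAS pick its kernels)
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                 &beta,  C, CUDA_R_16F, N,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta,  C, CUDA_R_16F, N,
                     CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N * (double)N * N * iters;   // 2*N^3 FLOPs per GEMM
    printf("achieved ~%.1f TFLOPS\n", flops / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Compile with `nvcc -lcublas`. Swapping the data and compute types (e.g. CUDA_R_16BF with CUBLAS_COMPUTE_32F, or CUDA_R_32F with CUBLAS_COMPUTE_32F_FAST_TF32) covers the BF16 and TF32 cases; alpha and beta then need to be float to match the compute type.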

TF32 (at least) doesn’t exist in the non-tensorcore space. For math available in the non-tensorcore space, it’s probably more difficult. Prior to TC, I would have used cublas. With the advent of TC, verifying e.g. non-TC FP16 is a bit more challenging. I guess if hard pressed I would probably try something like this. I’m not suggesting that is an exact ready-to-go example (it isn’t: it benchmarks something else), but that is the general methodology or roadmap I would try.
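As a rough illustration of that methodology (a sketch only; the grid shape, iteration count, and operand values below are arbitrary assumptions, and the achieved number will typically land somewhat below the datasheet peak), a kernel built around the __hfma2 intrinsic issues plain FP16x2 FMA instructions rather than tensor-core MMA instructions:

```cpp
// Sketch: estimate non-tensor FP16 throughput via a long chain of __hfma2 ops.
// Each __hfma2 performs 2 half-precision FMAs, i.e. 4 FLOPs. Error checking omitted.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void hfma2_bench(__half2 *out, int iters) {
    __half2 a  = __float2half2_rn(1.0f);
    __half2 b  = __float2half2_rn(threadIdx.x * 1e-7f);  // runtime value; keeps accumulators finite
    __half2 c0 = __float2half2_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    #pragma unroll 16                                    // reduce loop overhead
    for (int i = 0; i < iters; ++i) {
        // four independent accumulators give the scheduler some instruction-level parallelism
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    // store the result so the compiler cannot eliminate the arithmetic
    out[blockIdx.x * blockDim.x + threadIdx.x] = __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}

int main() {
    const int blocks = 2048, threads = 256, iters = 100000;
    __half2 *out;
    cudaMalloc(&out, (size_t)blocks * threads * sizeof(__half2));

    hfma2_bench<<<blocks, threads>>>(out, iters);        // warm-up
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    hfma2_bench<<<blocks, threads>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 4 __hfma2 per iteration per thread, each = 2 FMAs = 4 FLOPs
    double flops = (double)blocks * threads * iters * 4.0 * 4.0;
    printf("non-tensor FP16: ~%.1f TFLOPS\n", flops / (ms * 1e-3) / 1e12);

    cudaFree(out);
    return 0;
}
```

Build with something like `nvcc -arch=sm_80` for A100 (__hfma2 needs cc 5.3 or higher). The same structure with float operands and fmaf gives a non-tensor FP32 comparison point.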