Theoretical TFLOPS for FP16, BF16 and TF32 for tensor and non-tensor

Wondering how the theoretical TFLOPS numbers are calculated for the lower precisions. In Table 3 of the blog post on the H100 ( https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ ), copied below, TFLOPS for various precisions are listed. For the A100, BF16 (non-tensor) is double FP32 (non-tensor). That makes sense, since 2 BF16 ops execute in place of 1 FP32 op. However, FP16 (non-tensor) is a further 2x higher - what is the reason for that? Also, TF32 (tensor) is 8x FP32 (non-tensor), and BF16 (tensor) is 8x BF16 (non-tensor).

|GPU Features|NVIDIA A100|NVIDIA H100 SXM5¹|NVIDIA H100 PCIe|
|---|---|---|---|
|Peak FP16 Tensor TFLOPS with FP16 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak FP16 Tensor TFLOPS with FP32 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak BF16 Tensor TFLOPS with FP32 Accumulate|312/624²|1000/2000²|800/1600²|
|Peak TF32 Tensor TFLOPS|156/312²|500/1000²|400/800²|
|Peak FP64 Tensor TFLOPS|19.5|60|48|
|Peak INT8 Tensor TOPS|624/1248²|2000/4000²|1600/3200²|
|Peak FP16 TFLOPS (non-Tensor)|78|120|96|
|Peak BF16 TFLOPS (non-Tensor)|39|120|96|
|Peak FP32 TFLOPS (non-Tensor)|19.5|60|48|

¹ Preliminary H100 specifications, subject to change.
² Effective TFLOPS/TOPS using the Sparsity feature.

Have you had a chance to review the H100 whitepaper linked at the bottom of the blog?

I guess that is the only question you are asking.

The A100 device has a special FP16 (non-tensor) capability for certain use cases. In those cases, the FP16 (non-tensor) throughput can be 4x the FP32 throughput. That is pretty much all I can say. It’s documented (mentioned) in the A100 whitepaper (e.g. table 1).

It’s not well documented. I don’t have further information. You’re welcome to file a bug for a doc improvement, if you wish.
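As a sanity check against the published A100 SXM configuration (108 SMs, 64 FP32 CUDA cores per SM, ~1410 MHz boost clock): 108 × 64 × 2 FLOPs per FMA × 1.41 GHz ≈ 19.5 TFLOPS for FP32; the special FP16 path at 4x that is ≈ 78 TFLOPS, and BF16 at 2x is ≈ 39 TFLOPS, which is consistent with the table above.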

@Robert_Crovella, thanks for the FP16 explanation. How is the 8x for tensor math arrived at? Is it because 4 elements of C = Ax+B can be computed in 1 clock cycle?

Is there any simple code available to verify these numbers?

8x for tensor math (compared to non-tensor math) is simply a function of the design of the SM, and the ratio of tensor compute units to non-tensor compute units, coupled with the throughput of each.
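To put rough numbers on that ratio for A100 (using the per-SM rates in the A100 whitepaper): the four tensor cores in each SM together deliver 1024 FP16/BF16 FMA per clock and 512 TF32 FMA per clock, while the CUDA cores deliver 64 FP32 FMA per clock and 128 BF16 FMA per clock. That gives 512/64 = 8x for TF32 tensor vs FP32 non-tensor and 1024/128 = 8x for BF16 tensor vs BF16 non-tensor, and the absolute figure follows as 108 SMs × 1024 FMA/clock × 2 FLOPs per FMA × 1.41 GHz ≈ 312 TFLOPS.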

I don’t have a repository of codes to point you to for verification. For tensorcore (TC) ops/math, if I needed to construct a verification of TF32, BF16, FP16, or INT8, I would use the cublas GEMM functions to do that.
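For example, a minimal sketch along the lines below (the matrix size, iteration count, and FP16-in/FP16-accumulate compute type are illustrative assumptions, not a calibrated benchmark) should get reasonably close to the tensor-core numbers:

```cpp
// Sketch: measure achieved FP16 tensor-core TFLOPS with a large cuBLAS GEMM.
// Assumes CUDA 11+ (cublasComputeType_t) and enough device memory for three N x N FP16 matrices.
// Error checking omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int N = 8192;                     // large enough to approach peak throughput
    const int iters = 20;
    size_t bytes = (size_t)N * N * sizeof(__half);

    __half *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    cudaMemset(A, 0, bytes);                // contents don't matter for a throughput test
    cudaMemset(B, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // warm-up call (also lets cuBLAS pick its kernels)
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                 &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                 &beta,  C, CUDA_R_16F, N,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                     &alpha, A, CUDA_R_16F, N, B, CUDA_R_16F, N,
                     &beta,  C, CUDA_R_16F, N,
                     CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N * (double)N * N * iters;   // 2*N^3 FLOPs per GEMM
    printf("achieved ~%.1f TFLOPS\n", flops / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Compile with `nvcc -lcublas`. Swapping the data and compute types (e.g. CUDA_R_16BF with CUBLAS_COMPUTE_32F, or CUDA_R_32F with CUBLAS_COMPUTE_32F_FAST_TF32) covers the BF16 and TF32 cases; alpha and beta then need to be float to match the compute type.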

TF32 (at least) doesn’t exist in the non-tensorcore space. For math available in the non-tensorcore space, it’s probably more difficult. Prior to TC, I would have used cublas. With the advent of TC, verifying e.g. non-TC FP16 is a bit more challenging. I guess if hard pressed I would probably try something like this. I’m not suggesting that is an exact ready-to-go example (it isn’t: it benchmarks something else), but that is the general methodology or roadmap I would try.
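As a rough illustration of that methodology (a sketch only; the grid shape, iteration count, and operand values below are arbitrary assumptions, and the achieved number will typically land somewhat below the datasheet peak), a kernel built around the __hfma2 intrinsic issues plain FP16x2 FMA instructions rather than tensor-core MMA instructions:

```cpp
// Sketch: estimate non-tensor FP16 throughput via a long chain of __hfma2 ops.
// Each __hfma2 performs 2 half-precision FMAs, i.e. 4 FLOPs. Error checking omitted.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void hfma2_bench(__half2 *out, int iters) {
    __half2 a  = __float2half2_rn(1.0f);
    __half2 b  = __float2half2_rn(threadIdx.x * 1e-7f);  // runtime value; keeps accumulators finite
    __half2 c0 = __float2half2_rn(0.0f), c1 = c0, c2 = c0, c3 = c0;
    #pragma unroll 16                                    // reduce loop overhead
    for (int i = 0; i < iters; ++i) {
        // four independent accumulators give the scheduler some instruction-level parallelism
        c0 = __hfma2(a, c0, b);
        c1 = __hfma2(a, c1, b);
        c2 = __hfma2(a, c2, b);
        c3 = __hfma2(a, c3, b);
    }
    // store the result so the compiler cannot eliminate the arithmetic
    out[blockIdx.x * blockDim.x + threadIdx.x] = __hadd2(__hadd2(c0, c1), __hadd2(c2, c3));
}

int main() {
    const int blocks = 2048, threads = 256, iters = 100000;
    __half2 *out;
    cudaMalloc(&out, (size_t)blocks * threads * sizeof(__half2));

    hfma2_bench<<<blocks, threads>>>(out, iters);        // warm-up
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    hfma2_bench<<<blocks, threads>>>(out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // 4 __hfma2 per iteration per thread, each = 2 FMAs = 4 FLOPs
    double flops = (double)blocks * threads * iters * 4.0 * 4.0;
    printf("non-tensor FP16: ~%.1f TFLOPS\n", flops / (ms * 1e-3) / 1e12);

    cudaFree(out);
    return 0;
}
```

Build with something like `nvcc -arch=sm_80` for A100 (__hfma2 needs cc 5.3 or higher). The same structure with float operands and fmaf gives a non-tensor FP32 comparison point.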