Theoretical TFLOPS for FP16, BF16 and TF32 for tensor and non-tensor

I guess that is the only question you are asking.

The A100 device has a special FP16 (non-tensor) capability for certain use cases. In those cases, the FP16 (non-tensor) throughput can be 4x the FP32 throughput. That is pretty much all I can say. It’s documented (mentioned) in the A100 whitepaper (e.g. table 1).

It’s not well documented. I don’t have further information. You’re welcome to file a bug for a doc improvement, if you wish.