NVIDIA’s specification says that an A100’s tensor cores have a peak performance of 312 TFLOPs. (I know… the usual disclaimers… that’s best-case and not achievable for real applications.)
What I am wondering is: are we defining 1 TMAC as 2 TFLOPs? Or, when NVIDIA says TFLOP, do they mean TMAC?
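All peak-throughput calculations of this type, for discrete CUDA GPUs or the CUDA GPU component of a SoC, count the multiplication and addition operations as separate floating-point ops, so one multiply-accumulate (MAC) counts as 2 FLOPs.
A number like 312 TFLOPs is counting 156 for addition ops and 156 for multiplication ops, i.e. 156 TMACs. So when NVIDIA says TFLOP, they do not mean TMAC.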
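To make the arithmetic concrete, here is a minimal Python sketch of that counting convention. The helper names (`tflops_to_tmacs`, `gemm_flops`, `min_seconds_at_peak`) and the 4096-cubed matrix size are made up for illustration; only the 312 figure and the "1 MAC = 2 FLOPs" rule come from the thread. The time it prints is just a lower bound at the quoted peak, not something a real kernel should be expected to hit.

```python
# Convention from the answer above: 1 MAC = 1 multiply + 1 add = 2 FLOPs.
FLOPS_PER_MAC = 2

# Peak figure quoted in the question (tensor cores, best case).
A100_PEAK_TFLOPS = 312.0


def tflops_to_tmacs(tflops):
    """Convert a TFLOP figure to the equivalent TMAC figure."""
    return tflops / FLOPS_PER_MAC


def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n).

    Each of the m*n outputs needs k multiply-accumulates.
    """
    return FLOPS_PER_MAC * m * n * k


def min_seconds_at_peak(flops, peak_tflops):
    """Lower bound on run time if the peak rate were fully sustained."""
    return flops / (peak_tflops * 1e12)


if __name__ == "__main__":
    print(tflops_to_tmacs(A100_PEAK_TFLOPS))  # 156.0

    # Hypothetical example size, chosen only for illustration.
    work = gemm_flops(4096, 4096, 4096)
    print(work, "FLOPs")
    print(min_seconds_at_peak(work, A100_PEAK_TFLOPS), "s at peak")
```

If you count a GEMM as `m * n * k` MACs instead, compare against 156, not 312, or you will be off by a factor of two.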
P.S. It’s great to see that you’re still on the forums, Robert. I think you answered the first of my questions (on StackOverflow) over 10 years ago now.