NVIDIA’s specification says that an A100’s tensor cores have a peak performance of 312 TFLOPs (the dense FP16/BF16 figure). (I know… the usual disclaimers… that’s best-case and not achievable for real applications.)
What I am wondering is: are we defining 1 TMAC as 2 TFLOPs? Or, when NVIDIA says TFLOP, do they mean TMAC?
All calculations of this type for discrete CUDA GPUs or the CUDA GPU component of a SoC count the multiplication and addition operations as separate floating-point ops.
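A number like 312 TFLOPs counts 156 trillion additions and 156 trillion multiplications per second. In other words, 1 TMAC is counted as 2 TFLOPs, so 312 TFLOPs corresponds to 156 TMAC/s.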
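To make the counting concrete, here is a rough sketch of how the headline figure can be reproduced when each fused multiply-add is counted as two floating-point ops. The hardware figures (108 SMs, 4 tensor cores per SM, 256 dense FP16 FMAs per tensor core per clock, 1410 MHz boost clock) are assumptions taken from NVIDIA's public A100 material, not from this thread, and the constant and function names are made up for illustration.

```python
# Sketch: reproduce the A100 peak tensor-core number from per-clock FMA counts.
# All hardware figures are assumptions from NVIDIA's public A100 material.

SM_COUNT = 108                        # streaming multiprocessors on A100
TENSOR_CORES_PER_SM = 4               # third-generation tensor cores per SM
FMA_PER_TENSOR_CORE_PER_CLOCK = 256   # dense FP16 fused multiply-adds per clock
BOOST_CLOCK_HZ = 1.41e9               # 1410 MHz boost clock
FLOPS_PER_FMA = 2                     # one multiply + one add


def peak_tmacs() -> float:
    """Peak multiply-accumulates per second, in units of 1e12 (TMAC/s)."""
    macs_per_clock = SM_COUNT * TENSOR_CORES_PER_SM * FMA_PER_TENSOR_CORE_PER_CLOCK
    return macs_per_clock * BOOST_CLOCK_HZ / 1e12


def peak_tflops() -> float:
    """Peak floating-point ops per second, counting each MAC as two ops."""
    return peak_tmacs() * FLOPS_PER_FMA


if __name__ == "__main__":
    print(f"peak MACs : {peak_tmacs():.1f} TMAC/s")
    print(f"peak FLOPs: {peak_tflops():.1f} TFLOPs")
```

Under those assumptions the MAC rate comes out just under 156 TMAC/s and the FLOP rate just under 312 TFLOPs, which matches the interpretation that the published number counts multiplies and adds separately.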
P.S. It’s great to see that you’re still on the forums, Robert. I think you answered the first of my questions (on StackOverflow) over 10 years ago now.