How to measure Tensor FLOPs?

Lots of further links:

I once had a similar issue, but never found a solution:


And this question is also related:


There was an issue (possibly fixed since) with the Tensor Core rooflines stored for different GPU architectures; basically, the rooflines were only correct for GV100 (Volta) GPUs:

or


Those detailed counters could help you calculate exact FLOPs, but only if you know some details of your instructions; for fully unknown code, the counters alone are not enough to deduce FLOPs:

or

or
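
To illustrate why the instruction details matter, here is a minimal sketch of the arithmetic, not a profiler recipe. It assumes you already have a count of executed warp-level matrix-multiply-accumulate instructions from some counter, and that you know the m x n x k shape each counted instruction computes. The function names `tensor_flops` and `achieved_tflops` are hypothetical, chosen only for this example. One m x n x k MMA performs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs.

```python
def tensor_flops(inst_count: int, m: int, n: int, k: int) -> int:
    """FLOPs for `inst_count` warp-level MMA instructions of shape m x n x k.

    Assumption: every counted instruction has the same shape, and the shape
    is the one of the instruction actually counted by the profiler.
    """
    if min(inst_count, m, n, k) < 0:
        raise ValueError("counts and dimensions must be non-negative")
    # Each multiply-add counts as 2 floating-point operations.
    return inst_count * 2 * m * n * k


def achieved_tflops(flops: float, seconds: float) -> float:
    """Convert a FLOP total and a kernel duration into TFLOP/s."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical numbers: one million instructions of shape 16 x 8 x 16.
    total = tensor_flops(1_000_000, 16, 8, 16)
    print(total, "FLOPs")
    print(achieved_tflops(total, 1e-3), "TFLOP/s for a 1 ms kernel")
```

Note that the shape written in the source code need not be the shape of the counted hardware instruction: one source-level MMA may be lowered to several machine instructions depending on the architecture and data type. Without that mapping, the count alone does not give you FLOPs.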