I am using an H100 GPU with 80 GB of memory. My matrix-matrix multiplication kernel uses HGMMA instructions, as shown in the Instruction Statistics section of Nsight Compute.

Is there a way to get the total floating-point operation count from a counter, e.g. sm__sass_inst_executed_op_shared_gmma.sum [inst]?

For Tensor Cores the metrics are listed below. Use .sum for the total count, .sum.per_second for FLOP/s, and .sum.peak_sustained for the maximum rate given the measured clock frequency. The metric names follow a hierarchy based upon src_dst[sparsity{on,off}].
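A small sketch of how those three rollups relate to one another. All numbers here are hypothetical placeholders standing in for values you would read out of an Nsight Compute report; the arithmetic is the point, not the figures:

```python
# Hypothetical values read from an Nsight Compute report (illustrative only).
ops_total = 4.0e12        # <metric>.sum: total tensor-core ops for the kernel
duration_s = 0.01         # kernel duration in seconds

# .sum.per_second is the total divided by the kernel duration.
ops_per_second = ops_total / duration_s

# .sum.peak_sustained is a per-cycle rate; multiply by the measured clock
# to get the peak rate in ops/s (both values hypothetical here).
peak_per_cycle = 500_000
clock_hz = 1.6e9
peak_per_second = peak_per_cycle * clock_hz

# Achieved fraction of peak at the measured clock.
utilization = ops_per_second / peak_per_second
print(f"{ops_per_second:.3e} ops/s, {utilization:.1%} of peak")
```

This is the same calculation Nsight Compute performs internally when it reports a percentage-of-peak figure for a throughput metric.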

For example, if the kernel is performing bf16 HGMMAs, then the metrics would be sm__ops_path_tensor_src_bf16_dst_fp32.sum and its sparsity variants, sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on.sum and sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off.sum.

Thank you for the detailed answer. So these counters give the total floating-point operations for HGMMA on the Tensor Cores. Can some floating-point operations for HGMMA also happen on other cores (e.g., CUDA cores)? If so, how do I calculate the total floating-point operations on those cores?