Total FLOPS for the HGMMA instruction

I am using an H100 GPU with 80 GB of memory. My matrix-matrix multiplication kernel is executed with the HGMMA instruction, as shown in the Instruction Statistics section of Nsight Compute.

Is there a way to get the total number of floating-point operations from a counter, e.g. sm__sass_inst_executed_op_shared_gmma.sum [inst]?

For Tensor Cores the metrics are listed below. For the total operation count use .sum, and for FLOPS use .sum.per_second. To get the maximum rate given the measured clock frequency use .sum.peak_sustained. The metric names follow a hierarchy based upon src_dst[sparsity{on,off}].

For example, if the kernel is performing bf16 HGMMAs then the metrics would be

sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32.sum
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32.sum.per_second

or, if you don’t know whether the kernel is using HMMA vs. HGMMA, then

sm__ops_path_tensor_src_bf16_dst_fp32.sum
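
To relate these counter values to a FLOP count, here is a minimal Python sketch. The matrix dimensions and measured values are placeholders, and it assumes the counter reports multiplies and adds as separate operations; if it reports multiply-accumulate pairs instead, scale accordingly.

# Sanity check: compare the measured tensor-op counter against the
# analytic GEMM FLOP count. All numeric values are placeholders.

M, N, K = 4096, 4096, 4096  # hypothetical GEMM problem size

# Analytic FLOP count for C = A @ B: one multiply + one add per MAC.
expected_flops = 2 * M * N * K

# Placeholder readings from Nsight Compute:
measured_ops = 1.37e11  # sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32.sum
duration_s = 2.5e-4     # gpu__time_duration.sum, converted to seconds

print(f"expected FLOPs: {expected_flops:.3e}")
print(f"measured ops:   {measured_ops:.3e}")
print(f"achieved rate:  {measured_ops / duration_s / 1e12:.2f} TFLOP/s")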

TENSOR OP METRICS FOR GH100

sm__ops_path_tensor_op_bgmma_src_int1
sm__ops_path_tensor_op_bmma_src_int1
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_off
sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32_sparsity_on
sm__ops_path_tensor_op_hgmma_src_fp16
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_off
sm__ops_path_tensor_op_hgmma_src_fp16_sparsity_on
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_off
sm__ops_path_tensor_op_hgmma_src_tf32_dst_fp32_sparsity_on
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_off
sm__ops_path_tensor_op_hmma_src_bf16_dst_fp32_sparsity_on
sm__ops_path_tensor_op_hmma_src_fp16
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_off
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp16_sparsity_on
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_off
sm__ops_path_tensor_op_hmma_src_fp16_dst_fp32_sparsity_on
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_off
sm__ops_path_tensor_op_hmma_src_tf32_dst_fp32_sparsity_on
sm__ops_path_tensor_op_igmma_src_int8
sm__ops_path_tensor_op_igmma_src_int8_sparsity_off
sm__ops_path_tensor_op_igmma_src_int8_sparsity_on
sm__ops_path_tensor_op_imma_src_int8
sm__ops_path_tensor_op_imma_src_int8_sparsity_off
sm__ops_path_tensor_op_imma_src_int8_sparsity_on
sm__ops_path_tensor_src_bf16_dst_fp32
sm__ops_path_tensor_src_fp16
sm__ops_path_tensor_src_fp64
sm__ops_path_tensor_src_fp8
sm__ops_path_tensor_src_fp8_sparsity_off
sm__ops_path_tensor_src_fp8_sparsity_on
sm__ops_path_tensor_src_int1
sm__ops_path_tensor_src_int8
sm__ops_path_tensor_src_tf32_dst_fp32
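
Several of these counters can be collected in a single profiling pass by passing them as a comma-separated list to the ncu command line. A minimal Python sketch is below; the application path and the metric selection are placeholders.

# Sketch: collect a few tensor-op counters in one Nsight Compute pass.
# "./my_gemm_app" is a placeholder application path.
import subprocess

metrics = [
    "sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32.sum",
    "sm__ops_path_tensor_op_hgmma_src_bf16_dst_fp32.sum.per_second",
    "sm__ops_path_tensor_src_bf16_dst_fp32.sum",
]

# ncu accepts a comma-separated metric list via --metrics.
subprocess.run(["ncu", "--metrics", ",".join(metrics), "./my_gemm_app"],
               check=True)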

Thank you for the detailed answer. So these counters give the total floating-point operations for HGMMA on the Tensor Cores. Can some floating-point operations for HGMMA happen on other cores (e.g. CUDA cores)? If yes, how do I calculate the total floating-point operations on those cores?

No. The HGMMA and HMMA instructions use the MMA pipe (Tensor Cores). Instructions such as FFMA/FADD/FMUL use the FMA* pipes (CUDA Cores).
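
If the kernel also contains FFMA/FADD/FMUL work on the CUDA cores (separate from HGMMA itself), those operations can be counted with the SASS thread-instruction counters. The sketch below assumes the standard sm__sass_thread_inst_executed_op_*_pred_on metrics are available in your Nsight Compute version; the measured values are placeholders.

# Sketch: FP32 FLOPs executed on the CUDA cores (FMA pipes), computed from
# the SASS thread-instruction counters. An FFMA counts as 2 FLOPs.
# The values below are placeholders.
ffma = 1.0e9   # sm__sass_thread_inst_executed_op_ffma_pred_on.sum
fadd = 2.0e8   # sm__sass_thread_inst_executed_op_fadd_pred_on.sum
fmul = 1.5e8   # sm__sass_thread_inst_executed_op_fmul_pred_on.sum

fp32_flops_cuda_cores = 2 * ffma + fadd + fmul
print(f"FP32 FLOPs on CUDA cores: {fp32_flops_cuda_cores:.3e}")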
