Confusion about the (d/f/h)(mul/add/fma) counts in Nsight Compute

Hi everyone, I've run into a confusing phenomenon while profiling a self-designed (fp16) kernel.

  1. I find that the smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed metrics are all zero, while other metrics such as sm__cycles_active.avg are non-zero. Is this normal?
  2. I then replaced the kernel with the naive PyTorch implementation to check, where the launched kernel is ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn. The report shows that smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed is non-zero, which means it is classified as a single-precision floating-point operation and conflicts with the kernel's fp16 implementation.

The questions are:

  1. How can I estimate the FLOPS of a kernel robustly?
  2. Do the metrics match the data type of the implementation? If not, how can we estimate FLOPS?

The command I used is:

ncu --sampling-interval 0 \
--replay-mode kernel  \
--target-processes all \
-f --set roofline \
--range-filter :[1]: \
-k regex:gemm -o test1 \
python ncu_subprocess.py 

Any response is appreciated! Thanks!

smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed won’t include every single instruction, so it’s possible that these are all zero while other work is being done. You should be able to look at the instruction mix table in the Instruction Statistics section to see what was done. For the second part, there can still be fmul instructions in an fp16 kernel. They may not be operating on the actual data in the matrices. You can also look at the sm__ops_path_tensor_src_fp16_dst_fp16 metrics to see if you get counts there for your fp16 kernel.

Because the instructions can get very complicated (for example, a single MMA instruction performing a full matrix tile multiplication), there isn't a straightforward way to calculate accurate FLOPs just from these metric counters. You can estimate it for your kernel based on your instruction mix if you account for the full opcode modifiers to work out the matrix sizes and so on, but that can get very complicated.
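
For illustration only, a rough per-opcode weighting might look like the sketch below. The shapes, weights, and instruction counts are assumptions that would need to be checked against the actual SASS and counter definitions for your GPU; nothing here is taken from a real report.

# Hedged sketch: estimate FLOPs from a warp-level instruction mix by weighting each
# opcode with the FLOPs implied by its shape modifier. All shapes, weights, and counts
# below are illustrative assumptions, not values from a real profile.
def mma_flops(m, n, k):
    # One warp-level MMA of shape MxNxK performs m*n*k multiplies and m*n*k adds.
    return 2 * m * n * k

warp_inst_counts = {
    "HMMA.16816.F16": 1_000_000,  # placeholder warp-level instruction count
    "HFMA2": 50_000,              # placeholder warp-level instruction count
}

flops_per_inst = {
    "HMMA.16816.F16": mma_flops(16, 8, 16),  # m16n8k16 tile per warp-level instruction
    "HFMA2": 4 * 32,                          # 2 FMAs (4 FLOPs) per thread, assuming a full 32-thread warp
}

total_flops = sum(count * flops_per_inst[op] for op, count in warp_inst_counts.items())
print(f"Estimated FLOPs: {total_flops:.3e}")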

Thank you for the reply!

I now understand why the fp16 kernel invokes fmul, from looking into the instructions. However, if smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed is not an accurate proxy for FLOPS, the officially provided roofline sections can fail under the conditions I described.

Hi, @pannenets.f

Just checking: what do you mean by "can fail under the conditions I met"?
Please let us know if you still have other questions and we'll try our best to help.

Please submit a new topic if you have more questions. We'll do our best to help. Thanks!

Nsight Compute does not have a single FLOPs counter that rolls up all data types.

smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum

NOTE: Do not use .per_cycle_elapsed if you want FLOPS. Use .per_second.

  • hadd counts HADD2 thread instructions with the predicate on
  • hmul counts HMUL2 thread instructions with the predicate on
  • hfma counts HFMA2 thread instructions with the predicate on

These metrics can be weighted (x1 for HADD2/HMUL2 and x2 for HFMA2) to create an FP16 FLOPs count. You can see the weighting in /sections/SpeedOfLight_Hierarchical{Half,Single,Double}RooflineChart.section
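
As a minimal sketch of combining these weights (the three input values are placeholders you would read out of your own ncu report or CSV export of the .per_second variants):

# Hedged sketch: combine the per-second SASS metrics into a non-Tensor FP16 FLOPS estimate.
# The three values below are placeholders; read them from your own ncu report or CSV export.
hadd = 1.0e9   # smsp__sass_thread_inst_executed_op_hadd_pred_on.sum.per_second
hmul = 2.0e9   # smsp__sass_thread_inst_executed_op_hmul_pred_on.sum.per_second
hfma = 5.0e9   # smsp__sass_thread_inst_executed_op_hfma_pred_on.sum.per_second

# Weighting from the roofline .section files: x1 for HADD2/HMUL2, x2 for HFMA2.
fp16_flops_per_second = hadd + hmul + 2 * hfma
print(f"Estimated FP16 FLOPS (excluding Tensor Core ops): {fp16_flops_per_second:.3e}")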

The smsp__sass metrics do not count Tensor instructions (HMMA, IMMA, BMMA, DMMA, etc.). Tensor OPs can be counted using the following metrics (on GA100):

sm__ops_path_tensor_src_bf16_dst_fp32
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_off
sm__ops_path_tensor_src_bf16_dst_fp32_sparsity_on
sm__ops_path_tensor_src_fp16_dst_fp16
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_off
sm__ops_path_tensor_src_fp16_dst_fp16_sparsity_on
sm__ops_path_tensor_src_fp16_dst_fp32
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_off
sm__ops_path_tensor_src_fp16_dst_fp32_sparsity_on
sm__ops_path_tensor_src_fp64
sm__ops_path_tensor_src_int1
sm__ops_path_tensor_src_int4
sm__ops_path_tensor_src_int4_sparsity_off
sm__ops_path_tensor_src_int4_sparsity_on
sm__ops_path_tensor_src_int8
sm__ops_path_tensor_src_int8_sparsity_off
sm__ops_path_tensor_src_int8_sparsity_on
sm__ops_path_tensor_src_tf32_dst_fp32
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_off
sm__ops_path_tensor_src_tf32_dst_fp32_sparsity_on

As of 01/2024 the Tensor Roofline is only correct for GV100. It is recommended that you use the above metrics to calculate Tensor OPs.
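
If it helps, the command already used in this thread can be adapted to collect these directly, for example as below. The .sum and .sum.per_second roll-ups are assumptions; check the exact names available on your GPU with ncu --query-metrics.

ncu --target-processes all \
    -k regex:gemm \
    --metrics sm__ops_path_tensor_src_fp16_dst_fp16.sum,sm__ops_path_tensor_src_fp16_dst_fp16.sum.per_second \
    python ncu_subprocess.py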

It is very likely the kernel uses some FP32 instructions. If you collect the Tensor metrics listed above, or review the Instruction Statistics and look for HMMA (FP Matrix Multiply and Accumulate) instructions, you will likely find that the kernel heavily uses the Tensor Cores and only has a few FP32 instructions.