Confusion about the (d/f/h)(mul/add/fma) counts in Nsight Compute

Hi everyone, I ran into a confusing phenomenon while profiling a self-written fp16 kernel.

  1. I find that smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed are all zero, while other metrics such as sm__cycles_active.avg are non-zero. Is this normal?
  2. I then replaced the kernel with a naive PyTorch implementation to check, where the invoked kernel is ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn. The report shows that smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed is non-zero, which suggests the work is being classified as single-precision floating-point operations and conflicts with the kernel's fp16 implementation.

My questions are:

  1. How can the FLOPS of a kernel be estimated robustly?
  2. Do these metrics match the data type of the implementation? If not, how can we estimate FLOPS?

The command used is:

ncu --sampling-interval 0 \
--replay-mode kernel  \
--target-processes all \
-f --set roofline \
--range-filter :[1]: \
-k regex:gemm -o test1 \
python ncu_subprocess.py 

Any response is appreciated! Thanks!

smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed won’t include every single instruction, so it’s possible that these are all zero while other work is being done. You should be able to look at the instruction mix table in the Instruction Statistics section to see what was done. For the second part, there can still be fmul instructions in an fp16 kernel. They may not be operating on the actual data in the matrices. You can also look at the sm__ops_path_tensor_src_fp16_dst_fp16 metrics to see if you get counts there for your fp16 kernel.
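As a concrete (hedged) illustration of that suggestion, the tensor-path op counters can be collected alongside the classic per-thread counters with --metrics; the exact metric names can vary by architecture, and ncu_subprocess.py is the script from the original command:

```shell
# Collect tensor-pipe op counts next to the per-thread hfma counter.
# For an fp16 HMMA GEMM, the tensor-path metric should be non-zero
# even when the (h/d/f)(mul/add/fma) counters are near zero.
ncu --metrics \
sm__ops_path_tensor_src_fp16_dst_fp16.sum,\
smsp__sass_thread_inst_executed_op_hfma_pred_on.sum \
-k regex:gemm python ncu_subprocess.py
```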

Because the instructions can get very complicated (like single mma instructions doing a full matrix multiplication), there isn’t a straightforward way to calculate accurate FLOPS just from these metric counters. You can estimate it for your kernel based on your instruction mix, if you include the full opcode modifiers to compute the matrix sizes, etc., but that can get very complicated.
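As a rough sketch of that kind of estimate (not an official Nsight Compute formula): an Ampere HMMA instruction of shape m16n8k16, which the s16816gemm kernel name suggests, performs one 16x8x16 matrix-multiply-accumulate per warp, i.e. 2*16*8*16 FLOPs. The instruction counts below are made-up placeholders, not a real profile:

```python
# FLOPs per warp-level MMA instruction: 2 * M * N * K for shape MxNxK.
# The opcode strings and shapes here are assumptions for illustration.
MMA_FLOPS = {
    "HMMA.16816.F16": 2 * 16 * 8 * 16,  # m16n8k16, fp16 accumulate
    "HMMA.1688.F32":  2 * 16 * 8 * 8,   # m16n8k8, fp32 accumulate
}

def estimate_flops(inst_counts):
    """inst_counts: {opcode: warp-level executed count}, e.g. taken
    from the instruction mix in the Instruction Statistics section.
    Non-MMA opcodes contribute zero here."""
    return sum(MMA_FLOPS.get(op, 0) * n for op, n in inst_counts.items())

# Hypothetical mix: 1000 HMMA.16816.F16 warp-level instructions.
print(estimate_flops({"HMMA.16816.F16": 1000}))  # 4096000
```

In practice you would also have to account for predication and for the remaining non-tensor math, which is why this gets complicated quickly.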

Thank you for the reply!

By looking into the instructions, I now understand why the fp16 kernel issues fmul. But if smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed is not an accurate proxy for FLOPS, the officially provided roofline sections can fail under the conditions I encountered.