Hi everyone, I ran into a confusing phenomenon while profiling a self-designed fp16 kernel.

- I find that the metrics `smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed` are all zero, while other metrics such as `sm__cycles_active.avg` are non-zero. Is this normal?
- I then replaced my kernel with the naive PyTorch implementation as a check; the kernel invoked there is `ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn`. The report shows that `smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed` is non-zero, which means the work is classified as single-precision floating-point operations, conflicting with the kernel's fp16 implementation.

My questions are:

- How can one robustly estimate the FLOPS of a kernel?
- Do these metrics match the data type of the implementation? If not, how can we estimate FLOPS?

The used command is

```
ncu --sampling-interval 0 \
--replay-mode kernel \
--target-processes all \
-f --set roofline \
--range-filter :[1]: \
-k regex:gemm -o test1 \
python ncu_subprocess.py
```

Any response is appreciated! Thanks!

smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed won’t include every single instruction, so it’s possible that these are all zero while other work is being done. You should be able to look at the instruction mix table in the Instruction Statistics section to see what was done. For the second part, there can still be fmul instructions in an fp16 kernel. They may not be operating on the actual data in the matrices. You can also look at the sm__ops_path_tensor_src_fp16_dst_fp16 metrics to see if you get counts there for your fp16 kernel.
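For non-tensor-core work, a common first-pass estimate sums the predicated-on thread instruction counts, counting each FMA as two floating-point operations (one multiply plus one add). A minimal sketch — the counts below are placeholders standing in for values you would read out of the ncu report, not real measurements:

```python
# First-pass FLOP estimate from the per-datatype SASS instruction counters.
# Each add/mul instruction contributes 1 FLOP per thread, each FMA contributes 2.
def alu_flops(adds: int, muls: int, fmas: int) -> int:
    return adds + muls + 2 * fmas

# Placeholder counts standing in for the .sum variants of
# smsp__sass_thread_inst_executed_op_h{add,mul,fma}_pred_on:
hadd, hmul, hfma = 1_000, 500, 4_000
print(alu_flops(hadd, hmul, hfma))  # 1000 + 500 + 2*4000 = 9500
```

Note that this deliberately ignores tensor-core instructions, which is exactly why it comes out near zero for an mma-based fp16 GEMM.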

Because the instructions can get very complicated (like single mma instructions doing full matrix multiplication) there isn’t a straightforward way to calculate accurate flops just using these metric counters. You can estimate it for your kernel based on your instruction mix if you include the full opcode modifiers to calculate matrix sizes etc… but that can get very complicated.
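As an illustration of why the opcode modifiers matter: a single warp-level mma instruction with shape MxNxK performs M*N*K multiplies and M*N*K adds, so the per-instruction FLOP factor depends on the shape encoded in the modifier, not just the instruction count. A hedged sketch under assumed shapes and counts (the opcode names and the instruction-mix numbers here are illustrative, not taken from any real report):

```python
# Estimate FLOPs contributed by tensor-core mma instructions.
# A warp-level mma with shape MxNxK does 2*M*N*K FLOPs (multiplies + adds).
MMA_FLOPS = {
    "HMMA.16816": 2 * 16 * 8 * 16,  # assumed m16n8k16 fp16 shape
    "HMMA.1688":  2 * 16 * 8 * 8,   # assumed m16n8k8 fp16 shape
}

def mma_flops(inst_counts: dict) -> int:
    """Sum FLOPs over a {opcode: warp-level instruction count} mix."""
    return sum(MMA_FLOPS[op] * n for op, n in inst_counts.items())

# Hypothetical counts, as if read from the Instruction Statistics table:
print(mma_flops({"HMMA.16816": 10_000}))  # 10000 * 4096 = 40960000
```

In practice you would still have to scale by how the counter is reported (per warp vs. per thread), which is part of what makes this approach complicated.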

Thank you for the reply!

By looking into the instructions, I now understand why the fp16 kernel invokes fmul. But if `smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed` is not an accurate proxy for FLOPS, then the officially provided roofline sections can fail under the conditions I encountered.