Hi everyone, I ran into something confusing while profiling a kernel I wrote myself (fp16).

I find that smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed are all zero, while other metrics such as sm__cycles_active.avg are non-zero. Is this expected?

I then replaced my kernel with the plain PyTorch implementation to check. The kernel used there is ampere_fp16_s16816gemm_fp16_128x128_ldg8_f2f_stages_32x5_nn. The report shows that smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed is non-zero, which means the work is classified as single-precision floating-point operations. That conflicts with the kernel’s fp16 implementation.

My questions are:

How can the FLOPS of a kernel be estimated robustly?

Do the metrics match the data type of the implementation? If not, how can we estimate FLOPS?

The smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed metrics don’t cover every instruction, so it’s possible for all of them to be zero while other work is being done. You should be able to look at the instruction mix table in the Instruction Statistics section to see what was executed. For the second part, an fp16 kernel can still contain fmul instructions. They may not be operating on the actual data in the matrices. You can also look at the sm__ops_path_tensor_src_fp16_dst_fp16 metrics to see whether you get counts there for your fp16 kernel.

Because the instructions can get very complicated (for example, a single mma instruction performing a full matrix multiplication), there isn’t a straightforward way to calculate accurate FLOPS using these metric counters alone. You can estimate it for your kernel from your instruction mix if you include the full opcode modifiers to work out matrix sizes etc., but that can get very complicated.
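As a rough illustration of the instruction-mix approach described above, here is a minimal Python sketch. It assumes you have copied warp-level executed counts per opcode out of the report by hand. The function names, the input dictionary, and the shape table are hypothetical and not part of any Nsight Compute API. The shape table assumes the modifier 16816 means m16 n8 k16 and 1688 means m16 n8 k8, and that each warp-level mma instruction performs 2*m*n*k floating-point operations (one multiply and one add per multiply-accumulate).

```python
# Assumed mapping from the shape modifier in a SASS mma opcode to (m, n, k).
# Extend or correct this table for the opcodes that appear in your own report.
MMA_SHAPES = {
    "16816": (16, 8, 16),
    "1688": (16, 8, 8),
}


def flops_per_mma(opcode, shapes=MMA_SHAPES):
    """Return the FLOPs of one warp-level mma instruction, e.g. 'HMMA.16816.F16'."""
    # Skip the base opcode (HMMA, IMMA, ...) and look for a known shape modifier.
    for modifier in opcode.split(".")[1:]:
        if modifier in shapes:
            m, n, k = shapes[modifier]
            # Each multiply-accumulate counts as two operations.
            return 2 * m * n * k
    raise ValueError(f"unknown MMA shape in opcode {opcode!r}")


def estimate_tensor_flops(warp_inst_counts, shapes=MMA_SHAPES):
    """Sum FLOPs over a {opcode: warp-level executed count} dictionary."""
    return sum(
        count * flops_per_mma(opcode, shapes)
        for opcode, count in warp_inst_counts.items()
    )


if __name__ == "__main__":
    # Hypothetical counts, not taken from a real profile.
    counts = {"HMMA.16816.F16": 1000, "HMMA.1688.F16": 10}
    print(estimate_tensor_flops(counts))
```

This only covers the tensor instructions you list. Any FP32 or FP16 scalar instructions in the kernel would have to be added separately, and the counts must be warp-level instruction counts rather than per-thread counts.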

By looking at the instructions, I now understand why the fp16 kernel invokes fmul. But if smsp__sass_thread_inst_executed_op_(h/d/f)(mul/add/fma)_pred_on.sum.per_cycle_elapsed is not an accurate proxy for FLOPS, then the officially provided roofline sections can fail under the conditions I hit.

Just checking: what do you mean by “can fail under the conditions I met”?
Please let us know if you still have other questions and we’ll do our best to help.

NOTE: Do not use .per_cycle_elapsed if you want FLOPS. Use .per_second.

hadd counts predicated-on HADD2 thread instructions

hmul counts predicated-on HMUL2 thread instructions

hfma counts predicated-on HFMA2 thread instructions

These metrics can be weighted (x1 for HADD2/HMUL2 and x2 for HFMA2) to create an FP16 FLOPs count. You can see the weighting in /sections/SpeedOfLight_Hierarchical{Half,Single,Double}RooflineChart.section
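A minimal Python sketch of that weighting, assuming you have already read the three .sum values and the kernel duration out of the report. The function and parameter names are hypothetical. The default weights are the ones quoted in this thread; they are left as parameters so you can match whatever the .section file on your installation actually applies.

```python
def fp16_flops(hadd, hmul, hfma, w_add=1, w_mul=1, w_fma=2):
    """Weighted FP16 FLOP count from the three smsp__sass thread-instruction sums.

    Default weights follow the thread: x1 for HADD2/HMUL2 and x2 for HFMA2.
    """
    return w_add * hadd + w_mul * hmul + w_fma * hfma


def flops_per_second(flops, duration_s):
    """Divide a FLOP count by the kernel duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return flops / duration_s


if __name__ == "__main__":
    # Hypothetical counter values, not taken from a real profile.
    total = fp16_flops(hadd=100, hmul=50, hfma=200)
    print(total, flops_per_second(total, duration_s=0.5))
```

Tensor instructions are not included in these sums, so for a GEMM kernel this number covers only the non-tensor FP16 work.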

The smsp__sass metrics do not count Tensor instructions (HMMA, IMMA, BMMA, DMMA, etc.). Tensor OPs can be counted using the metrics (GA100):

As of 01/2024 the Tensor Roofline is only correct for GV100. It is recommended that you use the above metrics to calculate Tensor OPs.

It is very likely that the kernel uses some FP32 instructions. If you collect the Tensor metrics listed above, or review the instruction statistics and find HMMA (FP Matrix Multiply and Accumulate) instructions, you will most likely see that the kernel makes heavy use of the Tensor cores and has only a few FP32 instructions.