How to analyze the stall_wait samples in this HMMA case

I used CUTLASS to write an A16W4 (fp16 A matrix, int4 B matrix) GEMM kernel, then profiled it with Nsight Compute. On the Source page, I get

I have also read this to understand the stalled wait reason. It said

So my questions are:

  1. What is the issue rate of HMMA instructions on the Ampere architecture?
  2. Why are there so many stalled_wait samples instead of stalled_math_throttle between HMMA instructions, given that there is no data dependency between two consecutive instructions?

I can provide the ncu file if it is needed.

  1. the issue rate of HMMA instructions on the Ampere architecture.

To my knowledge, the issue rate of each MMA instruction variant is not documented. It can be estimated from the screenshot: for every 7 wait samples there is 1 selected sample, which suggests one issue every 8 cycles for HMMA.16816.F32 on this chip.
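As a rough sketch of that arithmetic in Python: the function name is made up for illustration, and the estimate assumes that the sampler hits every cycle with equal probability and that wait and selected are the only significant states on that instruction.

```python
def estimate_issue_interval(selected_samples: int, wait_samples: int) -> float:
    """Estimate the cycles between issues from per-instruction sample counts.

    Assumption: samples are spread uniformly over cycles, so the ratio of
    all samples to selected samples approximates cycles per issue.
    """
    if selected_samples <= 0:
        raise ValueError("need at least one selected sample")
    return (selected_samples + wait_samples) / selected_samples


# Ratio read off the screenshot: 7 wait samples per 1 selected sample.
print(estimate_issue_interval(selected_samples=1, wait_samples=7))  # 8.0
```

This is only an estimate from sampling data, not a documented hardware figure.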

  2. I was wondering why there are so many stalled_waits instead of stalled_math_throttle between HMMA instructions since there is no data dependency between two consecutive instructions.

stalled_math_throttle only occurs if the kernel has multiple warps on the same sub-partition that are using the same math pipeline. CUTLASS often tries to have a single warp issue a sequence of MMA instructions, then switch to a different warp to issue its own sequence of MMA instructions. In that case the second warp will not be stalled on the MMA pipe. The screenshot does not have sufficient information for that conclusion.

The not_selected stall on MMA implies the MMA pipe is free; however, the warp scheduler chose a different warp to issue on that cycle.

How was this conclusion reached? Could you describe the specific derivation to me?
And what does the warp scheduler do during the 7 wait samples?

The launch config of this kernel is grid = (10, 14, 1), block = (256, 1, 1). If I understand you correctly, it has 8 warps per block, in other words 2 warps per sub-partition. To my knowledge, the hardware decides which warp gets scheduled. So what does “cutlass often tries to” mean here?
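For reference, here is the arithmetic behind my warp count as a small Python sketch. The helper names are hypothetical. It assumes 32 threads per warp, 4 sub-partitions per SM, and only one block resident on each SM (I have not verified the last one).

```python
WARP_SIZE = 32             # threads per warp
SUBPARTITIONS_PER_SM = 4   # assumed: 4 sub-partitions per SM


def warps_per_block(block_threads: int) -> int:
    """Number of warps needed to hold block_threads threads (rounded up)."""
    return (block_threads + WARP_SIZE - 1) // WARP_SIZE


def warps_per_subpartition(block_threads: int, resident_blocks_per_sm: int = 1) -> float:
    """Average warps per sub-partition, assuming warps are spread evenly.

    resident_blocks_per_sm=1 is an assumption, not a measured value.
    """
    total_warps = warps_per_block(block_threads) * resident_blocks_per_sm
    return total_warps / SUBPARTITIONS_PER_SM


block = (256, 1, 1)
threads = block[0] * block[1] * block[2]
print(warps_per_block(threads))         # 8
print(warps_per_subpartition(threads))  # 2.0
```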

This is my ncu file, with m = 1200, n = 3548, k = 18944. I changed the extension to .log since .ncu-rep files cannot be uploaded.
gemm.log (20.2 MB)

Still waiting for your help.