Understanding Tensor Pipe Throughput and Throttle Stalls

Hi All,

I was hoping someone could clarify whether there is a correlation between pipe throttles and tensor pipe throughput. I have a GEMM kernel, and in the summary section, ncu reports math pipe throttles as follows:

On average, each warp of this workload spends 4.4 cycles being stalled waiting for the execution pipe to be available. This stall occurs when all active warps execute their next instruction on a specific, oversubscribed math pipeline. Try to increase the number of active warps to hide the existent latency or try changing the instruction mix to utilize all available pipelines in a more balanced way. This stall type represents about 47.6% of the total average of 9.2 cycles between issuing two instructions.
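As a quick sanity check on the figures in that message, the 4.4-cycle number is simply the quoted share of the quoted total. A minimal Python sketch (the variable names are illustrative, not ncu metric names):

```python
# Values copied from the ncu message above.
total_cycles_between_issues = 9.2  # average cycles between issuing two instructions
math_throttle_share = 0.476        # 47.6% attributed to the math pipe throttle stall

# Cycles per issue attributed to waiting on the oversubscribed math pipe.
math_throttle_cycles = total_cycles_between_issues * math_throttle_share
print(round(math_throttle_cycles, 1))  # 4.4
```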

In the source view, I switch to `stall_math` and `stall_math (not issued)`, and as expected, ncu displays its yellow triangle in front of the HMMA instruction (as seen in the attached screenshot).

So I have 6 ldsm instructions (in total) followed by 64 HMMAM1688 per warp (I have 8 warps in total), in first-in-first-out order; that is, I issue all my ldsm instructions and then issue all my mma instructions. After each pair of ldsm, I can perform 8 MMAs, so I hope that I am reasonably able to overlap the MMAs and the ldsm instructions (i.e. while the MMAs enabled by the first pair of ldsm run, the remaining pairs of ldsm can complete in the background, and the loop continues).

I now understand that maybe the mma instruction queue depth is not sufficient to accommodate 64 mma instructions, and therefore the warp does not issue the instruction.

However, I am not seeing tensor core utilization peak at 100% either. If my warp is waiting for the pipe to become available, I would expect the tensor core to achieve peak flop per cycle. However, it only achieves around 80% (in contrast, cuBLASLt achieves 97% on the same hardware).

I was wondering if my expectation is correct, and what the correlation between the two is, and what I can do to achieve 100% tensor core utilization.

Thanks a lot !

If you have 2 warps per SM sub-partition (SMSP) the chance of getting a math throttle is very high.

Let’s diagram out 2 warps trying to issue multiple HMMA and let’s assume the issue frequency of the HMMA is 8 cycles.


            0                   1                   2      
Cycles      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
Warp 0
    Inst#   0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
    Stall   S W W W W W W W S W W W W W W W N M M M M M M M
Warp 4
    Inst#   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
    Stall   N M M M M M M M N M M M M M M M S W W W W W W W
    
In 24 cycles you reported the following stall reasons:
3   S - stall_selected
3   N - stall_not_selected
21  W - stall_wait
21  M - stall_math_throttle
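The tally above can be reproduced by counting letters in the two Stall rows of the diagram. The sketch below is only a bookkeeping check on the diagram, not a hardware model; the 8-cycle issue interval is the assumption stated earlier.

```python
from collections import Counter

# Stall rows copied from the diagram above (one letter per cycle, 24 cycles).
stall_rows = {
    "warp0": "S W W W W W W W S W W W W W W W N M M M M M M M",
    "warp4": "N M M M M M M M N M M M M M M M S W W W W W W W",
}

names = {
    "S": "stall_selected",
    "N": "stall_not_selected",
    "W": "stall_wait",
    "M": "stall_math_throttle",
}

# Count how often each reason is reported across both warps.
totals = Counter()
for row in stall_rows.values():
    totals.update(row.split())

for letter, name in names.items():
    print(f"{totals[letter]:<3d} {letter} - {name}")

# An HMMA issues only on cycles where some warp is selected.
issued = totals["S"]
assumed_issue_interval = 8  # assumption from the diagram, not a measured value
pipe_busy_cycles = issued * assumed_issue_interval
print(pipe_busy_cycles)  # cycles the MMA pipe is occupied under that assumption
```

Under that assumption the three issued HMMAs occupy the pipe for all 24 cycles, yet 42 of the 48 warp-cycle samples are stall_wait or stall_math_throttle. That is why the stall samples by themselves say little about bubbles in the pipe.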

The program counter and warp state sampling is likely not going to help you find bubbles in the MMA pipe.

The warp level MMA instructions (HMMA, BMMA, IMMA, DMMA, …) are fixed latency instructions with no instruction queue. The instructions execute like FFMA, FMUL, IMAD, IADD, etc. WGMMA (GH100) and UTCMMA (Blackwell) operate very differently than warp MMA.

LDSM is an LSU instruction. The LSU pipe is variable latency and is shared across all 4 sub-partitions.

LDSM and any other non-MMA instructions are likely to keep the kernel from executing MMA instructions.

Thank you so much for your detailed reply!

I see, so the reason each mma takes more than the desired number of cycles is probably because of unresolved data dependency:

From a warp scheduler’s perspective:

Instruction issue —> Instruction Dependency analysis —> Check All resolved —> MMA executes.

Since all these may be the internal steps in the mma instruction so to speak, any unresolved data dependency could manifest as MMA taking more than the required number of cycles, preventing it from reaching peak flop/clock. And the pipe throttle is completely orthogonal to this.

Also, if I understand correctly, does the MMA instruction make the warp wait for it to complete, or does the warp move ahead if the next instruction is not an MMA? For example, with a sequence like HMMA, LDSM, LDSM, LDSM, HMMA, from a clock perspective (achieving something like 1 IPC?):

If it’s an MMA instruction, I am assuming the warp scheduler would switch between the warps assigned to it and issue an instruction if it can.

Cycles:             0     1     2     3     4
Instruction issue:  HMMA  LDSM  LDSM  LDSM  STALL (since HMMA throughput is 1 every 8 cycles?)
(assuming port availability)

(in the above assume fully resolved data dependency for the HMMA).

I am trying to establish whether I can overlap some other instructions with the warp level mma instructions or not, so that they all execute concurrently.

Thanks a lot !

PS: I am aware that the mma PTX instruction has the .sync qualifier, which states that all threads in the warp wait to execute the instruction, but I am not fully clear on the meaning of "execute" here: would it mean issue + retire, or simply issue (i.e. the warp waits for all the threads to arrive at that instruction, then issues it and moves ahead)?

DATA DEPENDENCY STALLS
The warp scheduler checks a warp’s eligibility each cycle. For fixed latency instructions there is a fixed cycle requirement known by the compiler (e.g. the data dependency between FFMA and FFMA is 4 cycles). The compiler can use instruction-level parallelism (ILP) to cover some of the dependent latency. If it cannot, the warp will report stalled_wait. For variable latency dependencies, such as waiting on a shared memory load, global memory load, or special function unit instruction, the dependent instruction will cause the warp to report stalled_short_scoreboard or stalled_long_scoreboard. If the warp is not stalled on a data dependency then it is potentially eligible.

EXECUTION STALLS ANALYSIS
The next set of stalls is execution dependency. If the pipeline for a warp’s next instruction is occupied, the warp reports stalled_math_throttle. This is common for ALU, FMA, FMA64 (on 100 class), warp MMA, and FP64 (100 class) instructions. For variable latency pipelines the instruction is issued into an instruction queue. If the instruction queue is full then the warp will report stalled_mio_throttle, stalled_lg_throttle, stalled_tex_throttle, etc.
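The order of checks described so far (data dependency first, then execution dependency) can be written as a toy classifier. This is a sketch of the description above, not of the actual scheduler logic; `WarpState`, its fields, and `classify` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarpState:
    # Hypothetical, simplified view of one warp's next instruction.
    fixed_latency_dep_pending: bool = False  # e.g. FFMA -> FFMA result not ready
    scoreboard_dep: Optional[str] = None     # "short" or "long" if waiting on one
    fixed_pipe_busy: bool = False            # fixed latency pipe (ALU, FMA, MMA...) occupied
    full_queue: Optional[str] = None         # "mio", "lg" or "tex" if that queue is full


def classify(warp: WarpState) -> str:
    # 1) Data dependency stalls.
    if warp.fixed_latency_dep_pending:
        return "stalled_wait"
    if warp.scoreboard_dep is not None:
        return f"stalled_{warp.scoreboard_dep}_scoreboard"
    # 2) Execution dependency stalls.
    if warp.fixed_pipe_busy:
        return "stalled_math_throttle"
    if warp.full_queue is not None:
        return f"stalled_{warp.full_queue}_throttle"
    # 3) Nothing blocks the warp.
    return "eligible"


print(classify(WarpState(fixed_pipe_busy=True)))
print(classify(WarpState(scoreboard_dep="short", fixed_pipe_busy=True)))
```

In this model a warp reports exactly one reason per cycle, so a pending data dependency masks a busy pipe behind it.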

ELIGIBLE WARPS
If a warp is not stalled then the warp is eligible. The warp scheduler can pick one warp per cycle to issue. The warp that is selected will report stalled_selected and the warps that are not selected will report stalled_not_selected. While selected is not a stall, it is included so that the sum of all the stall reasons sm[sp]__warps_issue_stalled_{reason} == sm[sp]__warps_active.
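The bookkeeping in that last sentence can be sketched for a single cycle: every active warp reports exactly one reason, and at most one eligible warp is selected. The function name and the "lowest warp id wins" policy below are assumptions for illustration only; the real selection policy is not described here.

```python
from collections import Counter


def report_cycle(warp_states):
    """warp_states maps warp id -> a stall reason, or "eligible"."""
    eligible = [w for w, state in warp_states.items() if state == "eligible"]
    reported = dict(warp_states)
    if eligible:
        chosen = min(eligible)  # hypothetical policy: lowest warp id is picked
        for w in eligible:
            reported[w] = "stalled_selected" if w == chosen else "stalled_not_selected"
    return reported


# One cycle where both warps are eligible, and one where neither is.
both_eligible = report_cycle({0: "eligible", 4: "eligible"})
both_stalled = report_cycle({0: "stalled_wait", 4: "stalled_math_throttle"})

for reported in (both_eligible, both_stalled):
    per_reason = Counter(reported.values())
    # Sum over reasons equals the number of active warps in that cycle.
    print(dict(per_reason), sum(per_reason.values()) == len(reported))
```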

DISPATCH STALL
Once the warp is issued, it must read all its registers (some variable latency instructions read registers later) before it can dispatch to the execution unit. If there is too much contention on a register file bank then the warp can stall the issue pipeline, in which case many warps may report `stalled_dispatch_stall` until the warp can read all necessary registers.

Since all these may be the internal steps in the mma instruction so to speak, any unresolved data dependency could manifest as MMA taking more than the required number of cycles, preventing it from reaching peak flop/clock. And the pipe throttle is completely orthogonal to this.

Most of what you have listed is before instruction issue. stalled_dispatch_stall is the one case where MMA may take longer than expected.

Also, if I understand correctly, does the MMA instruction make the warp wait for it to complete, or does the warp move ahead if the next instruction is not an MMA?

Instructions are executed in order. Independent instructions can be interleaved between MMAs from the same warp. However, on back-to-back MMA instructions the second MMA will report stalled_wait until the MMA pipe is no longer occupied, as shown in my time diagram.

If it’s an MMA instruction, I am assuming the warp scheduler would switch between the warps assigned to it and issue an instruction if it can.

The warp scheduler can select from a different warp. In your example, there are only 2 warps per SM sub-partition, so some hiding of the MMA pipe stalls with LDSM is likely happening, but both warps are doing 2 LDSM then 8 HMMAs. LDSM goes to the shared LSU pipe, so at maximum speed you can dispatch 2 instructions in 2 cycles if the instruction queue is empty, in which case the warps may quickly end up contending for the MMA pipe. I do not know the data dependency between the LDSM and HMMA instructions.

PS: I am aware that the mma PTX instruction has the .sync qualifier, which states that all threads in the warp wait to execute the instruction, but I am not fully clear on the meaning of "execute" here: would it mean issue + retire, or simply issue (i.e. the warp waits for all the threads to arrive at that instruction, then issues it and moves ahead)?

For warp MMA instructions on most chips there is a requirement that all threads in the warp must be active and predicated true for the {BDIHQ}MMA instruction or the behavior is undefined. The requirement should only be on the issue.