I tested torch.matmul (which uses cuBLAS) for GEMM on an H100 with M=N=K=2048. According to the roofline model, it is compute-bound for DRAM and L1, but memory-bound for L2. However, in NCU, the largest warp stall I observed was caused by long scoreboard, followed by GMMA. Why is that? I initially thought GMMA would be the largest source of stalls.
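For context, the DRAM-level part of the roofline claim can be checked from the problem size alone. This is a minimal sketch that assumes 2-byte (FP16/BF16) elements and that each matrix crosses DRAM exactly once. The two peak figures are rough placeholders, not measured or official H100 numbers. L1 and L2 traffic depend on the kernel's tiling, so they cannot be derived this way.

```python
# Problem size from the question.
M = N = K = 2048

# Assumption: FP16/BF16 inputs and output, i.e. 2 bytes per element.
BYTES_PER_ELEMENT = 2

# Assumption: rough placeholder peaks, to be replaced with the real
# figures for the GPU and data type being profiled.
ASSUMED_PEAK_FLOPS = 1.0e15      # FLOP/s
ASSUMED_DRAM_BANDWIDTH = 3.0e12  # bytes/s


def gemm_arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte of DRAM traffic, assuming A, B and C each move once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity at which the roofline turns compute-bound."""
    return peak_flops / peak_bandwidth


intensity = gemm_arithmetic_intensity(M, N, K, BYTES_PER_ELEMENT)
ridge = ridge_point(ASSUMED_PEAK_FLOPS, ASSUMED_DRAM_BANDWIDTH)
bound = "compute-bound" if intensity > ridge else "memory-bound"

print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"ridge point (assumed peaks): {ridge:.1f} FLOP/byte")
print(f"DRAM roofline verdict: {bound}")
```

With these assumptions the intensity is 2048/3 FLOP/byte, well above the placeholder ridge point, which is consistent with the compute-bound verdict for DRAM.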
Warp group instructions (HGMMA) have two primary stall locations:
1. At WARPGROUP.ARRIVE, where a warp waits for all warps in the warp group to signal that their dependencies are resolved so that the HGMMA instructions can issue (Stall GMMA).
2. At WARPGROUP.DEPBAR, where a warp waits until all warps in the group have completed the HGMMA instructions (Stall Barrier).
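Stall GMMA (2.95) + Stall Barrier (2.10) adds up to 5.05 cycles per executed instruction, which is close to Stall Long Scoreboard (5.2). Once both stall locations are counted, the HGMMA-related stalls are about as large as the long scoreboard stall.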
Many other instructions are issued between consecutive HGMMA instructions, and they hide the latency of the HGMMA instructions.
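The comparison above can be reproduced with a few lines of Python. This is only a sketch: the three values are the ones quoted in this answer, typed in by hand, and the dictionary keys are illustrative names, not identifiers from the NCU report or its API.

```python
# Stall values (cycles per executed instruction) quoted in the answer.
# The key names are hypothetical labels chosen for this example.
stalls = {
    "long_scoreboard": 5.2,
    "gmma": 2.95,     # waiting at WARPGROUP.ARRIVE
    "barrier": 2.10,  # waiting at WARPGROUP.DEPBAR
}

# Both GMMA and Barrier stalls come from the HGMMA issue/complete path,
# so they are summed before comparing against long scoreboard.
hgmma_related = stalls["gmma"] + stalls["barrier"]
ratio = hgmma_related / stalls["long_scoreboard"]

print(f"HGMMA-related stalls: {hgmma_related:.2f}")
print(f"Long scoreboard:      {stalls['long_scoreboard']:.2f}")
print(f"Ratio:                {ratio:.2f}")
```

Grouping stall reasons by the instruction path that causes them, rather than reading each reason in isolation, is what makes the two totals comparable here.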