Question About Warp Stalls Observed in GEMM Profiling on H100

202476410arsmart · January 1, 2025, 8:59am

I profiled torch.matmul (which calls cuBLAS) for a GEMM on an H100 with M=N=K=2048. According to the roofline model, the kernel is compute-bound with respect to DRAM and L1, but memory-bound with respect to L2. However, in Nsight Compute (NCU), the largest warp stall reason I observed was long scoreboard, followed by GMMA. Why is that? I expected GMMA to be the largest source of stalls.
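For context, the DRAM side of the roofline reasoning above can be sketched as a back-of-the-envelope calculation. This is only a rough sketch: the element size (2 bytes, i.e. FP16/BF16) and the peak throughput and bandwidth figures are assumptions based on approximate H100 SXM datasheet values, not numbers taken from the attached reports. The L1 and L2 rooflines depend on measured traffic in NCU and are not modeled here.

```python
# Roofline sketch for a square GEMM. All hardware numbers are assumptions.
M = N = K = 2048
BYTES_PER_ELEMENT = 2  # assumption: FP16/BF16 inputs and output

# One multiply and one add per inner-product term.
flops = 2 * M * N * K

# Ideal DRAM traffic: read A and B once, write C once.
min_bytes = BYTES_PER_ELEMENT * (M * K + K * N + M * N)
arithmetic_intensity = flops / min_bytes  # FLOP per byte

# Assumed peak figures (approximate H100 SXM values; check your own SKU).
PEAK_FLOPS = 989e12     # dense FP16 Tensor Core FLOP/s
PEAK_DRAM_BW = 3.35e12  # HBM3 bytes/s
ridge_point = PEAK_FLOPS / PEAK_DRAM_BW

bound = "compute-bound" if arithmetic_intensity > ridge_point else "memory-bound"
print(f"arithmetic intensity: {arithmetic_intensity:.1f} FLOP/byte")
print(f"ridge point (DRAM):   {ridge_point:.1f} FLOP/byte")
print(f"DRAM roofline verdict: {bound}")
```

Under these assumptions the ideal arithmetic intensity is 2048/3 FLOP/byte, well above the assumed DRAM ridge point, which is consistent with the kernel being compute-bound for DRAM.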

output-file-full.nsight-cuprof-report.zip (41.3 MB)
output-file-roofline.nsight-cuprof-report.zip (40.4 MB)

Greg · January 6, 2025, 6:31pm

Warp group instructions (HGMMA) have two primary stall locations:

1. At WARPGROUP.ARRIVE, where the warp waits for all warps in the warp group to signal that all dependencies are resolved so that the HGMMA instructions can issue (Stall GMMA).
2. At WARPGROUP.DEPBAR, where the warp waits until all warps in the group have completed the HGMMA instructions (Stall Barrier).

Stall GMMA (2.95) + Stall Barrier (2.10) = 5.05 cycles per instruction executed, which is close to Stall Long Scoreboard (5.2).
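As a quick check of that comparison, the arithmetic can be written out as below. The three values are the ones quoted above; the variable names are illustrative only and are not NCU metric names.

```python
# Stall values quoted above, in cycles per instruction executed.
stall_gmma = 2.95             # WARPGROUP.ARRIVE (Stall GMMA)
stall_barrier = 2.10          # WARPGROUP.DEPBAR (Stall Barrier)
stall_long_scoreboard = 5.2   # Stall Long Scoreboard

# Combine the two HGMMA-related stall reasons and compare with long scoreboard.
hgmma_related = stall_gmma + stall_barrier
ratio = hgmma_related / stall_long_scoreboard

print(f"HGMMA-related stalls: {hgmma_related:.2f} cycles/instruction")
print(f"Relative to long scoreboard: {ratio:.0%}")
```

The two HGMMA-related stall reasons together come to roughly 97% of the long scoreboard stall, so HGMMA is a comparable source of stalls once both locations are counted.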

Many other instructions are issued between consecutive HGMMA instructions, which hides the latency of the HGMMA instructions.
