How to understand stall_wait and sampling data

Setting --sampling-interval 0 will try to sample every 32 cycles; however, the performance monitor may not be able to maintain that rate. In practice, --sampling-interval 0 can only be sustained on smaller GPUs.

On each sample interval the program counter sampler uses a round robin method to select the sampled warp.
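As a rough illustration of the round-robin selection described above, here is a minimal Python sketch. The function name sample_schedule, the choice to take the first sample at cycle 0, and the fixed warp order are assumptions made for illustration; the real sampler is hardware and its exact ordering is not documented here. The 32-cycle default matches --sampling-interval 0.

```python
def sample_schedule(num_warps, total_cycles, interval=32):
    """Return (cycle, warp) pairs: one sample per interval,
    visiting warps in round-robin order (illustrative model only)."""
    sample_cycles = range(0, total_cycles, interval)
    return [(cycle, i % num_warps) for i, cycle in enumerate(sample_cycles)]


# Two warps on one scheduler, sampled over 128 cycles.
for cycle, warp in sample_schedule(num_warps=2, total_cycles=128):
    print(f"cycle {cycle:3d}: sample warp {warp}")
```

In this model each warp is sampled once every num_warps intervals, so the number of samples attributed to a single warp depends on how many warps share the scheduler.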

When I run this kernel with one warp on each warp scheduler, stall_wait is 7 times stall_selected for each HMMA. This seems to make sense, but why is the stall reported as stall_wait rather than stall_math, given that stall_wait is caused by a dependency and stall_math is caused by a busy pipeline?

For fixed-latency execution pipes, the wait reason is reported between the issue of an instruction and the point when the next instruction can issue. This wait depends on (a) when static dependencies are ready and (b) when the pipeline is available. For example, on Volta through GA100 the FMA pipe can issue an instruction every 2 cycles. If a kernel issues a chain of independent FFMA instructions, the pattern would be (S = selected, W = stall_wait, N = stall_not_selected, M = stall_math):

S W S W

The W is inserted because the pipeline is known to be unavailable for 2 cycles.

If a kernel issues a chain of dependent FFMA instructions, the waits increase to the dependent latency, so the pattern changes to:

S W W W S W W W
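The two patterns above can be reproduced with a small Python sketch. It is a toy model, not the profiler's logic: the function name single_warp_pattern is hypothetical, and the 4-cycle gap for the dependent chain is simply read off the pattern shown above.

```python
def single_warp_pattern(issue_interval, instructions):
    """Per-cycle state of one warp: 'S' on the issue cycle, then 'W'
    until the next instruction can issue (toy model)."""
    states = []
    for _ in range(instructions):
        states.append("S")
        states.extend(["W"] * (issue_interval - 1))
    return " ".join(states)


# Independent FFMA chain: the pipe accepts an instruction every 2 cycles.
print(single_warp_pattern(issue_interval=2, instructions=2))  # S W S W

# Dependent FFMA chain: the gap grows to the dependent latency (4 here).
print(single_warp_pattern(issue_interval=4, instructions=2))  # S W W W S W W W
```

In this model the ratio of W to S is always issue_interval - 1, which is why an instruction that can issue every 8 cycles shows stall_wait at 7 times selected.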

stall_math comes into play when there are multiple eligible warps (not stalled) trying to issue to a math pipeline that is in use. In the two-warp case you mention, with 2 independent HMMA instructions, I would expect the following pattern:

            0                 1                   2                   3                   4                   5
            1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
single warp
warp 0      S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W 

two warp
warp 0      S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W
warp 1      N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M

My question is: when I run this kernel with two warps on one warp scheduler, the cycle count doubles, but the number of samples halves, and stall_wait is still 7 times stall_selected. In my understanding, the number of samples should double since the overall cycle count doubles, and stall_wait should be 15 times stall_selected, since each warp has to wait for two HMMAs before it can issue its next HMMA.

As you can see above, the ratio of stall_wait to selected does not change.
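The unchanged ratio can be checked by rebuilding the two-warp timeline and counting states. The Python sketch below is a toy model under stated assumptions: the 8-cycle HMMA issue interval and the two-HMMA bursts per warp are read off the diagram above, and the names two_warp_timeline, HMMA_INTERVAL and BURST are hypothetical.

```python
from collections import Counter

HMMA_INTERVAL = 8  # cycles between HMMA issues, as drawn (S followed by 7 W)
BURST = 2          # HMMAs one warp issues before the other takes over, as drawn


def two_warp_timeline(total_issues):
    """Per-cycle states for two warps sharing one scheduler (toy model).

    The issuing warp shows S then W; the other warp shows N on the issue
    cycle (not selected) and M while the math pipe is busy."""
    rows = {0: [], 1: []}
    for issue in range(total_issues):
        active = (issue // BURST) % 2
        idle = 1 - active
        rows[active] += ["S"] + ["W"] * (HMMA_INTERVAL - 1)
        rows[idle] += ["N"] + ["M"] * (HMMA_INTERVAL - 1)
    return rows


rows = two_warp_timeline(total_issues=6)
for warp, states in rows.items():
    counts = Counter(states)
    print(f"warp {warp}: {' '.join(states)}")
    print(f"  stall_wait / selected = {counts['W'] / counts['S']}")
```

The extra cycles each warp spends waiting for the other warp's HMMAs land in N and M rather than in W, so stall_wait stays at 7 times selected for both warps.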

You are correct that the total number of samples should double. I would have to look at the report to determine why this is not the case.

Why is only the first mma in the loop sampled so many times, and why does stall_math appear only on this instruction?

This generally occurs if the warp scheduler continues to prioritize the warp that was last selected. This results in one warp making forward progress while the other warp stays at the same program counter for a long period of time.
