How to understand stall_wait and sampling data

I have some questions about how Nsight Compute samples each instruction and what stall_wait means.
I set --sampling-interval 0 when I run Nsight Compute, which means sampling every 32 cycles. If I run a kernel with only one warp, the total number of samples multiplied by 32 equals the cycle count of the kernel. But this no longer holds when I increase the number of warps. Does this mean that Nsight Compute can't sample every warp scheduler in each sampling period?
There is another question that puzzles me further. I wrote a kernel consisting only of many HMMA instructions, with no dependency among each group of four. When I run this kernel with one warp on each warp scheduler, stall_wait is 7 times stall_selected for each HMMA. This seems to make sense, but why is the stall reported as stall_wait rather than stall_math, since stall_wait is caused by a dependency and stall_math is caused by a busy pipeline? The real question is this: when I run the kernel with two warps on one warp scheduler, the cycle count doubles, but the number of samples halves, and stall_wait is still 7 times stall_selected. In my understanding, the number of samples should double since the overall cycle count doubles, and stall_wait should be 15 times stall_selected since each warp has to wait for two HMMAs before it can issue its next HMMA.
Is there something wrong with my understanding? Thank you.

Only a single, random warp is sampled each period. You can refer to the Kernel Profiling Guide in the Nsight Compute documentation.

I don’t yet have information on your second question.

In my kernel there is a main loop that is not unrolled and an unrolled subloop.

It seems that the first mma of the main loop is sampled many times and has a very large stall_math, which is equal to the sum of the stall_wait of all the other mmas.

Why is only the first mma in the loop sampled so many times, and why does stall_math appear only on this instruction?

Setting --sampling-interval 0 will try to sample every 32 cycles; however, the performance monitor may not be able to maintain that rate. --sampling-interval 0 can only be maintained on smaller GPUs.

On each sample interval the program counter sampler uses a round robin method to select the sampled warp.
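To make the sample accounting concrete, here is a minimal Python sketch of a sampler that takes one sample per period and picks the warp round robin. It is a toy model, not how the hardware is implemented. The function names are hypothetical, and the formula period = 2^(5 + interval) is an assumption consistent with the statement above that interval 0 means 32 cycles. The model also assumes the sampler keeps up with the requested rate, which may not hold on larger GPUs.

```python
def sampling_period_cycles(interval: int) -> int:
    """Cycles between samples, assuming period = 2**(5 + interval)."""
    return 2 ** (5 + interval)


def round_robin_samples(kernel_cycles: int, num_warps: int, interval: int = 0) -> list:
    """Samples collected per warp when one warp is sampled per period."""
    period = sampling_period_cycles(interval)
    counts = [0] * num_warps
    for n in range(kernel_cycles // period):
        # Exactly one warp is sampled in each period, chosen round robin.
        counts[n % num_warps] += 1
    return counts


one_warp = round_robin_samples(3200, num_warps=1)
four_warps = round_robin_samples(3200, num_warps=4)
print(sum(one_warp), one_warp)      # 100 [100]
print(sum(four_warps), four_warps)  # 100 [25, 25, 25, 25]
```

In this model the total number of samples depends only on the kernel duration, so samples multiplied by 32 matches the cycle count for any number of warps, while the samples per warp shrink as warps are added.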

When I run this kernel with one warp on each warp scheduler, stall_wait is 7 times stall_selected for each HMMA. This seems to make sense, but why is the stall reported as stall_wait rather than stall_math, since stall_wait is caused by a dependency and stall_math is caused by a busy pipeline?

For fixed-latency execution pipes, the wait reason is reported between the issue of an instruction and the point at which the warp can next issue. This wait depends on (a) when static dependencies are ready, and (b) when the pipeline is available. For example, on Volta through GA100 the FMA pipe can issue an instruction every 2 cycles. If a kernel issues a chain of independent FFMA instructions, the pattern would be (S = selected, W = stall_wait, N = stall_not_selected, M = stall_math)

S W S W

The W is inserted as the pipeline is known not to be available for 2 cycles.

If a kernel issues a chain of dependent FFMA instructions then the waits are increased to the dependent latency so the pattern would change to

S W W W S W W W
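The two single-warp patterns above can be reproduced with a short Python sketch. The function name is hypothetical, and the model only encodes what the patterns show: one selected cycle followed by stall_wait cycles until the warp can issue again. The gap of 8 cycles for HMMA is taken from the 7:1 ratio reported in the question.

```python
from collections import Counter


def single_warp_pattern(cycles_between_issues: int, num_instructions: int) -> list:
    """Per-cycle states for one warp: 'S' on issue, 'W' until the next issue."""
    per_instruction = ["S"] + ["W"] * (cycles_between_issues - 1)
    return per_instruction * num_instructions


independent = single_warp_pattern(2, 2)  # pipe accepts an issue every 2 cycles
dependent = single_warp_pattern(4, 2)    # dependent latency of 4 cycles
print(" ".join(independent))  # S W S W
print(" ".join(dependent))    # S W W W S W W W

# HMMA case from the question: 8 cycles between issues.
hmma = Counter(single_warp_pattern(8, 6))
print(hmma["W"] / hmma["S"])  # 7.0
```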

stall_math comes into play when multiple eligible (not stalled) warps try to issue to a math pipeline that is in use. In the 2-warp case you mention, with 2 independent HMMAs, I would expect the following pattern:

            0                 1                   2                   3                   4                   5
            1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
single warp
warp 0      S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W 

two warps
warp 0      S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W
warp 1      N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M

The real question is this: when I run the kernel with two warps on one warp scheduler, the cycle count doubles, but the number of samples halves, and stall_wait is still 7 times stall_selected. In my understanding, the number of samples should double since the overall cycle count doubles, and stall_wait should be 15 times stall_selected since each warp has to wait for two HMMAs before it can issue its next HMMA.

As you can see above the ratio of stall_wait to selected does not change.
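As a quick check, the two-warp timeline can be rebuilt in Python and the states counted. The block layout (two issue blocks, two blocked blocks, two issue blocks for warp 0, and the reverse for warp 1) is copied from the diagram; the variable names are only for illustration.

```python
from collections import Counter

ISSUE = ["S"] + ["W"] * 7    # warp issues an HMMA, then waits 7 cycles
BLOCKED = ["N"] + ["M"] * 7  # other warp was selected, then the pipe is busy

# 48 cycles per warp, matching the two-warp rows of the diagram.
warp0 = ISSUE * 2 + BLOCKED * 2 + ISSUE * 2
warp1 = BLOCKED * 2 + ISSUE * 2 + BLOCKED * 2

totals = Counter(warp0) + Counter(warp1)
wait_per_selected = totals["W"] / totals["S"]
print(dict(totals))
print(wait_per_selected)  # 7.0
```

The extra time a warp spends while the other warp owns the pipe is attributed to stall_not_selected and stall_math, so stall_wait stays at 7 cycles per selected cycle rather than growing to 15.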

You are correct that the total number of samples should double. I would have to look at the report to determine why this is not the case.

Why is only the first mma in the loop sampled so many times, and why does stall_math appear only on this instruction?

This generally occurs if the warp scheduler continues to prioritize the warp that was last selected. This results in one warp making forward progress while the other warp stays at the same program counter for a long period of time.
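The effect can be illustrated with a deliberately simplified Python model. It assumes a hypothetical scheduler that always keeps the last selected warp until that warp finishes, a loop body of 4 HMMAs, and 8 cycles between issues; none of these numbers come from the actual report, and real scheduling is more nuanced.

```python
from collections import Counter


def greedy_two_warps(body_len: int = 4, issue_gap: int = 8) -> Counter:
    """Count (pc, reason) cycles when warp 0 is always preferred over warp 1."""
    counts = Counter()
    # Phase 1: warp 0 runs the whole body while warp 1 is parked at pc 0.
    for pc in range(body_len):
        counts[(pc, "S")] += 1              # warp 0 issues
        counts[(pc, "W")] += issue_gap - 1  # warp 0 waits for the pipe
        counts[(0, "N")] += 1               # warp 1 not selected this cycle
        counts[(0, "M")] += issue_gap - 1   # warp 1 sees the math pipe busy
    # Phase 2: warp 0 is done, warp 1 runs alone.
    for pc in range(body_len):
        counts[(pc, "S")] += 1
        counts[(pc, "W")] += issue_gap - 1
    return counts


counts = greedy_two_warps()
math_pcs = sorted({pc for (pc, reason) in counts if reason == "M"})
print(math_pcs)            # [0]
print(counts[(0, "M")])    # 28
```

In this toy model every stall_math cycle lands on the first instruction, because that is the program counter where the non-preferred warp sits while the other warp makes progress.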

Thank you for the detailed explanation. I now have a basic understanding of the stall reasons. Thank you very much.