How to understand stall_wait and sampling data

783161219 · November 24, 2021, 2:32am

I have some questions about how Nsight Compute profiles each instruction and what stall_wait means.
I set --sampling-interval 0 when I run Nsight Compute, which means sampling every 32 cycles. If I run a kernel with only one warp, the total number of samples multiplied by 32 equals the cycle count of the kernel. But this is no longer true when I increase the number of warps. Does this mean that Nsight Compute can't sample every warp scheduler in each sampling period?
A second question puzzles me further. I wrote a kernel that consists only of many HMMA instructions, with no dependency between any four of them. When I run this kernel with one warp on each warp scheduler, stall_wait is 7 times stall_selected for each HMMA. That seems to make sense, but why is the stall reported as stall_wait rather than stall_math, given that stall_wait is caused by a dependency and stall_math is caused by a busy pipeline? When I run this kernel with two warps on one warp scheduler, the cycle count doubles, but the number of samples halves, and stall_wait is still 7 times stall_selected. In my understanding, the number of samples should double because the total cycle count doubles, and stall_wait should be 15 times stall_selected because each warp has to wait for two HMMAs before it can issue its next HMMA.
Is there something wrong with my understanding? Thank you.

felix_dt · November 24, 2021, 7:36am

Only a single, random warp is sampled each period. You can refer to the Kernel Profiling Guide in the Nsight Compute documentation.

I don’t yet have information on your second question.

783161219 · November 24, 2021, 8:32am

My kernel has a main loop that is not unrolled and an unrolled subloop.

It seems that the first mma of the main loop is sampled a great many times and has a very large stall_math, which is equal to the sum of stall_wait over all the other mmas.

Why is only the first mma in the loop sampled so many times, and why does stall_math appear only on this instruction?

Greg · November 30, 2021, 5:45pm

Setting --sampling-interval 0 will try to sample every 32 cycles; however, the performance monitor may not be able to maintain that rate. --sampling-interval 0 can only be maintained on smaller GPUs.

On each sample interval the program counter sampler uses a round robin method to select the sampled warp.
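As a rough sketch of the arithmetic behind the sampling interval: to my understanding, the Nsight Compute documentation describes the period as 2^(5 + value) cycles, which is consistent with a value of 0 meaning 32 cycles. The helper names below are hypothetical and not part of any NVIDIA tool. The estimate is an upper bound on samples per sampler, assuming one warp is sampled per period and the hardware keeps up with the requested rate, which may not hold on larger GPUs.

```python
def sampling_period_cycles(interval_value: int) -> int:
    """Cycles between samples for a given --sampling-interval value.

    Assumes the documented relation period = 2 ** (5 + value).
    """
    if not 0 <= interval_value <= 31:
        raise ValueError("sampling interval value must be in [0, 31]")
    return 2 ** (5 + interval_value)


def max_expected_samples(kernel_cycles: int, interval_value: int = 0) -> int:
    """Upper bound on samples from one sampler over the kernel's duration.

    Only one warp is sampled per period, so this depends on elapsed
    cycles, not on how many warps are resident.
    """
    return kernel_cycles // sampling_period_cycles(interval_value)


if __name__ == "__main__":
    print(sampling_period_cycles(0))
    print(max_expected_samples(3200, 0))
```

Because a single warp is sampled per period, adding warps does not add samples by itself; only a longer kernel duration does.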

> When I run this kernel with one warp on each warp scheduler, stall_wait is 7 times stall_selected for each HMMA. That seems to make sense, but why is the stall reported as stall_wait rather than stall_math, given that stall_wait is caused by a dependency and stall_math is caused by a busy pipeline?

For fixed-latency execution pipes, the wait reason is reported between the issue of the instruction and when the instruction can next issue. This wait depends on (a) when static dependencies are ready and (b) when the pipeline is available. For example, on Volta - GA100 the FMA pipe can issue an instruction every 2 cycles. If a kernel issues a chain of independent FFMA instructions, the pattern would be as follows (S = selected, W = stall_wait, N = stall_not_selected, M = stall_math):

S W S W

The W is inserted as the pipeline is known not to be available for 2 cycles.

If a kernel issues a chain of dependent FFMA instructions, then the waits increase to the dependent latency, so the pattern changes to:

S W W W S W W W
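The two patterns above can be reproduced with a small sketch. The function name is hypothetical, and the cycle counts (2 for independent FFMA, 4 for dependent FFMA) are simply read off the patterns in this post rather than taken from an architecture specification.

```python
def issue_pattern(cycles_between_issues: int, instructions: int) -> str:
    """Per-cycle warp states for a single warp issuing to one fixed-latency pipe.

    Each instruction contributes one 'S' (selected) cycle followed by
    'W' (stall_wait) cycles until the warp can issue again.
    """
    if cycles_between_issues < 1:
        raise ValueError("cycles_between_issues must be at least 1")
    one_instruction = ["S"] + ["W"] * (cycles_between_issues - 1)
    return " ".join(one_instruction * instructions)


if __name__ == "__main__":
    # Independent FFMA chain: limited by the pipe's issue rate.
    print(issue_pattern(2, 2))
    # Dependent FFMA chain: limited by the dependent latency.
    print(issue_pattern(4, 2))
```

In both cases the stall is reported as stall_wait, because a single warp knows in advance how long it must wait; only the length of the wait differs.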

stall_math comes into play when there are multiple eligible (not stalled) warps trying to issue to a math pipeline that is in use. In the two-warp case you mention, with 2 independent HMMAs, I would expect the following pattern:

            0                 1                   2                   3                   4                   5
            1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0
single warp
warp 0      S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W S W W W W W W W 

two warp
warp 0      S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W
warp 1      N M M M M M M M N M M M M M M M S W W W W W W W S W W W W W W W N M M M M M M M N M M M M M M M

> When I run this kernel with two warps on one warp scheduler, the cycle count doubles, but the number of samples halves, and stall_wait is still 7 times stall_selected. In my understanding, the number of samples should double because the total cycle count doubles, and stall_wait should be 15 times stall_selected because each warp has to wait for two HMMAs before it can issue its next HMMA.

As you can see above, the ratio of stall_wait to selected does not change.
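To check the ratio numerically, here is a minimal sketch that rebuilds the timelines from the diagram and counts the states. It is not a model of the real warp scheduler: the 8-cycle issue interval and the burst of 2 HMMAs per warp are read off the diagram, and all names are hypothetical.

```python
from collections import Counter

HMMA_ISSUE_INTERVAL = 8  # one S plus seven W cycles, as drawn in the diagram
BURST = 2                # HMMAs one warp issues before the other takes over (from the diagram)


def timelines(num_warps: int, hmma_per_warp: int) -> list[str]:
    """Return one state string per warp, one character per cycle."""
    if hmma_per_warp % BURST:
        raise ValueError("hmma_per_warp must be a multiple of BURST")
    lines = [[] for _ in range(num_warps)]
    for slot in range(num_warps * hmma_per_warp):
        owner = (slot // BURST) % num_warps  # warp that issues in this slot
        for warp in range(num_warps):
            if warp == owner:
                lines[warp] += ["S"] + ["W"] * (HMMA_ISSUE_INTERVAL - 1)
            else:
                # Eligible but the pipe is in use by the other warp.
                lines[warp] += ["N"] + ["M"] * (HMMA_ISSUE_INTERVAL - 1)
    return ["".join(line) for line in lines]


def wait_to_selected(line: str) -> float:
    """Ratio of stall_wait cycles to selected cycles for one warp."""
    counts = Counter(line)
    return counts["W"] / counts["S"]


if __name__ == "__main__":
    for warps in (1, 2):
        result = timelines(warps, 6)
        print(warps, "warp(s):", len(result[0]), "cycles,",
              "wait/selected =", [wait_to_selected(t) for t in result])
```

In this sketch the extra cycles of the two-warp case show up as stall_not_selected and stall_math on the warp that is not issuing, so the stall_wait to selected ratio stays at 7 while the cycle count doubles.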

You are correct that the total number of samples should double. I would have to look at the report to determine why this is not the case.

> Why is only the first mma in the loop sampled so many times, and why does stall_math appear only on this instruction?

This generally occurs if the warp scheduler continues to prioritize the warp that was last selected. This results in one warp making forward progress while the other warp stays at the same program counter for a long period of time.
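A toy model can illustrate how this skews the per-instruction counts. Everything in it is an assumption made for illustration: two warps share one scheduler, the scheduler stays with the last-selected warp for one full pass over the loop body before switching, the HMMA issue interval is 8 cycles, and every cycle of every warp is counted as a sample. The function name is hypothetical, and the real scheduler policy is not documented in this thread.

```python
from collections import Counter


def sticky_scheduler_samples(body_len: int, iterations: int,
                             issue_interval: int = 8) -> Counter:
    """Count (pc, reason) cycles for two warps under a 'sticky' scheduler.

    The running warp walks through pcs 0..body_len-1; the other warp
    sits at pc 0 (the first mma of its next pass) until it is chosen.
    """
    samples = Counter()
    remaining = [iterations, iterations]  # loop passes left per warp
    running = 0
    while remaining[running] > 0:
        other = 1 - running
        for pc in range(body_len):
            samples[(pc, "selected")] += 1
            samples[(pc, "stall_wait")] += issue_interval - 1
            if remaining[other] > 0:
                # The waiting warp is eligible but the pipe is busy.
                samples[(0, "stall_not_selected")] += 1
                samples[(0, "stall_math")] += issue_interval - 1
        remaining[running] -= 1
        running = other
    return samples


if __name__ == "__main__":
    counts = sticky_scheduler_samples(body_len=4, iterations=2)
    for key in sorted(counts):
        print(key, counts[key])
```

In this model all stall_math cycles land on the first mma, because that is the only program counter at which a warp ever waits for the pipe, while stall_wait is spread evenly over every mma in the body.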

783161219 · December 1, 2021, 2:17am

Thank you for the detailed explanation. I have a basic understanding of stall reasons now. Thank you very much.

system · December 15, 2021, 2:17am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
