Stall reasons do not sum to 100%

Although I collected all of the stall-related metrics below, the sum of the average (or maximum) values is far less than 100%.

    Metric Name                                                     Metric Unit Minimum   Maximum   Average
    --------------------------------------------------------------- ----------- --------- --------- ---------
    smsp__warp_issue_stalled_barrier_per_warp_active.pct            %           4.186474  7.639925  6.399756
    smsp__warp_issue_stalled_dispatch_stall_per_warp_active.pct     %           2.238771  2.710122  2.538968
    smsp__warp_issue_stalled_drain_per_warp_active.pct              %           0.002744  0.005988  0.003129
    smsp__warp_issue_stalled_imc_miss_per_warp_active.pct           %           0.089172  0.197238  0.104320
    smsp__warp_issue_stalled_lg_throttle_per_warp_active.pct        %           0.000000  0.000004  0.000000
    smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct    %           0.804119  1.863121  0.945215
    smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active.pct %           1.998483  2.132997  2.027976
    smsp__warp_issue_stalled_membar_per_warp_active.pct             %           0.000000  0.000000  0.000000
    smsp__warp_issue_stalled_mio_throttle_per_warp_active.pct       %           0.282937  0.387940  0.299651
    smsp__warp_issue_stalled_misc_per_warp_active.pct               %           0.000061  0.000136  0.000071
    smsp__warp_issue_stalled_no_instruction_per_warp_active.pct     %           6.311340  9.950329  7.975228
    smsp__warp_issue_stalled_not_selected_per_warp_active.pct       %           28.167789 34.228746 31.538357
    smsp__warp_issue_stalled_short_scoreboard_per_warp_active.pct   %           5.872918  6.517629  6.016746
    smsp__warp_issue_stalled_sleeping_per_warp_active.pct           %           0.000000  0.000000  0.000000
    smsp__warp_issue_stalled_tex_throttle_per_warp_active.pct       %           0.000000  0.000000  0.000000
    smsp__warp_issue_stalled_wait_per_warp_active.pct               %           19.641737 20.246015 19.795740

Is something missing here?

You are missing smsp__warp_issue_stalled_selected_per_warp_active, which is the proportion of warps per cycle selected by the microscheduler to issue an instruction. This is the “not stalled” reason, but it follows the same naming scheme so that you can sum the reasons to 100% of warp active cycles.

The reasons may not sum to exactly 100%. The denominator (smsp__warps_active.avg) and all of the reasons cannot be collected in the same pass, so pass-to-pass variance can produce a slightly different value.
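
As a quick check of the size of the gap, the Average column from the question can be added up. This is only a sketch: the values are copied by hand from the table above, the dictionary keys are shortened reason names chosen for illustration, and the remainder is just an estimate of what the uncollected reasons contribute.

```python
# Average values (in %) copied from the table in the question.
# Keys are the {reason} part of smsp__warp_issue_stalled_{reason}_per_warp_active.pct.
averages = {
    "barrier": 6.399756,
    "dispatch_stall": 2.538968,
    "drain": 0.003129,
    "imc_miss": 0.104320,
    "lg_throttle": 0.000000,
    "long_scoreboard": 0.945215,
    "math_pipe_throttle": 2.027976,
    "membar": 0.000000,
    "mio_throttle": 0.299651,
    "misc": 0.000071,
    "no_instruction": 7.975228,
    "not_selected": 31.538357,
    "short_scoreboard": 6.016746,
    "sleeping": 0.000000,
    "tex_throttle": 0.000000,
    "wait": 19.795740,
}

collected = sum(averages.values())   # share covered by the collected reasons
uncollected = 100.0 - collected      # share left for reasons not in the table

print(f"collected reasons: {collected:.2f}%")    # collected reasons: 77.65%
print(f"not accounted for: {uncollected:.2f}%")  # not accounted for: 22.35%
```

Roughly 22% of warp active cycles are therefore unaccounted for, which is far more than pass-to-pass variance would explain and points to reasons that were not collected.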

Sorry, I didn’t understand that.
As I see it, smsp__warp_issue_stalled_not_selected_per_warp_active.pct is “stall_not_selected”.
Are you saying that there is another stall reason, smsp__warp_issue_stalled_selected_per_warp_active, and that you call it “not stalled”? I take “not stalled” to mean that the warp has not been stalled, so the warp is running.

Can you clarify that?

A Volta-Ampere SM has 4 SM sub-partitions (SMSP). Each sub-partition has a warp scheduler responsible for scheduling up to N warps (GV100 = 16, TU1xx = 8, GA100 = 16, …). Each cycle, every active warp (a warp allocated to the scheduler) reports its highest-priority stall reason. Each cycle, the warp scheduler selects one eligible warp to issue an instruction. The selected warp reports selected. The eligible warps that were not selected report not_selected. The selected reason is needed to account for all warp active cycles.

You have not provided a GPU or Nsight Compute version. Comparing your list to the command-line output (Nsight Compute 2020.3.0), it appears you are missing selected and branch_resolving.

To get the full list, I recommend using the Nsight Compute command line. You can use --device if you do not know the chip name.

ncu.bat --chips tu102 --query-metrics | grep smsp__warp_issue_stalled_.*_per_warp_active
smsp__warp_issue_stalled_barrier_per_warp_active                            proportion of warps per cycle, waiting for sibling warps at a CTA barrier
smsp__warp_issue_stalled_branch_resolving_per_warp_active                   proportion of warps per cycle, waiting for a branch target address to be computed, and the warp PC
smsp__warp_issue_stalled_dispatch_stall_per_warp_active                     proportion of warps per cycle, waiting on a dispatch stall
smsp__warp_issue_stalled_drain_per_warp_active                              proportion of warps per cycle, waiting after EXIT for all memory instructions to complete so that
smsp__warp_issue_stalled_imc_miss_per_warp_active                           proportion of warps per cycle, waiting for an immediate constant cache (IMC) miss
smsp__warp_issue_stalled_lg_throttle_per_warp_active                        proportion of warps per cycle, waiting for a free entry in the LSU instruction queue
smsp__warp_issue_stalled_long_scoreboard_per_warp_active                    proportion of warps per cycle, waiting for a scoreboard dependency on L1TEX (local, global,
smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active                 proportion of warps per cycle, waiting for an execution pipe to be available
smsp__warp_issue_stalled_membar_per_warp_active                             proportion of warps per cycle, waiting on a memory barrier
smsp__warp_issue_stalled_mio_throttle_per_warp_active                       proportion of warps per cycle, waiting for a free entry in the MIO instruction queue
smsp__warp_issue_stalled_misc_per_warp_active                               proportion of warps per cycle, waiting on a miscellaneous hardware reason
smsp__warp_issue_stalled_no_instruction_per_warp_active                     proportion of warps per cycle, waiting to be selected for instruction fetch, or waiting on an
smsp__warp_issue_stalled_not_selected_per_warp_active                       proportion of warps per cycle, waiting for the microscheduler to select the warp to issue
smsp__warp_issue_stalled_selected_per_warp_active                           proportion of warps per cycle, selected by the microscheduler to issue an instruction
smsp__warp_issue_stalled_short_scoreboard_per_warp_active                   proportion of warps per cycle, waiting for a scoreboard dependency on MIO operation other than
smsp__warp_issue_stalled_sleeping_per_warp_active                           proportion of warps per cycle, waiting for a nanosleep to expire
smsp__warp_issue_stalled_tex_throttle_per_warp_active                       proportion of warps per cycle, waiting for a free entry in the TEX instruction queue
smsp__warp_issue_stalled_wait_per_warp_active                               proportion of warps per cycle, waiting on a fixed latency execution dependency
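
The comparison between the collected list and the full list can be scripted. The following is a minimal sketch, assuming the metric names are pasted in by hand: `collected` holds the reason names from the table in the question, `query_output` holds the names from the listing above with descriptions dropped, and both variable names are illustrative.

```python
import re

# Reason names from the table in the original question.
collected = {
    "barrier", "dispatch_stall", "drain", "imc_miss", "lg_throttle",
    "long_scoreboard", "math_pipe_throttle", "membar", "mio_throttle",
    "misc", "no_instruction", "not_selected", "short_scoreboard",
    "sleeping", "tex_throttle", "wait",
}

# Metric names from the --query-metrics listing above (descriptions dropped).
query_output = """\
smsp__warp_issue_stalled_barrier_per_warp_active
smsp__warp_issue_stalled_branch_resolving_per_warp_active
smsp__warp_issue_stalled_dispatch_stall_per_warp_active
smsp__warp_issue_stalled_drain_per_warp_active
smsp__warp_issue_stalled_imc_miss_per_warp_active
smsp__warp_issue_stalled_lg_throttle_per_warp_active
smsp__warp_issue_stalled_long_scoreboard_per_warp_active
smsp__warp_issue_stalled_math_pipe_throttle_per_warp_active
smsp__warp_issue_stalled_membar_per_warp_active
smsp__warp_issue_stalled_mio_throttle_per_warp_active
smsp__warp_issue_stalled_misc_per_warp_active
smsp__warp_issue_stalled_no_instruction_per_warp_active
smsp__warp_issue_stalled_not_selected_per_warp_active
smsp__warp_issue_stalled_selected_per_warp_active
smsp__warp_issue_stalled_short_scoreboard_per_warp_active
smsp__warp_issue_stalled_sleeping_per_warp_active
smsp__warp_issue_stalled_tex_throttle_per_warp_active
smsp__warp_issue_stalled_wait_per_warp_active
"""

# Extract the {reason} part of each metric name, one metric per line.
available = set(
    re.findall(
        r"^smsp__warp_issue_stalled_(\w+)_per_warp_active",
        query_output,
        flags=re.MULTILINE,
    )
)

missing = sorted(available - collected)
print(missing)  # ['branch_resolving', 'selected']
```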

Thanks. How can I find the chip name? I wasn’t able to find it via nvidia-smi or nv-nsight-cu-cli.

You are saying that if a warp is selected, it reports stalled_selected, and if a warp is not selected, it reports stall_not_selected.

When a warp is selected, it issues an instruction, so that is not a stall. Why should there be a stall reason for a selected warp? The selected warp is going to do some work, so the scheduler is not stalled. That is not bad, is it?

Let me translate what you said into another example. There are two people and one task. If a person is not ready, they are not selected, so the stall reason for that person is “not_selected”. If another person is ready, the stall reason is “selected”. But the latter is not a stall. It is progress, isn’t it?

The only explanation I can think of is that a selected warp fails to issue an instruction for some reason, so there is a stall named “selected” because the warp was selected but did not issue. However, according to your statement, a selected warp will issue an instruction, and that confuses me.

Summed over all reasons, smsp__warp_issue_stalled_{reason}.sum == smsp__warps_active.sum.

On each cycle, each warp reports one of the smsp__warp_issue_stalled reasons. If the warp is selected to issue an instruction, it reports selected. The warp is not stalled. The reason is named smsp__warp_issue_stalled_selected so that all reasons share a regex-able name and users can determine what to sum.

Multiple warps can be eligible to issue on each cycle, but the warp scheduler can pick only one warp to issue an instruction. If a warp is eligible (not stalled) and is not picked, it reports not_selected. The warp was eligible but was held back by the instruction issue unit.

The charts below show how the warp_issue_stalled_{reason} values may be reported for the instruction sequence listed below on Volta through GA100 GPUs, where FADD can be issued every 2 cycles.

FADD r0, r4, r5; # r0 = r4 + r5
FADD r1, r4, r5; # r1 = r4 + r5
FADD r2, r4, r5; # r2 = r4 + r5
FADD r3, r4, r5; # r3 = r4 + r5

There are no dependencies between the FADD instructions, so they can be issued at the maximum rate of the FP32 unit (FMA pipe). If there were read-after-write dependencies, additional wait stall cycles would appear in the charts.

The two charts show the per cycle state of warp0 and warp1 over 18 active cycles.

S = selected
NS = not_selected
MPT = math_pipe_throttle
W = wait

The warp scheduling heuristic is not documented, so the charts show two possible heuristics.

CASE 1 - warp scheduler issues to warp0 if possible

cycles -->
        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
warp0   S   W   S   W   S   W   S   W   S   -
warp1   NS  MPT NS  MPT NS  MPT NS  MPT NS  S   W   S   W   S   W   S   W   S   -

smsp__cycles_active.sum                         = 18
smsp__warps_active.sum                          = 27
smsp__warp_issue_stalled_selected.sum           = 10
smsp__warp_issue_stalled_not_selected.sum       = 5
smsp__warp_issue_stalled_math_pipe_throttle.sum = 4
smsp__warp_issue_stalled_wait.sum               = 8

CASE 2 - warp scheduler round-robins between warp0 and warp1

cycles -->
        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
warp0   S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   -
warp1   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  S   -

smsp__cycles_active.sum                         = 18
smsp__warps_active.sum                          = 35
smsp__warp_issue_stalled_selected.sum           = 10
smsp__warp_issue_stalled_not_selected.sum       = 9
smsp__warp_issue_stalled_math_pipe_throttle.sum = 8
smsp__warp_issue_stalled_wait.sum               = 8
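
The counts listed under each case can be reproduced by tallying the per-cycle states in the two charts. This is a minimal sketch, assuming that `-` marks a cycle in which the warp is no longer active; the helper name `count_reasons` is illustrative and not part of any NVIDIA tool.

```python
from collections import Counter

def count_reasons(chart):
    """Tally per-cycle warp states in a chart; '-' is treated as not active."""
    counts = Counter()
    for row in chart.strip().splitlines():
        _warp, *states = row.split()  # first column is the warp label
        counts.update(s for s in states if s != "-")
    return counts

# Rows copied from the CASE 1 and CASE 2 charts above.
CASE_1 = """
warp0   S   W   S   W   S   W   S   W   S   -
warp1   NS  MPT NS  MPT NS  MPT NS  MPT NS  S   W   S   W   S   W   S   W   S   -
"""

CASE_2 = """
warp0   S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   -
warp1   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  MPT S   W   NS  S   -
"""

for name, chart in (("CASE 1", CASE_1), ("CASE 2", CASE_2)):
    counts = count_reasons(chart)
    # Every active warp reports exactly one reason per cycle, so the
    # per-reason counts add up to smsp__warps_active.sum.
    warps_active = sum(counts.values())
    print(name, dict(counts), "warps_active =", warps_active)

# CASE 1 totals 27 and CASE 2 totals 35, matching smsp__warps_active.sum above.
```

In both cases selected is 10 and wait is 8; only not_selected and math_pipe_throttle change with the scheduling heuristic.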