Difference between sm__cycles_elapsed/smsp__cycles_elapsed and sm__inst_executed/smsp__inst_executed?

Hello! I have a dumb question :), but I’m a little confused.
I don’t clearly understand the difference between these metrics, or how the cycle counters relate to the HW architecture.

My current understanding (A100):

  1. Device → SMs → 4 sub-partitions per SM, each with its own warp scheduler, …, and execution units.
  2. The SM and each sub-partition have individual cycle counters.

Is this right?

I have the output:

sm__cycles_active.sum      186106798
smsp__cycles_active.sum    744269206
sm__inst_executed.sum      192937984
smsp__inst_executed.sum    192937984

The ratio smsp__cycles_active.sum ≈ 4 × sm__cycles_active.sum looks intuitively OK.

But I don’t understand why sm__inst_executed.sum == smsp__inst_executed.sum.


Each SM has four sub-partitions, correct. The difficulty in reading the numbers comes from the difference in what is counted (cycles vs. instructions). As a suggestion, in such cases it can be helpful to collect not just one sub-metric (.sum) but at least all first-level sub-metrics (sum/avg/min/max). There is no profiling overhead penalty for that. When using the CLI, you can omit the suffix to collect all first-level sub-metrics, e.g. --metrics sm__cycles_active,sm__inst_executed.

As a first example, consider that you have only one thread, or one warp, so that only one SMSP of one SM is active. In this case, you will find that the min, max and sum numbers for cycles between SM and SMSP actually (almost) match up, and the numbers for instructions are identical. (The averages will differ, as the number of SMs and SMSPs is not the same).
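To make the sub-metric rollup concrete, here is a minimal Python sketch of the single-warp case. It is only a toy model, not how the profiler computes anything: the device size (2 SMs with 4 SMSPs each) and the counter values (1000 cycles, 10 instructions) are made up for illustration, and the SM is assumed to be active for exactly as long as its one busy SMSP.

```python
# Toy rollup of per-unit counter values into sum/avg/min/max.
# Assumptions (hypothetical): 2 SMs x 4 SMSPs, and a single warp that keeps
# SM 0 / SMSP 0 active for 1000 cycles while executing 10 instructions.

def rollup(values):
    """Return the four first-level sub-metrics for a list of per-unit values."""
    return {
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

NUM_SMS, SMSPS_PER_SM = 2, 4
NUM_SMSPS = NUM_SMS * SMSPS_PER_SM

# One entry per SMSP; only the first one did any work.
smsp_cycles = [1000] + [0] * (NUM_SMSPS - 1)
smsp_insts = [10] + [0] * (NUM_SMSPS - 1)

# One entry per SM; SM 0 mirrors its only busy SMSP, SM 1 is idle.
sm_cycles = [1000] + [0] * (NUM_SMS - 1)
sm_insts = [10] + [0] * (NUM_SMS - 1)

for name, values in [
    ("sm cycles", sm_cycles),
    ("smsp cycles", smsp_cycles),
    ("sm insts", sm_insts),
    ("smsp insts", smsp_insts),
]:
    print(name, rollup(values))
```

In this model sum and max agree between the SM and SMSP levels, while avg differs because it divides by 2 units in one case and 8 in the other.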

When going from one thread to, say, 256 threads in one block, all SMSPs of the first SM become active. Let’s assume SMSP 0 is active for 1500 cycles and SMSPs 1-3 for 500 cycles each. smsp__cycles_active.sum will be 3000. However, sm__cycles_active.sum will only be 1500 (or a little more), since it’s the sum across all SMs, not the sum across all SMSPs, and the 500 cycles of SMSPs 1-3 are overlapped or “hidden” by the 1500 cycles of the longest-running SMSP 0. That’s because the SMSPs don’t run on separate clocks: they all share the same cycle signal, so a cycle in which several SMSPs are active still counts as one active cycle for the SM.
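The overlap can be sketched in Python by marking which cycles have at least one active SMSP. This is only an illustration under a simplifying assumption that is not from the profiler: every SMSP starts at cycle 0 and stays active for one contiguous stretch. The real counter may read slightly higher, as noted above.

```python
# Assumption: all SMSPs start at cycle 0 and are active for one contiguous run.
smsp_active_cycles = [1500, 500, 500, 500]  # SMSP 0..3 of a single SM

# Collect every cycle index during which at least one SMSP is active.
sm_active = set()
for cycles in smsp_active_cycles:
    sm_active.update(range(cycles))

# SMSP-level sum adds each SMSP's own active cycles.
smsp_cycles_active_sum = sum(smsp_active_cycles)
# SM-level count sees each shared cycle only once.
sm_cycles_active_sum = len(sm_active)

print(smsp_cycles_active_sum)  # 3000
print(sm_cycles_active_sum)    # 1500
```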

Instructions are counted by a different signal in the HW and they don’t “overlap”, since each individual instruction is executed by itself (in contrast to cycles, where multiple SMSPs can be on the same cycle signal). As such, SMSPs might execute (15, 5, 5, 5) instructions, respectively, leading to smsp__inst_executed.sum being 30 (max being 15 and min 0, since all SMSPs of all SMs are considered). Nevertheless, sm__inst_executed.sum (and max) will also be 30, since the first SM truly executed 30 individual instructions.
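The same numbers in a short Python sketch, for anyone who wants to play with them. The second, idle SM is a hypothetical addition so that the min of 0 shows up; the real device has many more SMs.

```python
# Per-SMSP instruction counts, one inner list per SM.
# Assumption (hypothetical): a second SM exists and stays idle.
smsp_insts_by_sm = [
    [15, 5, 5, 5],  # SM 0
    [0, 0, 0, 0],   # SM 1 (idle)
]

# SMSP-level view: every SMSP of every SM is one unit.
all_smsps = [n for sm in smsp_insts_by_sm for n in sm]
# SM-level view: instructions simply add up within each SM.
per_sm = [sum(sm) for sm in smsp_insts_by_sm]

print("smsp sum/max/min:", sum(all_smsps), max(all_smsps), min(all_smsps))
print("sm   sum/max:    ", sum(per_sm), max(per_sm))
```

Unlike the cycle counts, nothing is hidden here: the SM-level sum equals the SMSP-level sum.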

@felix_dt, thank you very much! The terms are clearer now.
To understand this better, I’d like to go from one SM to multiple SMs :).

Suppose we go to, say, 512 threads in two blocks (2 blocks of 256), with block 0 on SM 0 and block 1 on SM 1. Let’s assume we have these counters:

SM 0:
SMSP    cycles active    instructions executed
0       1500             15
1       500              5
2       500              5
3       500              5

SM 1:
SMSP    cycles active    instructions executed
0       2000             20
1       1000             5
2       1000             15
3       1000             15

Then I would expect the metrics to be:

smsp__cycles_active.sum    8000
sm__cycles_active.sum      3500
smsp__inst_executed.sum    85
sm__inst_executed.sum      85

Does that look right?

What does the metric smsp__sass_thread_inst_executed_op_dadd_pred_on.sum show? I guess it is the total number of DADD instructions in the kernel, summed over all SMSPs of all SMs on the GPU (SM0:SMSP0 + SM0:SMSP1 + … + SM1:SMSP3)?

“Then I would expect the metrics to be”

Yes, that’s the correct understanding.
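As a sanity check, the two-SM example can be rolled up in a few lines of Python. This is a sketch, not the profiler’s method: it assumes each SM’s active cycles equal those of its longest-running SMSP, which is the lower bound described earlier in the thread. With the table values as given, both instruction sums come out to 30 + 55 = 85.

```python
# (cycles active, instructions executed) per SMSP, keyed by SM index.
# Values are copied from the two tables in the question.
device = {
    0: [(1500, 15), (500, 5), (500, 5), (500, 5)],
    1: [(2000, 20), (1000, 5), (1000, 15), (1000, 15)],
}

# SMSP-level sums add every SMSP of every SM.
smsp_cycles_sum = sum(c for smsps in device.values() for c, _ in smsps)
smsp_inst_sum = sum(i for smsps in device.values() for _, i in smsps)

# Assumption: an SM is active exactly as long as its longest-running SMSP.
sm_cycles_sum = sum(max(c for c, _ in smsps) for smsps in device.values())
# Instructions do not overlap, so they add up within each SM.
sm_inst_sum = sum(sum(i for _, i in smsps) for smsps in device.values())

print("smsp__cycles_active.sum", smsp_cycles_sum)  # 8000
print("sm__cycles_active.sum  ", sm_cycles_sum)    # 3500
print("smsp__inst_executed.sum", smsp_inst_sum)    # 85
print("sm__inst_executed.sum  ", sm_inst_sum)      # 85
```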

“What does the metric smsp__sass_thread_inst_executed_op_dadd_pred_on.sum show? I guess it is the total number of DADD instructions in the kernel, summed over all SMSPs of all SMs on the GPU (SM0:SMSP0 + SM0:SMSP1 + … + SM1:SMSP3)?”

That is also correct. More precisely, it’s the number of such instructions executed with an on/active-predicate (vs instructions that were executed but had the predicate disabled, thereby having no effect).


Hello @felix_dt, thank you!
Perhaps I can ask another related question about instruction metrics in this thread :).

I have the output:

smsp__sass_thread_inst_executed_op_dadd_pred_on.sum    inst                   17850957824
smsp__inst_executed.sum                                inst                    1361608704

Why is smsp__inst_executed.sum < smsp__sass_thread_inst_executed_op_dadd_pred_on.sum? I thought smsp__inst_executed.sum was the total number of all instructions executed on the device in the kernel (so it would include the _dadd_ instructions).

Thank you very much!

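The difference is that all inst_executed metrics generally refer to warp-instructions, i.e. each instruction is counted once per CUDA warp that executes it. thread_inst_executed metrics refer to thread-instructions, i.e. each instruction is counted once per individual CUDA thread that executes it. A single warp-instruction with all 32 threads active therefore counts as 1 in inst_executed but as 32 in thread_inst_executed, which is why the thread-level number can be much larger. Both are relevant, e.g. for understanding intra-warp divergence. The UI’s Source page shows these, and there is more information in the associated documentation.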
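A small Python sketch of the two counting schemes. The trace is entirely hypothetical: each entry is the number of threads in a warp (out of 32) that executed one warp-level instruction with the predicate on. The last two lines just divide the two values posted in the question.

```python
WARP_SIZE = 32

# Hypothetical trace: predicated-on thread count for each warp-instruction.
active_threads_per_warp_inst = [32, 32, 16, 1]
assert all(0 <= n <= WARP_SIZE for n in active_threads_per_warp_inst)

# Warp-level counting: one per warp-instruction, regardless of active threads.
inst_executed = len(active_threads_per_warp_inst)
# Thread-level counting: one per thread that actually executed the instruction.
thread_inst_executed = sum(active_threads_per_warp_inst)

print(inst_executed)         # 4
print(thread_inst_executed)  # 81

# Values from the question: DADD thread-instructions vs. all warp-instructions.
ratio = 17850957824 / 1361608704
print(round(ratio, 1))  # about 13.1, well within the per-warp limit of 32
```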

