IPC at device level

Hi
Following the metric collection guide, I see these numbers related to IPC.

    gpu__cycles_active.sum                           cycle             657,154
    gpu__inst_executed.sum                            (!) n/a
    sm__cycles_active.sum                            cycle          53,309,893
    sm__inst_executed.avg.per_cycle_active           inst/cycle           3.40
    sm__inst_executed.sum                            inst          181,313,579
    smsp__inst_executed.avg.per_cycle_active         inst/cycle           0.85

sm__inst_executed.avg.per_cycle_active is 4 times smsp__inst_executed.avg.per_cycle_active, which is what I expect.

Also, sm__inst_executed.sum / sm__cycles_active.sum equals sm__inst_executed.avg.per_cycle_active, which is also what I expect.

I would like to know why the gpu__ metrics do not look correct. I expected gpu__inst_executed.sum to be the same as SMs*sm__inst_executed.sum (for the 3080 the number of SMs is 68), and the same for cycles. That would give me the total number of instructions executed on the GPU and the total cycles that the GPU was active. Isn't that correct?

{unit}__ is the location where the performance counter is collected.

sm__ - the performance counter is collected at the SM level
smsp__ - the performance counter is collected at the SM sub-partition level

Since inst_executed is a deterministic counter, the GPU/device-level value is

  • sm__inst_executed.sum, or
  • smsp__inst_executed.sum.

If you want to know the average value at the SM level then you would look at

  • sm__inst_executed.avg

Active cycles is different as it depends on work distribution. sm__cycles_active.sum and smsp__cycles_active.sum are not equivalent.

The following conditions are true for .sum:

smsp__cycles_active.sum <= (sm__cycles_active.sum * 4)
smsp__cycles_active.sum may be a few cycles less than sm__cycles_active.sum

The following conditions are true for .avg:

smsp__cycles_active.avg <=  sm__cycles_active.avg

These relationships differ because an SM counts as active as long as at least 1 warp is active on any 1 of its SMSPs. If the launch has too little work to put 1 active warp on every SMSP, or if warps exit early, some SMSPs report idle while the SM still reports active.
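A toy model may make the two inequalities easier to see. This is only a sketch, not Nsight Compute output: the per-SMSP cycle counts are made up, and it assumes all SMSPs on an SM start together, so that an SM stays active exactly as long as its busiest SMSP.

```python
# Hypothetical active-cycle counts for the 4 SMSPs on each of 3 SMs.
# Simplifying assumption: all SMSPs on an SM start at the same cycle, so the
# SM is active for as long as its busiest SMSP is active.
smsp_cycles_active = [
    [1000, 1000, 1000, 1000],  # fully loaded SM
    [1000, 400, 0, 0],         # not enough work for every SMSP
    [900, 850, 800, 100],      # warps on one SMSP exit early
]

# Per-SM active cycles under the assumption above.
sm_cycles_active = [max(sm) for sm in smsp_cycles_active]

# .sum rollups across all instances of each unit.
smsp_sum = sum(sum(sm) for sm in smsp_cycles_active)
sm_sum = sum(sm_cycles_active)

# .avg rollups: .sum divided by the number of instances of that unit.
smsp_avg = smsp_sum / (4 * len(smsp_cycles_active))
sm_avg = sm_sum / len(sm_cycles_active)

print(f"smsp .sum = {smsp_sum}, sm .sum * 4 = {sm_sum * 4}")
print(f"smsp .avg = {smsp_avg:.1f}, sm .avg = {sm_avg:.1f}")
```

In this model the two sides are only equal when every SMSP on every SM is busy for the whole time its SM is active.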

If you want the device level value for sm__cycles_active then use .sum.

If you are comparing IPC between two devices then I would recommend

sm__inst_executed.sum.per_cycle_elapsed = sm__inst_executed.sum / sm__cycles_elapsed.avg

For Volta through GA10x the value will be in the range [0, SmCount * 4]

If you want to only consider when SMs are active then use

sm__inst_executed.sum.per_cycle_active = sm__inst_executed.sum / sm__cycles_active.avg

If you want to compare efficiency on each SM between two devices then

sm__inst_executed.avg.per_cycle_active = sm__inst_executed.avg / sm__cycles_active.avg
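As a rough illustration of how the three ratios above relate, here is a sketch that computes them from per-SM counters. The counter values and the 4-SM device are hypothetical, and `per_cycle` is a helper name made up for this example, not part of any NVIDIA tool.

```python
from statistics import mean

# Hypothetical per-SM counters for a 4-SM device (not real measurements).
inst_executed = [4000, 3600, 3800, 1000]   # sm__inst_executed on each SM
cycles_active = [2000, 1900, 2000, 600]    # sm__cycles_active on each SM
cycles_elapsed = [2100, 2100, 2100, 2100]  # sm__cycles_elapsed on each SM


def per_cycle(numerator, cycles):
    """Divide a rolled-up counter by the average cycle count across SMs."""
    return numerator / mean(cycles)


# Device-level IPC over the whole measurement window.
sum_per_cycle_elapsed = per_cycle(sum(inst_executed), cycles_elapsed)
# Device-level IPC counting only cycles in which SMs were active.
sum_per_cycle_active = per_cycle(sum(inst_executed), cycles_active)
# Per-SM efficiency: average instructions per SM over average active cycles.
avg_per_cycle_active = per_cycle(mean(inst_executed), cycles_active)

print(f"sum.per_cycle_elapsed = {sum_per_cycle_elapsed:.2f}")
print(f"sum.per_cycle_active  = {sum_per_cycle_active:.2f}")
print(f"avg.per_cycle_active  = {avg_per_cycle_active:.2f}")
```

Because the denominator is the same average in both per_cycle_active forms, the .sum variant comes out as the SM count times the .avg variant.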

Thanks for the detailed information. I understand that. What is still unclear for me is this:

sm__cycles_active.sum	cycle	33.512.923
sm__inst_executed.sum	inst	58.045.167
sm__inst_executed.avg.per_cycle_active	inst/cycle	1,73
sm__inst_executed.sum.per_cycle_elapsed	inst/cycle	116,71

I expect that sm__inst_executed.sum means the total number of instructions on all SMs, i.e. sm__*.sum means instructions_on_sm_1+…+instructions_on_sm_68. I expect the same for sm__cycles_active.sum. Therefore, by dividing these two, I expect to get the total number of instructions on the device divided by the total active cycles. However, the result is 1.73, which is equal to the SM IPC. Maybe I have misunderstood the meaning of sum. Can you clarify that?

The denominator of .per_cycle_{active, elapsed} and .pct_of_peak_sustained_{active, elapsed} is always based on {unit}__cycles_{active, elapsed}.avg.

sm__inst_executed.avg.per_cycle_active = sm__inst_executed.avg / sm__cycles_active.avg
sm__inst_executed.sum.per_cycle_active = sm__inst_executed.sum / sm__cycles_active.avg
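Plugging the reported numbers into these two formulas may help show why sum / sum gives the per-SM IPC. This is a sketch that assumes .avg is .sum divided by the number of SMs, and that the device is the 68-SM RTX 3080 mentioned at the start of the thread; neither is confirmed for this second set of numbers.

```python
SM_COUNT = 68  # assumption: the RTX 3080 SM count quoted earlier in the thread

inst_sum = 58_045_167           # reported sm__inst_executed.sum
cycles_active_sum = 33_512_923  # reported sm__cycles_active.sum

# Assumption: .avg is .sum divided by the number of SM instances.
inst_avg = inst_sum / SM_COUNT
cycles_active_avg = cycles_active_sum / SM_COUNT

# sum / sum: the SM count cancels, so this is the same as avg / avg.
sum_over_sum = inst_sum / cycles_active_sum
# avg / avg: the formula for sm__inst_executed.avg.per_cycle_active.
avg_per_cycle_active = inst_avg / cycles_active_avg
# sum / avg: the formula for sm__inst_executed.sum.per_cycle_active.
sum_per_cycle_active = inst_sum / cycles_active_avg

print(round(sum_over_sum, 2))
print(round(avg_per_cycle_active, 2))
print(round(sum_per_cycle_active, 2))
```

Under these assumptions the first two values both round to the reported 1.73, and the third comes out at roughly 117.8, a little above the reported sm__inst_executed.sum.per_cycle_elapsed of 116,71. That is the direction one would expect if elapsed cycles are at least as large as active cycles.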