What is the meaning of sm__pipe_fp64_cycles_active[burst/sustained]


I’ve read the documentation about burst/sustained metrics but, unfortunately, I still don’t understand what burst and sustained mean.
In particular, in relation to the output (AI100):

sm__pipe_fp64_cycles_active.avg.peak_burst                                                                          4
sm__pipe_fp64_cycles_active.avg.peak_sustained                                                                      4
smsp__inst_executed_pipe_fp64.avg.pct_of_peak_burst_active                           %                          19.95
smsp__inst_executed_pipe_fp64.avg.pct_of_peak_sustained_active                       %                          79.81
  1. What is the meaning of sm__pipe_fp64_cycles_active.avg.peak_[burst/sustained]_active?
  2. What is the meaning of smsp__inst_executed_pipe_fp64.avg.pct_of_peak_[burst/sustained]_active, and why is burst < sustained?

Perhaps the answer to my first question will help me understand the second one.

This is documented in the Nsight Compute Kernel Profiling Guide in Section 3.2 Metrics Structure.

Two types of peak rates are available for every counter: burst and sustained. Burst rate is the maximum rate reportable in a single clock cycle. Sustained rate is the maximum rate achievable over an infinitely long measurement period, for “typical” operations. For many counters, burst equals sustained. Since the burst rate cannot be exceeded, percentages of burst rate will always be less than 100%. Percentages of sustained rate can occasionally exceed 100% in edge cases.
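As a rough illustration of the definition quoted above, the sketch below expresses one measured rate as a percentage of two different peaks. The peak values and the measured rate are hypothetical numbers chosen for illustration, not values reported by Nsight Compute, and `pct_of_peak` is a made-up helper name.

```python
def pct_of_peak(rate_per_cycle: float, peak_per_cycle: float) -> float:
    """Return a measured rate as a percentage of a peak rate."""
    if peak_per_cycle <= 0:
        raise ValueError("peak rate must be positive")
    return 100.0 * rate_per_cycle / peak_per_cycle


# Hypothetical peaks, in the same arbitrary units per cycle.
# Assumption: the burst peak is 4x the sustained peak.
PEAK_BURST = 4.0
PEAK_SUSTAINED = 1.0

# Hypothetical measured rate, in the same units per cycle.
measured = 0.7981

pct_burst = pct_of_peak(measured, PEAK_BURST)
pct_sustained = pct_of_peak(measured, PEAK_SUSTAINED)

# Same measurement, two denominators: the burst percentage is smaller
# whenever the burst peak is larger than the sustained peak.
print(f"pct of burst peak:     {pct_burst:.2f}")
print(f"pct of sustained peak: {pct_sustained:.2f}")
```

Under these assumptions the burst percentage is simply the sustained percentage scaled by the ratio of the two peaks, which is one way a burst figure can come out lower than a sustained figure for the same run.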

For SW development I would recommend using only “sustained” metrics. Nsight Compute should not be collecting any “burst” metrics.

  1. What is the meaning of sm__pipe_fp64_cycles_active.avg.peak_[burst/sustained]_active?

This is the primary throughput metric for the SM FP64 math pipes. The output of this metric is a ratio between 0 and 1.

sm__pipe_fp64_cycles_active.avg.peak_sustained_active = sm__pipe_fp64_cycles_active.avg / (sm__pipe_fp64_cycles_active.avg.peak_sustained x sm__cycles_active.avg)

sm__pipe_fp64_cycles_active.avg = # of cycles the SM FP64 units (1 or 4) are active, averaged across all SMs
.peak_sustained_active = # of cycles this could be sustained per cycle
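The formula above can be sketched in a few lines. The function name and the sample counter values are hypothetical; only the peak of 4 comes from the output quoted in the question.

```python
def fraction_of_peak_sustained_active(
    counter_avg: float,
    peak_sustained: float,
    cycles_active_avg: float,
) -> float:
    """Ratio of a counter to its sustained peak over the active cycles.

    Mirrors the formula in the reply:
        counter.avg / (counter.avg.peak_sustained x sm__cycles_active.avg)
    """
    denominator = peak_sustained * cycles_active_avg
    if denominator <= 0:
        raise ValueError("peak and active cycles must be positive")
    return counter_avg / denominator


# Hypothetical sample values (not taken from a real profile):
fp64_cycles_active_avg = 2_000_000   # sm__pipe_fp64_cycles_active.avg
sm_cycles_active_avg = 1_000_000     # sm__cycles_active.avg
peak_sustained = 4                   # .avg.peak_sustained from the quoted output

ratio = fraction_of_peak_sustained_active(
    fp64_cycles_active_avg, peak_sustained, sm_cycles_active_avg
)
print(ratio)  # 0.5
```

Multiplying the ratio by 100 would give the corresponding percentage form of the metric.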
  2. What is the meaning of smsp__inst_executed_pipe_fp64.avg.pct_of_peak_[burst/sustained]_active, and why is burst < sustained?

This is the primary throughput metric for the SMSP FP64 math pipes as a % of active cycles. On GV100 and GA100 each SM sub-partition has an FP64 unit. On graphics-oriented parts there is 1 FP64 unit per SM, often at a significantly lower throughput.

Please do not use the “burst” metrics. The value is not set correctly for all metrics.

Nsight Compute tends to use “active” rather than “elapsed” because it focuses on the times when the GPU should be fully used. Look at sm__cycles_active.avg.pct_of_peak_sustained_active first to determine whether entire SMs are idle.

Nsight Systems GPU metrics collect counters over time and tend to use “elapsed” cycles, which results in each SM counter being reduced by the activity of the SM.


@Greg , thank you very much for your detailed explanations!
A quick question :) :
Let’s suppose ncu collected a sequence of measurements of sm__sass_thread_inst_executed_op_dadd_pred_on.avg (AI100, four units per SM):

sm__sass_thread_inst_executed_op_dadd_pred_on.avg.peak_burst                inst/cycle                            128
sm__sass_thread_inst_executed_op_dadd_pred_on.avg.peak_sustained            inst/cycle                             32

To simplify, let’s look at how the .avg metric is distributed over the measurement period. Can we assume that its values over cycles look like this?

 sm__sass_thread_inst_executed_op_dadd_pred_on.avg    |   cycle
128                                                   |   N
0                                                     |   N+1
0                                                     |   N+2
0                                                     |   N+3
128                                                   |   N+4
0                                                     |   N+5
0                                                     |   N+6
0                                                     |   N+7
....                                                  |   ...

Thanks in advance!

If the hardware had SM sub-partition counters for the number of predicated-true threads issuing a DADD, then you would see an increment of 0-32 on each cycle in which the SM warp scheduler issued an FP64 (DADD, DMUL, DFMA) instruction. For GV100 and GA100 the issue rate is every 4 cycles. For graphics-focused parts the FP64 unit is shared between all 4 SM warp schedulers and the issue rate is less than 8 threads/cycle.

  • Each time a warp scheduler issues a DADD instruction, the SMSP counter would increment by 0-32, based upon the active mask and guard predicate mask.
  • Each warp scheduler is independent so the increments would not be aligned across the 4 schedulers.
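The two points above can be illustrated with a toy model. This is only a sketch under stated assumptions: it pretends the hypothetical per-SMSP counter exists, that every scheduler issues a DADD exactly every 4 cycles (the GV100/GA100 issue rate mentioned above), and that all 32 threads are active and predicated true. The function names and the phase offsets are made up for illustration.

```python
NUM_SMSP = 4          # SM sub-partitions (warp schedulers) per SM
ISSUE_INTERVAL = 4    # cycles between FP64 issues per scheduler (GV100/GA100)


def popcount32(mask: int) -> int:
    """Number of set bits in a 32-bit mask."""
    return bin(mask & 0xFFFFFFFF).count("1")


def simulate(num_cycles, offsets, active_mask=0xFFFFFFFF, pred_mask=0xFFFFFFFF):
    """Return the SM-level counter increment for each cycle.

    offsets[i] is the (hypothetical) issue phase of scheduler i.
    Each issue adds the number of active, predicated-true threads (0-32).
    """
    increment = popcount32(active_mask & pred_mask)
    per_cycle = []
    for cycle in range(num_cycles):
        total = 0
        for offset in offsets:
            if cycle % ISSUE_INTERVAL == offset % ISSUE_INTERVAL:
                total += increment
        per_cycle.append(total)
    return per_cycle


# All four schedulers happen to issue on the same cycle.
aligned = simulate(8, offsets=[0, 0, 0, 0])
# Schedulers are independent, so their issues are spread over the cycles.
staggered = simulate(8, offsets=[0, 1, 2, 3])

print(aligned)    # [128, 0, 0, 0, 128, 0, 0, 0]
print(staggered)  # [32, 32, 32, 32, 32, 32, 32, 32]
print(sum(aligned) / len(aligned), sum(staggered) / len(staggered))  # 32.0 32.0
```

In this model the largest single-cycle increment is 128 and the long-run average is 32 threads per cycle regardless of alignment, which lines up with the peak_burst and peak_sustained values quoted in the question. Partial masks lower the increment: for example, `simulate(4, [0], active_mask=0x0000FFFF)` adds 16 on the issuing cycle.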

Given that the counter referenced is a SASS metric, there would be 100s of cycles between each DADD instruction issue, due to the complexity of the assembly code patch that counts the number of predicated-true threads for that instruction.

