What exactly does SM Active Cycles mean?

Hello!
I used the Nsight Compute CLI to check the performance metrics of two kernels, and the results are as follows:

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         9.57
    SM Frequency            cycle/nsecond         2.04
    Elapsed Cycles                  cycle      120,274
    Memory Throughput                   %         7.64
    DRAM Throughput                     %         3.15
    Duration                      usecond        58.56
    L1/TEX Cache Throughput             %         8.32
    L2 Cache Throughput                 %         3.07
    SM Active Cycles                cycle   109,837.59
    Compute (SM) Throughput             %        29.10
    ----------------------- ------------- ------------

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         9.91
    SM Frequency            cycle/nsecond         2.14
    Elapsed Cycles                  cycle      227,509
    Memory Throughput                   %        95.23
    DRAM Throughput                     %        95.23
    Duration                      usecond       106.14
    L1/TEX Cache Throughput             %        18.47
    L2 Cache Throughput                 %        41.26
    SM Active Cycles                cycle   222,054.82
    Compute (SM) Throughput             %         9.38
    ----------------------- ------------- ------------

My question is: why does the kernel with the lower Compute (SM) Throughput have the higher SM Active Cycles? What exactly does SM Active Cycles mean?

In addition, I didn’t find a detailed description of these metrics in the documentation. Please let me know if one exists. Thanks!

I have the same question regarding the relationship between Compute (SM) Throughput and SM Active Cycles. Also, I haven’t been able to find detailed descriptions of these metrics in the documentation. If anyone could provide insights or point to resources that explain these metrics more thoroughly, it would be greatly appreciated.

The NCU CLI --query-metrics option can be used to query simple descriptions for each metric.

Nsight Compute 2024.1.0>ncu --query-metrics | grep sm__cycles_active
sm__cycles_active Counter cycle # of cycles with at least one warp in flight

The NCU user interface can provide additional detail through tooltips and the Metrics Details pane accessible through the top level Profile menu.

sm__cycles_active.avg is the number of cycles, averaged across all SMs, during which an SM had at least 1 warp resident.

sm__cycles_active.avg.pct_of_peak_sustained_elapsed is the percentage of elapsed cycles the SM was active (sm__cycles_active.avg / sm__cycles_elapsed.avg * 100). If this value is low, then either there were insufficient thread blocks to saturate the GPU or the kernel has a tail effect.

<unit>__cycles_active.avg helps determine if a unit was active.
<unit>__cycles_elapsed.avg is the total number of cycles in the clock domain for the capture.
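
To make the two causes of a low active percentage concrete, here is a minimal Python sketch. It is only an illustration: the 8-SM GPU, the per-SM cycle counts, and the helper name avg_active_pct are made up for the example and do not come from NCU. It mimics what the .avg rollup does, which is to average the per-SM counter before dividing by elapsed cycles.

```python
from statistics import mean

def avg_active_pct(per_sm_active_cycles, elapsed_cycles):
    """Average the per-SM active cycles, then express as % of elapsed cycles."""
    return 100.0 * mean(per_sm_active_cycles) / elapsed_cycles

# Hypothetical GPU with 8 SMs; the kernel takes 1,000 elapsed cycles.
# Too few thread blocks: only 4 SMs ever receive work.
few_blocks = [1000, 1000, 1000, 1000, 0, 0, 0, 0]

# Tail effect: every SM gets work, but two SMs run an extra block
# and keep going while the other six sit idle.
tail = [1000, 1000, 500, 500, 500, 500, 500, 500]

print(avg_active_pct(few_blocks, 1000))  # 50.0
print(avg_active_pct(tail, 1000))        # 62.5
```

Averaging across SMs is also why SM Active Cycles can be a fractional value such as 109,837.59.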

In the example provided, the duration of kernel 1 was 120K cycles and that of kernel 2 was 227K cycles. The SMs in the second kernel were active for a higher percentage of elapsed cycles: 222,054.82/227,509 ≈ 97.6% for kernel 2 vs. 109,837.59/120,274 ≈ 91.3% for kernel 1.
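
The same ratio can be reproduced from the unrounded table values with a few lines of Python. This is a rough sketch: it assumes the Elapsed Cycles row of the table is a reasonable stand-in for sm__cycles_elapsed.avg, and active_pct is just a helper name chosen for the example.

```python
def active_pct(sm_active_cycles, elapsed_cycles):
    """SM Active Cycles as a percentage of Elapsed Cycles."""
    return 100.0 * sm_active_cycles / elapsed_cycles

# Values copied from the two GPU Speed Of Light Throughput tables.
kernel1 = active_pct(109_837.59, 120_274)
kernel2 = active_pct(222_054.82, 227_509)

print(f"kernel 1: {kernel1:.1f}%")  # kernel 1: 91.3%
print(f"kernel 2: {kernel2:.1f}%")  # kernel 2: 97.6%
```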

The activity of a unit does not tell you the throughput or efficiency of that unit. An SM can be active with 1 warp doing dependent memory reads, resulting in an SM Throughput below 1%, or an SM can be active with 4 warps (1 per subpartition) issuing a long sequence of independent FFMA instructions, reaching nearly 100% SM Throughput.
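
The difference between "active" and "busy" can be seen in a toy Python model. This is a deliberately crude sketch, not how the hardware or NCU works: it assumes 4 schedulers, pins warp i to scheduler i % 4, lets each warp issue one instruction and then stall for a fixed number of cycles, and uses issue-slot utilization as a stand-in for SM Throughput (the real metric combines many contributing metrics). The function name toy_sm and the 399-cycle stall are invented for the example.

```python
def toy_sm(num_warps, stall_cycles, total_cycles=10_000, schedulers=4):
    """Return (active %, issue-slot utilization %) for a toy SM model."""
    next_ready = [0] * num_warps  # first cycle each warp may issue again
    issued = 0
    active_cycles = 0
    for cycle in range(total_cycles):
        if num_warps > 0:
            active_cycles += 1    # at least one warp is resident
        busy = set()              # schedulers that already issued this cycle
        for w in range(num_warps):
            s = w % schedulers
            if s not in busy and next_ready[w] <= cycle:
                busy.add(s)
                issued += 1
                next_ready[w] = cycle + 1 + stall_cycles
    active_pct = 100 * active_cycles / total_cycles
    issue_pct = 100 * issued / (total_cycles * schedulers)
    return active_pct, issue_pct

# One warp waiting ~400 cycles between dependent loads.
print(toy_sm(num_warps=1, stall_cycles=399))  # (100.0, 0.0625)
# Four warps, one per scheduler, issuing back-to-back independent math.
print(toy_sm(num_warps=4, stall_cycles=0))    # (100.0, 100.0)
```

Both cases report the SM as active 100% of the time, yet the utilization differs by three orders of magnitude.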

The Throughput metrics measure how close the unit was to reaching its maximum sustained throughput. A major unit, such as the SM, has many metrics that contribute to the final SM Throughput. The contributing metrics are available in the detailed version of the GPU SOL Throughput section.

The first kernel appears not to have sufficient thread blocks/warps to hide latency, so all Throughputs are low.

The second kernel is DRAM and L2 cache limited. Given the low L1/TEX and SM Throughput, there are insufficient warps to hide L2 miss latency.