What exactly does SM Active Cycles mean?

thiltuiv · June 20, 2024, 3:09am

Hello!
I used the Nsight Compute CLI to check the performance metrics of two kernels, and the results are as follows:

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         9.57
    SM Frequency            cycle/nsecond         2.04
    Elapsed Cycles                  cycle      120,274
    Memory Throughput                   %         7.64
    DRAM Throughput                     %         3.15
    Duration                      usecond        58.56
    L1/TEX Cache Throughput             %         8.32
    L2 Cache Throughput                 %         3.07
    SM Active Cycles                cycle   109,837.59
    Compute (SM) Throughput             %        29.10
    ----------------------- ------------- ------------

    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         9.91
    SM Frequency            cycle/nsecond         2.14
    Elapsed Cycles                  cycle      227,509
    Memory Throughput                   %        95.23
    DRAM Throughput                     %        95.23
    Duration                      usecond       106.14
    L1/TEX Cache Throughput             %        18.47
    L2 Cache Throughput                 %        41.26
    SM Active Cycles                cycle   222,054.82
    Compute (SM) Throughput             %         9.38
    ----------------------- ------------- ------------

My question is: why does the kernel with the lower Compute (SM) Throughput have the higher SM Active Cycles? What exactly does SM Active Cycles mean?

In addition, I didn’t find a detailed description of these metrics in the documentation, please let me know if they exist, thanks!

lssyes_shuai · June 20, 2024, 9:17am

I have the same question regarding the relationship between Compute (SM) Throughput and SM Active Cycles. Also, I haven’t been able to find detailed descriptions of these metrics in the documentation. If anyone could provide insights or point to resources that explain these metrics more thoroughly, it would be greatly appreciated.

Greg · June 20, 2024, 3:21pm

The NCU CLI --query-metrics option can be used to query simple descriptions for each metric.

    Nsight Compute 2024.1.0>ncu --query-metrics | grep sm__cycles_active
    sm__cycles_active Counter cycle # of cycles with at least one warp in flight

The NCU user interface can provide additional detail through tooltips and the Metrics Details pane accessible through the top level Profile menu.

sm__cycles_active.avg is the number of cycles during which an SM had at least 1 warp resident, averaged across SMs.

sm__cycles_active.avg.pct_of_peak_sustained_elapsed is the percentage of elapsed cycles the SM was active (sm__cycles_active.avg / sm__cycles_elapsed.avg * 100). If this value is low, then there were insufficient thread blocks to saturate the GPU or the kernel has a tail effect.

<unit>__cycles_active.avg helps determine if a unit was active.
<unit>__cycles_elapsed.avg is the total number of cycles in the clock domain for the capture.

In the example provided, kernel 1 ran for about 120K elapsed cycles and kernel 2 for about 227K. The SMs in the second kernel were active for a higher percentage of elapsed cycles: 222K/227K ≈ 97.6%, versus 110K/120K ≈ 91.3% for the first kernel.
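To make that ratio easy to reproduce, here is a minimal Python sketch that recomputes it from the unrounded table values. It assumes, as the arithmetic above does, that the "Elapsed Cycles" row can stand in for sm__cycles_elapsed.avg; the helper name `active_pct` is just illustrative and is not an NCU API.

```python
# Values copied from the two "GPU Speed Of Light Throughput" tables in the question.
kernels = {
    "kernel 1": {"sm_active_cycles": 109_837.59, "elapsed_cycles": 120_274},
    "kernel 2": {"sm_active_cycles": 222_054.82, "elapsed_cycles": 227_509},
}


def active_pct(sm_active_cycles: float, elapsed_cycles: float) -> float:
    """Percentage of elapsed cycles during which the SMs had a warp resident.

    Assumption: "Elapsed Cycles" is used in place of sm__cycles_elapsed.avg.
    """
    return sm_active_cycles / elapsed_cycles * 100.0


for name, counters in kernels.items():
    print(f"{name}: SMs active for {active_pct(**counters):.1f}% of elapsed cycles")
```

With the unrounded inputs this prints about 91.3% for kernel 1 and 97.6% for kernel 2, so both kernels keep the SMs occupied for most of the run even though their throughput numbers differ widely.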

The activity of a unit says nothing about its throughput or efficiency. An SM can be active with 1 warp doing dependent memory reads, resulting in an SM Throughput below 1%, or it can be active with 4 warps (1 per sub-partition) issuing a long sequence of independent FFMA instructions and reaching nearly 100% SM Throughput.
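The difference between "active" and "busy" can be illustrated with a deliberately crude toy model in Python. Everything here is an assumption for illustration: the function name, the idea that each of 4 sub-partitions issues at most one instruction per cycle, and the 400-cycle gap between dependent loads. The real SM Throughput metric combines many contributing metrics and is not computed this way.

```python
def issue_utilization(warps: int, cycles_between_issues: int, schedulers: int = 4) -> float:
    """Fraction of peak issue slots used while the SM is active (toy model).

    Assumptions: each scheduler (sub-partition) issues at most one instruction
    per cycle, warps are spread evenly over schedulers, and every warp issues
    one instruction every `cycles_between_issues` cycles (1 = independent
    instructions back to back, large = stalled on dependent memory reads).
    """
    demanded_per_cycle = warps / cycles_between_issues
    issued_per_cycle = min(schedulers, demanded_per_cycle)
    return issued_per_cycle / schedulers


# Hypothetical scenarios; the 400-cycle memory latency is an illustrative guess.
scenarios = {
    "1 warp, dependent memory reads": issue_utilization(warps=1, cycles_between_issues=400),
    "4 warps, independent FFMA": issue_utilization(warps=4, cycles_between_issues=1),
}

for name, utilization in scenarios.items():
    # In both scenarios at least one warp is resident every cycle, so
    # sm__cycles_active would count every cycle as active.
    print(f"{name}: SM active = True, issue utilization = {utilization:.4%}")
```

In this model both scenarios count as 100% active, yet the single stalled warp uses well under 1% of the issue slots while the four FFMA warps use all of them.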

The Throughput metrics measure how close the unit was to reaching its maximum sustained throughput. A major unit, such as the SM, has many metrics that contribute to the final SM Throughput. The contributing metrics are available in the detailed version of the GPU SOL Throughput section.

The first kernel appears not to have sufficient thread blocks/warps to hide latency, so all Throughputs are low.

The second kernel is DRAM and L2 cache limited. Given the low L1/TEX and SM Throughput, there are insufficient warps to hide L2 miss latency.
