Metric references and description

Hello,

I am working to fetch some kernel metrics using Nsight Compute CLI.

I am having a hard time finding information about the meaning of some of the metrics the tool records.

Could someone point me to the resources where I can find this information?

Most importantly, I would be interested in the following metrics:

  1. SOL FB
  2. SM Frequency
  3. SM %
  4. Memory Pipes Busy
  5. Block Limit registers
  6. Difference between Achieved Occupancy and SM %

Also, could I get some information about which metrics I could use to capture shared memory usage?

Thanks
Lakshay

Hi Lakshay,

We are continuously trying to improve the documentation for the metrics we expose in Nsight Compute, and the tool also offers some ways to better understand what the various metrics represent. I hope the following will be helpful.

Summary
Nsight Compute can help determine the performance limiter of a CUDA kernel. These fall into the following high-level categories:

  • Compute-Throughput-Bound: High value of ‘SM %’.
  • Memory-Throughput-Bound: High value for any of ‘Memory Pipes Busy’, ‘SOL L1/TEX’, ‘SOL L2’, or ‘SOL FB’.
  • Latency-Bound: ‘Achieved Occupancy’ is high, ‘No Eligible’ is high, but none of the throughputs are high.

Occupancy is a measure of resident and scheduled CUDA threads, at a warp granularity. Scheduled does not imply instruction execution, although a greater pool of scheduled warps (higher occupancy) increases the chance of instruction issue per cycle. Maximum warp occupancy is limited by registers-per-thread, warps-per-block, and shared-memory-per-block.

Metrics
For any value you see in the report of the tool, you can derive the underlying metric name. In the UI, the metric name and description are shown in a tooltip when you hover over a metric label. On the CLI, you can get to the same information by searching the ‘sections’ sub-folder for the section you are interested in and opening that file in a text editor. In that file, you will see the definitions for all pairs of labels and their metric names. Once you know the metric name, you can query its short description using ‘nv-nsight-cu-cli --query-metrics’. That command lists all available metric base names along with their descriptions. For your set of metrics, this results in the following table (sorry for the layout, I hope it is readable):

=============================================================================================================================================================
| Label             | Metric                                                  | Short Description                                                           |
=============================================================================================================================================================
| SOL FB            | gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed  | GPU DRAM throughput                                                         |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| SM Frequency      | gpc__cycles_elapsed.avg.per_second                      | # of cycles where GPC was active                                            |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| SM %              | sm__throughput.avg.pct_of_peak_sustained_elapsed        | SM throughput assuming ideal load balancing across SMSPs                    |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Memory Pipes Busy | sm__memory_throughput.avg.pct_of_peak_sustained_elapsed | SM memory instruction throughput assuming ideal load balancing across SMSPs |
=============================================================================================================================================================
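
For example, once you know the full metric names, you can collect exactly this set of metrics directly from the command line (‘<TARGET_APP>’ is just a placeholder for your application):

nv-nsight-cu-cli --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,gpc__cycles_elapsed.avg.per_second,sm__throughput.avg.pct_of_peak_sustained_elapsed,sm__memory_throughput.avg.pct_of_peak_sustained_elapsed <TARGET_APP>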

The naming convention for the metrics is documented at https://docs.nvidia.com/cupti/Cupti/r_main.html#r_host_derived_metrics_api. The idea is that the structure of the name helps to understand what a metric represents. The first part before the ‘__’ indicates on which unit the metric is collected. This can be followed by additional sub-units or pipeline stages that further detail where the data originates from. Following the base name of a metric, the suffixes after the first ‘.’ specify how the data is rolled up into a single value and whether it is reported as a raw value or as a ratio to various common baselines.

For example, the metric ‘gpc__cycles_elapsed.avg.per_second’ can be broken apart this way: The metric is collected per Graphics Processing Cluster (GPC). For each of the existing GPCs, the number of elapsed cycles while executing the kernel is collected. If your GPU has multiple GPCs (a TU102, for example, has 6 GPCs), the ‘.avg’ specifies that we are interested in the average of the values obtained per GPC instance. Finally, the ‘.per_second’ requests that instead of reporting raw cycles, we divide the cycles by the wall-clock time it took to execute the kernel in seconds. Consequently, that ratio of cycles / second is a representation of the average frequency all GPCs were operating at during the kernel launch.
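
To make the rollup concrete, here is a minimal sketch of the arithmetic behind ‘.avg’ and ‘.per_second’ (the per-GPC cycle counts and kernel duration below are made-up values, not measured data):

import statistics

# Made-up per-GPC elapsed cycle counts for one kernel launch (assuming 6 GPCs, e.g. a TU102)
cycles_per_gpc = [1_385_000, 1_384_500, 1_385_200, 1_384_900, 1_385_100, 1_384_800]
kernel_duration_s = 0.001  # assumed wall-clock duration of the kernel: 1 ms

avg_cycles = statistics.mean(cycles_per_gpc)       # the '.avg' rollup across GPC instances
sm_frequency_hz = avg_cycles / kernel_duration_s   # the '.per_second' ratio
print(f"SM Frequency ~ {sm_frequency_hz / 1e9:.3f} GHz")  # roughly 1.385 GHz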

The term SOL% means “% of Speed-of-Light”, the theoretical maximum # of operations-per-cycle for a given GPU. It has the same meaning as %-of-peak-throughput.

Throughput Metrics

You can deconstruct most other metrics in a similar way, but some metrics might require additional steps. Higher-level throughput metrics combine multiple underlying metrics into a single, aggregate value. The goal is to provide a single indicator for a whole set of pipelines or units, instead of having to go through all the individual metrics. Nsight Compute allows you to break down high-level metrics into their lower-level input metrics and report the individual results. For example, the SOL Breakdown tables in the Speed Of Light section in version 2019.5 are implemented this way. The syntax for the command line is ‘nv-nsight-cu-cli --metrics breakdown:sm__memory_throughput.avg <TARGET_APP>’. The output of this command will report the following metrics:

idc__request_cycles_active.avg
sm__inst_executed_pipe_adu.avg
sm__inst_executed_pipe_ipa.avg
sm__inst_executed_pipe_lsu.avg
sm__inst_executed_pipe_tex.avg
sm__mio2rf_writeback_active.avg
sm__mio_pq_read_cycles_active.avg
sm__mio_pq_write_cycles_active.avg

Nearly all metrics that can be broken down this way aggregate the higher-level metric by using the max of the requested ratio. In the case of ‘sm__memory_throughput.avg.pct_of_peak_sustained_elapsed’, this ratio puts each input metric in relation to its corresponding peak value, so the result is a percentage of the peak utilization for each of the input metrics. The high-level metric then reports the maximum of these percentage-of-peak values across its sub-metrics. In this case, this is used to state which unit/pipeline/bus of the SM is busiest when handling memory instructions. The input metrics include coverage for the InDexed Constant Cache (IDC), several execution pipelines that handle memory instructions (e.g. the load/store unit (LSU) and the texture unit (TEX)), as well as the return/request data path to the memory input/output unit (MIO).
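
To illustrate that max-of-ratios rollup, here is a minimal sketch in Python (the percentage values are invented for illustration; the real metric is computed by the tool from hardware counters):

# Invented pct_of_peak_sustained_elapsed values for the breakdown of sm__memory_throughput
breakdown_pct_of_peak = {
    "idc__request_cycles_active":      3.1,
    "sm__inst_executed_pipe_adu":      0.8,
    "sm__inst_executed_pipe_ipa":      0.0,
    "sm__inst_executed_pipe_lsu":     62.0,
    "sm__inst_executed_pipe_tex":     18.5,
    "sm__mio2rf_writeback_active":    40.2,
    "sm__mio_pq_read_cycles_active":  35.7,
    "sm__mio_pq_write_cycles_active": 28.9,
}

# The high-level metric reports the busiest input, i.e. the maximum percentage of peak.
memory_pipes_busy = max(breakdown_pct_of_peak.values())  # 62.0 -> the LSU pipe is the limiter
print(f"Memory Pipes Busy ~ {memory_pipes_busy:.1f}%")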

Just for completeness, there is the breakdown of the metrics you mentioned:

=================================================================================================================================================================
| Metric                       | Breakdown                                  | Short Description                                                                 |
=================================================================================================================================================================
| gpu__dram_throughput.avg.pct | dram__cycles_active.avg                    | # of cycles where DRAM was active                                                 |
|                              | fbpa__dram_sectors.avg                     | # of DRAM sectors accessed                                                        |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| sm__throughput.avg           | idc__request_cycles_active.avg             | # of cycles where IDC processed requests from SM                                  |
|                              | sm__inst_executed.avg                      | # of warp instructions executed                                                   |
|                              | sm__inst_executed_pipe_adu.avg             | # of warp instructions executed by adu pipe                                       |
|                              | sm__inst_executed_pipe_cbu_pred_on_any.avg | # of warp instructions executed by cbu pipe with at least 1 thread predicated on  |
|                              | sm__inst_executed_pipe_fp16.avg            | # of warp instructions executed by fp16 pipe                                      |
|                              | sm__inst_executed_pipe_ipa.avg             | # of warp instructions executed by ipa pipe                                       |
|                              | sm__inst_executed_pipe_lsu.avg             | # of warp instructions executed by lsu pipe                                       |
|                              | sm__inst_executed_pipe_tex.avg             | # of warp instructions executed by tex pipe                                       |
|                              | sm__inst_executed_pipe_uniform.avg         | # of warp instructions executed by uniform pipe                                   |
|                              | sm__inst_executed_pipe_xu.avg              | # of warp instructions executed by xu pipe                                        |
|                              | sm__issue_active.avg                       | # of cycles where an SMSP issued an instruction                                   |
|                              | sm__mio2rf_writeback_active.avg            | # of cycles where the MIO to register file writeback interface was active         |
|                              | sm__mio_inst_issued.avg                    | # of instructions issued from MIOC to MIO                                         |
|                              | sm__mio_pq_read_cycles_active.avg          | # of cycles where MIOP PQ sent register operands to a pipeline                    |
|                              | sm__mio_pq_write_cycles_active.avg         | # of cycles where register operands from the register file were written to MIO PQ |
|                              | sm__pipe_alu_cycles_active.avg             | # of cycles where alu pipe was active                                             |
|                              | sm__pipe_fma_cycles_active.avg             | # of cycles where fma pipe was active                                             |
|                              | sm__pipe_fp64_cycles_active.avg            | # of cycles where fp64 pipe was active                                            |
|                              | sm__pipe_shared_cycles_active.avg          | # of cycles where the 'shared pipe' fp16+tensor was active                        |
|                              | sm__pipe_tensor_cycles_active.avg          | # of cycles where tensor pipe was active                                          |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| sm__memory_throughput.avg    | idc__request_cycles_active.avg             | # of cycles where IDC processed requests from SM                                  |
|                              | sm__inst_executed_pipe_adu.avg             | # of warp instructions executed by adu pipe                                       |
|                              | sm__inst_executed_pipe_ipa.avg             | # of warp instructions executed by ipa pipe                                       |
|                              | sm__inst_executed_pipe_lsu.avg             | # of warp instructions executed by lsu pipe                                       |
|                              | sm__inst_executed_pipe_tex.avg             | # of warp instructions executed by tex pipe                                       |
|                              | sm__mio2rf_writeback_active.avg            | # of cycles where the MIO to register file writeback interface was active         |
|                              | sm__mio_pq_read_cycles_active.avg          | # of cycles where MIOP PQ sent register operands to a pipeline                    |
|                              | sm__mio_pq_write_cycles_active.avg         | # of cycles where register operands from the register file were written to MIO PQ |
=================================================================================================================================================================

These breakdowns may vary per architecture; the one shown here is for a Turing class card. Of the three high-level metrics discussed so far, ‘SOL FB’ describes the percentage of peak performance the on-device memory achieved. This includes the total number of cycles the DRAM was busy as well as the total number of sectors accessed. ‘Memory Pipes Busy’ covers the parts of the SM that handle memory instructions. It is also reported as a percentage of the possible peak value and indicates the busiest unit in that part of the chip. Likewise, ‘SM %’ reports the busiest unit across the entire SM, not just the parts related to memory instructions.

In fact, ‘sm__throughput.avg’ actually contains all of ‘sm__memory_throughput.avg’, as high-level metrics can take other derived metrics as input. The breakdown allows you to discover that as well by specifying the recursion level for the expansion of a metric. If you run ‘nv-nsight-cu-cli --metrics breakdown:1:sm__throughput.avg <TARGET_APP>’ (note the additional ‘:1’ in the command), the metric will only be broken down one level and no further expansion happens. This results in a breakdown into only the following two metrics:

sm__instruction_throughput.avg
sm__memory_throughput.avg

And as discussed above, each of these expands into its own set of sub-metrics again. In this example, the high-level metric ‘SM %’ represents the maximum percentage-of-peak value across all of the SM instruction throughputs and SM memory throughputs.

Occupancy vs. Throughput
Your other two questions were about ‘Block Limit registers’ and the difference between ‘Achieved Occupancy’ and ‘SM %’. Occupancy, or Warp Occupancy, is defined as the ratio of the active warps on an SM to the maximum number of active warps supported by the SM. Occupancy varies over time as warps begin and end, and can be different for each SM. The launch configuration, the compile options for the kernel, and the capabilities of the target device define an upper limit on how many active warps can run on an SM (Theoretical Occupancy). One of the limiters of the theoretical occupancy is how many registers each thread of the kernel requires: each SM has a fixed-size register file available, and the more registers each thread of a kernel requires, the fewer active warps can run in parallel on each SM. The metric ‘Block Limit registers’ describes the maximum number of blocks that can run on each SM due to this register constraint. The lowest of the ‘Block Limit’ metrics defines the overall ‘Theoretical Occupancy’.

In contrast to the theoretical upper limit, ‘Achieved Occupancy’ is collected at runtime. The underlying metric is ‘sm__warps_active.avg.pct_of_peak_sustained_active’. It models exactly the ratio of the number of active warps over the theoretical maximum for the chip. However, just because a warp is actively scheduled on an SM does not necessarily mean that it makes efficient use of the available execution resources of the GPU. ‘Achieved Occupancy’ tells you the average percentage of how full you keep the GPU with respect to warps only. ‘SM %’ tells you the peak percentage utilization of the busiest SM execution resource. Maybe think of the difference this way: a kernel can completely fill the SMs with active warps, but barely stress any compute unit. Or, a handful of warps can completely saturate the floating-point units.
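
As a rough, hypothetical illustration of how a register-based block limit constrains occupancy (register file size, warp limits, and allocation granularity are architecture dependent and the compiler may round register counts, so treat all numbers below as assumptions):

# Assumed kernel and device properties, purely for illustration
regs_per_thread   = 96      # registers required per thread (from the compiler)
threads_per_block = 256     # launch configuration
regs_per_sm       = 65536   # size of one SM's register file (architecture dependent)
max_warps_per_sm  = 32      # hardware limit on resident warps per SM (architecture dependent)

regs_per_block        = regs_per_thread * threads_per_block  # 24576 registers per block
block_limit_registers = regs_per_sm // regs_per_block        # 2 blocks fit per SM
active_warps          = block_limit_registers * (threads_per_block // 32)  # 16 warps
theoretical_occupancy = active_warps / max_warps_per_sm      # 0.5 -> 50%
print(f"Block Limit registers: {block_limit_registers}, Theoretical Occupancy: {theoretical_occupancy:.0%}")

In that sketch, the register limit (not the launch configuration) caps the theoretical occupancy at 50%, and the runtime ‘Achieved Occupancy’ can only be at or below that value.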

While we work on providing more documentation for the metrics, I recommend the following resources to understand some of the metrics discussed here in more detail: https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels, https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm, and https://developer.nvidia.com/gtc/2019/video/S9345.

Thank you,
Magnus

Thanks Magnus,

That helps

Lakshay

Hi mstrengert, we own a cluster of GPU cards, and we are trying to determine how much of the available computing power (as a percentage) we actually get out of the GPU cards, from a cluster point of view, so that we can set a target for optimizing the utilization of our expensive GPU cards.

So we need a cluster-level metric that:

  1. is job-independent and easy to collect;
  2. represents the actual efficiency of the computing cores (we treat computing cores as the most important resource; memory resources are supplementary to the computing cores);
  3. is as accurate as possible (it does not need to be 100% accurate).

With these requirements, we ruled out GPU Utilization because it is way too coarse-grained. SM Activation and SM Occupancy seem to be good candidates, but they can’t distinguish stalled warps from actively computing warps.

The metric that seems a good fit is SM SOL% in Nsight Compute, but that metric is kernel-level. So we wonder if we could define a compound metric that shares the same semantics/formula, only at the cluster level. That’s how we found this post.

According to your detailed explanation, SM SOL% (sm__throughput.avg) is the maximum of the ~20 aggregated lower-level metrics (e.g., sm__inst_executed.avg), so is it possible that these lower-level metrics have a mapping to DCGM metrics? If so, we could define our cluster SM SOL% as max(a bunch of DCGM metrics). Skimming through the SM SOL% metrics, some of them, like sm__inst_executed_pipe_fp16, could be mapped to DCGM metrics, but others don’t seem to have a good mapping; for example, could we use DRAM_ACT to represent sm__inst_executed_pipe_lsu?

Also, does this cluster-level metric sound reasonable to you? What other metrics would you suggest we check out? Any comments would be a huge help to us! Thank you!

@mstrengert
hi I am using nisght 2023, I also want to know sm__memory_throughput breakdown information in my RTX4090, I tried : ncu --metrics breakdown:sm__memory_throughput.avg
but error with :==ERROR== Missing executable to launch.
How can I fix this problem ? Could you give me some help ?

Hi, @996937805

You need to add an application executable to the command line for the profiling analysis.

yes I tried this: sudo /usr/local/cuda/bin/ncu --metrics breakdown:sm__memory_throughput.avg ./sgemm1
then I can see breakdown metrics.
Thanks a lot!
