Hi Lakshay,
We are continuously trying to improve the documentation for the metrics we expose in Nsight Compute, and the tool itself offers some ways to better understand what the various metrics represent. I hope the following will be helpful.
Summary
Nsight Compute can help determine the performance limiter of a CUDA kernel. These limiters fall into three high-level categories:
- Compute-Throughput-Bound: high value of "SM %".
- Memory-Throughput-Bound: high value for any of "Memory Pipes Busy", "SOL L1/TEX", "SOL L2", or "SOL FB".
- Latency-Bound: "Achieved Occupancy" is high and "No Eligible" is high, but none of the throughputs are high.
Occupancy is a measure of resident and scheduled CUDA threads, at a warp granularity. Scheduled does not imply instruction execution, although a greater pool of scheduled warps (higher occupancy) increases the chance of instruction issue per cycle. Maximum warp occupancy is limited by registers-per-thread, warps-per-block, and shared-memory-per-block.
Metrics
For any value you see in the tool's report, you can derive the underlying metric name. In the UI, the metric name and its description are shown in a tooltip when you hover over a metric label. On the CLI, you can get to the same information by searching the "sections" sub-folder for the section you are interested in and opening that file in a text editor. In that file, you will see the definitions for all pairs of labels and their metric names. Once you know the metric name, you can query its short description using "nv-nsight-cu-cli --query-metrics". That command lists all available metric base names side-by-side with their descriptions. For your set of metrics, this results in the following table (sorry for the layout, I hope this is readable):
=============================================================================================================================================================
| Label | Metric | Short Description |
=============================================================================================================================================================
| SOL FB | gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed | GPU DRAM throughput |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| SM Frequency | gpc__cycles_elapsed.avg.per_second | # of cycles elapsed on GPC |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| SM % | sm__throughput.avg.pct_of_peak_sustained_elapsed | SM throughput assuming ideal load balancing across SMSPs |
-------------------------------------------------------------------------------------------------------------------------------------------------------------
| Memory Pipes Busy | sm__memory_throughput.avg.pct_of_peak_sustained_elapsed | SM memory instruction throughput assuming ideal load balancing across SMSPs |
=============================================================================================================================================================
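If you want to look up such a description yourself, you can filter the rather long output of the query command, for example like this on a Unix-like shell (the grep filter is only an illustration, any metric base name works):

nv-nsight-cu-cli --query-metrics | grep dram_throughput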
The naming convention for the metrics is documented at https://docs.nvidia.com/cupti/Cupti/r_main.html#r_host_derived_metrics_api. The idea is that the structure of the name helps you understand what a metric represents. The first part, before the "__", indicates on which unit the metric is collected. This can be followed by additional sub-units or pipeline stages that further detail where the data originates. Following the base name of a metric, the suffixes after the first "." specify how the data is rolled up into a single value and whether it is reported as a raw value or as a ratio against various common baselines. For example, the metric "gpc__cycles_elapsed.avg.per_second" can be broken apart this way: The metric is collected per Graphics Processing Cluster (GPC). For each GPC, the number of cycles elapsed while executing the kernel is collected. If your GPU has multiple GPCs (a TU102, for example, has 6 GPCs), the ".avg" specifies that we are interested in the average of the values obtained per GPC instance. Finally, the ".per_second" requests that instead of reporting raw cycles, we divide the cycles by the wall-clock time it took to execute the kernel, in seconds. Consequently, that ratio of cycles/second is a representation of the average frequency all GPCs were operating at during the kernel launch.
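To make that anatomy explicit, here is the same example annotated piece by piece (the part labels are my own shorthand, not official terminology):

gpc__cycles_elapsed.avg.per_second
  gpc            -> unit on which the counter is collected
  cycles_elapsed -> what is counted on that unit
  .avg           -> roll-up across the unit instances
  .per_second    -> baseline the rolled-up value is divided by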
The term SOL% means "% of Speed-of-Light", the theoretical maximum # of operations-per-cycle for a given GPU. It has the same meaning as %-of-peak-throughput.
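In other words, for any throughput metric (a sketch of the relation, not the tool's exact internal computation):

SOL% = 100 * achieved rate / peak sustained rate

with both rates measured in the metric's native unit of work per cycle.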
Throughput Metrics
You can deconstruct most other metrics in a similar way, but some metrics might require additional steps. Higher-level throughput metrics combine multiple underlying metrics into a single, aggregate value. The goal is to provide a single indicator for a whole set of pipelines or units, instead of having to go through all the individual metrics. Nsight Compute allows you to break down high-level metrics into their lower-level input metrics and report the individual results. For example, the SOL Breakdown tables in the Speed Of Light section in version 2019.5 are implemented this way. The syntax for the command line is "nv-nsight-cu-cli --metrics breakdown:sm__memory_throughput.avg <TARGET_APP>". The output of this command will report the following metrics:
idc__request_cycles_active.avg
sm__inst_executed_pipe_adu.avg
sm__inst_executed_pipe_ipa.avg
sm__inst_executed_pipe_lsu.avg
sm__inst_executed_pipe_tex.avg
sm__mio2rf_writeback_active.avg
sm__mio_pq_read_cycles_active.avg
sm__mio_pq_write_cycles_active.avg
Nearly all metrics that can be broken down this way aggregate their inputs into the higher-level value by taking the max of the requested ratio. In the case of "sm__memory_throughput.avg.pct_of_peak_sustained_elapsed", this ratio puts each input metric in relation to its corresponding peak value, so the result is a percentage of peak utilization for each of the input metrics. The high-level metric then reports the maximum across its sub-metrics. In this case, it identifies the busiest unit/pipeline/bus of the SM for handling memory instructions. The input metrics cover the InDexed Constant cache (IDC), several execution pipelines that handle memory instructions (e.g., the load/store unit (LSU) and the texture unit (TEX)), as well as the request/return data paths to the memory input/output (MIO) unit.
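As a pseudo-formula (again only a sketch of the roll-up, not the exact implementation):

sm__memory_throughput.avg.pct_of_peak_sustained_elapsed
  = max( % of peak of idc__request_cycles_active,
         % of peak of sm__inst_executed_pipe_lsu,
         ...and so on for each remaining input metric... )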
Just for completeness, here is the breakdown of the metrics you mentioned:
=================================================================================================================================================================
| Metric | Breakdown | Short Description |
=================================================================================================================================================================
| gpu__dram_throughput.avg | dram__cycles_active.avg | # of cycles where DRAM was active |
| | fbpa__dram_sectors.avg | # of DRAM sectors accessed |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| sm__throughput.avg | idc__request_cycles_active.avg | # of cycles where IDC processed requests from SM |
| | sm__inst_executed.avg | # of warp instructions executed |
| | sm__inst_executed_pipe_adu.avg | # of warp instructions executed by adu pipe |
| | sm__inst_executed_pipe_cbu_pred_on_any.avg | # of warp instructions executed by cbu pipe with at least 1 thread predicated on |
| | sm__inst_executed_pipe_fp16.avg | # of warp instructions executed by fp16 pipe |
| | sm__inst_executed_pipe_ipa.avg | # of warp instructions executed by ipa pipe |
| | sm__inst_executed_pipe_lsu.avg | # of warp instructions executed by lsu pipe |
| | sm__inst_executed_pipe_tex.avg | # of warp instructions executed by tex pipe |
| | sm__inst_executed_pipe_uniform.avg | # of warp instructions executed by uniform pipe |
| | sm__inst_executed_pipe_xu.avg | # of warp instructions executed by xu pipe |
| | sm__issue_active.avg | # of cycles where an SMSP issued an instruction |
| | sm__mio2rf_writeback_active.avg | # of cycles where the MIO to register file writeback interface was active |
| | sm__mio_inst_issued.avg | # of instructions issued from MIOC to MIO |
| | sm__mio_pq_read_cycles_active.avg | # of cycles where MIOP PQ sent register operands to a pipeline |
| | sm__mio_pq_write_cycles_active.avg | # of cycles where register operands from the register file were written to MIO PQ |
| | sm__pipe_alu_cycles_active.avg | # of cycles where alu pipe was active |
| | sm__pipe_fma_cycles_active.avg | # of cycles where fma pipe was active |
| | sm__pipe_fp64_cycles_active.avg | # of cycles where fp64 pipe was active |
| | sm__pipe_shared_cycles_active.avg | # of cycles where the 'shared pipe' fp16+tensor was active |
| | sm__pipe_tensor_cycles_active.avg | # of cycles where tensor pipe was active |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
| sm__memory_throughput.avg | idc__request_cycles_active.avg | # of cycles where IDC processed requests from SM |
| | sm__inst_executed_pipe_adu.avg | # of warp instructions executed by adu pipe |
| | sm__inst_executed_pipe_ipa.avg | # of warp instructions executed by ipa pipe |
| | sm__inst_executed_pipe_lsu.avg | # of warp instructions executed by lsu pipe |
| | sm__inst_executed_pipe_tex.avg | # of warp instructions executed by tex pipe |
| | sm__mio2rf_writeback_active.avg | # of cycles where the MIO to register file writeback interface was active |
| | sm__mio_pq_read_cycles_active.avg | # of cycles where MIOP PQ sent register operands to a pipeline |
| | sm__mio_pq_write_cycles_active.avg | # of cycles where register operands from the register file were written to MIO PQ |
=================================================================================================================================================================
These breakdowns may vary per architecture; the breakdown shown here is for a Turing-class card. For the three high-level metrics we discussed so far: "SOL FB" describes the percentage of peak performance the on-device memory achieved. This includes the total number of cycles the DRAM was busy as well as the total number of sectors accessed. "Memory Pipes Busy" covers the parts of the SM that handle memory instructions. It is also reported as a percentage of the possible peak value and indicates the busiest unit in that part of the chip. Likewise, "SM %" reports the busiest unit of the whole SM, not just the parts related to memory instructions.
In fact, "sm__throughput.avg" contains all of "sm__memory_throughput.avg", as high-level metrics can take other derived metrics as input. The breakdown allows you to discover that as well, by specifying the recursion level for the expansion of a metric. If you run "nv-nsight-cu-cli --metrics breakdown:1:sm__throughput.avg <TARGET_APP>" (note the additional ":1" in the command), the metric is only broken down one level, and no further expansion takes place. This results in a breakdown into only the following two metrics:
sm__instruction_throughput.avg
sm__memory_throughput.avg
And as discussed above, each of these expands into its own set of sub-metrics again. In this example, the high-level metric behind "SM %" represents the maximum peak value across all of the SM instruction throughputs and the SM memory throughputs.
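Or, written as a simple relation (again only a sketch of the roll-up):

sm__throughput % of peak = max( sm__instruction_throughput % of peak, sm__memory_throughput % of peak )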
Occupancy vs. Throughput
Your other two questions were about "Block Limit registers" and the difference between "Achieved Occupancy" and "SM %". Occupancy, or warp occupancy, is defined as the ratio of the active warps on an SM to the maximum number of active warps supported by the SM. Occupancy varies over time as warps begin and end, and can be different for each SM. The launch configuration, the compile options for the kernel, and the capabilities of the target device define an upper limit on how many warps can be active on an SM (Theoretical Occupancy). One of the limiters of the theoretical occupancy is how many registers each thread of the kernel requires. Each SM has a fixed-size register file available, and the more registers each thread of a kernel requires, the fewer warps can be active in parallel on each SM. The metric "Block Limit registers" describes the maximum number of blocks that can run on each SM due to this register constraint. The lowest of the "Block Limit" metrics defines the overall "Theoretical Occupancy".

In contrast to that theoretical upper limit, the "Achieved Occupancy" is collected at runtime. The underlying metric is "sm__warps_active.avg.pct_of_peak_sustained_active". It models exactly the ratio of the number of active warps over the theoretical maximum for the chip. However, just because a warp is actively scheduled on an SM does not necessarily mean that it makes efficient use of the available execution resources of the GPU. "Achieved Occupancy" tells you, on average, how full you keep the GPU with respect to warps only. "SM %" tells you the peak percentage of utilization of the busiest SM execution resource. Maybe think of the difference this way: a kernel can completely fill the SMs with active warps, but barely stress any compute unit. Or, a handful of warps can completely saturate the floating-point units.
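A small, hypothetical example for the register limit (the numbers are illustrative, not taken from your report, and ignore allocation granularity): assume an SM with a 65,536-register register file and a maximum of 32 resident warps, and a kernel compiled to 128 registers per thread, launched with 256 threads (8 warps) per block. Each block then needs 128 * 256 = 32,768 registers, so at most 65,536 / 32,768 = 2 blocks fit on one SM ("Block Limit registers" = 2). That caps the SM at 2 * 8 = 16 active warps, i.e. a Theoretical Occupancy of 16 / 32 = 50%, regardless of what the other block limits would allow.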
While we work on providing more documentation for the metrics, I recommend the following resources to understand some of the metrics discussed here in more detail: https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels, https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm, and https://developer.nvidia.com/gtc/2019/video/S9345.
Thank you,
Magnus