What does "TriageSCG" mean in the metric TriageSCG.sm__throughput?

In the nsys config file, I found that nsys uses TriageSCG.sm__throughput to represent SM throughput.

So

  1. what does “TriageSCG” here mean?
  2. ncu uses sm__throughput rather than TriageSCG.sm__throughput to represent SM throughput. What is the difference?
  3. what is the difference between TriageSCG.sm__throughput and TriageAC.sm__throughput?

The namespace “TriageSCG” means that all metrics in the namespace were designed to be collected in a single pass. Due to hardware limitations in collecting counters, a unit throughput defined in a namespace:

<namespace>.<unit>__throughput
TriageSCG.sm__throughput

is often a subset of the full

<unit>__throughput (e.g. sm__throughput)

TriageSCG - Simultaneous Compute and Graphics
TriageAC - Asynchronous Compute

The two throughputs collect the same metrics; however, the xu pipe is collected at a different level (smsp in TriageAC, sm in TriageSCG). The SM_{A,B,C} prefixes are hints regarding which performance monitor should be used. Defining a single-pass configuration is an art form due to hardware limitations in the performance monitor and unit PM signal definitions.

SM_A.TriageAC.sm__inst_executed_pipe_alu_realtime
SM_A.TriageAC.sm__inst_executed_realtime
SM_C.TriageAC.smsp__inst_executed_pipe_fma
SM_C.TriageAC.smsp__inst_executed_pipe_fmaheavy
SM_C.TriageAC.smsp__inst_executed_pipe_xu

SM_A.TriageSCG.sm__inst_executed_pipe_alu_realtime
SM_A.TriageSCG.sm__inst_executed_pipe_xu_realtime
SM_A.TriageSCG.sm__inst_executed_realtime
SM_C.TriageSCG.smsp__inst_executed_pipe_fma
SM_C.TriageSCG.smsp__inst_executed_pipe_fmaheavy
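
To make the difference between the two lists easier to see, here is a minimal Python sketch that splits each name into its performance monitor hint, namespace, unit, and counter, then compares the two namespaces. The metric names are copied from the lists above; the helper names (`parse`, `metrics_by_namespace`) are made up for illustration and are not part of any NVIDIA tool.

```python
from collections import defaultdict

# Metric names copied verbatim from the two lists above.
TRIAGE_METRICS = [
    "SM_A.TriageAC.sm__inst_executed_pipe_alu_realtime",
    "SM_A.TriageAC.sm__inst_executed_realtime",
    "SM_C.TriageAC.smsp__inst_executed_pipe_fma",
    "SM_C.TriageAC.smsp__inst_executed_pipe_fmaheavy",
    "SM_C.TriageAC.smsp__inst_executed_pipe_xu",
    "SM_A.TriageSCG.sm__inst_executed_pipe_alu_realtime",
    "SM_A.TriageSCG.sm__inst_executed_pipe_xu_realtime",
    "SM_A.TriageSCG.sm__inst_executed_realtime",
    "SM_C.TriageSCG.smsp__inst_executed_pipe_fma",
    "SM_C.TriageSCG.smsp__inst_executed_pipe_fmaheavy",
]


def parse(name):
    """Split '<pm_hint>.<namespace>.<unit>__<counter>' into its four parts."""
    pm_hint, namespace, metric = name.split(".")
    unit, counter = metric.split("__", 1)
    return pm_hint, namespace, unit, counter


def metrics_by_namespace(names):
    """Group (pm_hint, metric) pairs under their namespace."""
    groups = defaultdict(set)
    for name in names:
        pm_hint, namespace, unit, counter = parse(name)
        groups[namespace].add((pm_hint, f"{unit}__{counter}"))
    return groups


groups = metrics_by_namespace(TRIAGE_METRICS)
shared = sorted(groups["TriageAC"] & groups["TriageSCG"])
only_ac = sorted(groups["TriageAC"] - groups["TriageSCG"])
only_scg = sorted(groups["TriageSCG"] - groups["TriageAC"])

print("shared:        ", shared)
print("TriageAC only: ", only_ac)
print("TriageSCG only:", only_scg)
```

The only entries that differ are the xu pipe counters: TriageAC collects it per SMSP on SM_C, while TriageSCG collects the realtime SM-level counter on SM_A.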

Thanks a lot. If I understand this correctly,

the difference between
sm__throughput used in ncu
and
TriageSCG.sm__throughput used in nsys
is that sm__throughput collects the metric over multiple passes, while TriageSCG.sm__throughput collects it in a single pass?

And does TriageSCG.sm__throughput include Tensor Core throughput, or only CUDA core throughput?

By the way, what is Asynchronous Compute? I didn’t find any details about this terminology. Is it correct to say that Simultaneous Compute and Graphics refers to graphics-related workloads and Asynchronous Compute refers to CUDA compute workloads?

sm__throughput used by NCU has many more sub-metrics. The total number of sub-metrics cannot be collected in a single pass.

TriageSCG and TriageAC each collect a subset of those sub-metrics, chosen to cover the most critical ones.

Asynchronous Compute is a feature added in Graphics APIs that supports new asynchronous compute queues. Work submitted to asynchronous compute queues can run simultaneously with work submitted from the direct queue.

Advanced API Performance: Async Compute and Overlap | NVIDIA Technical Blog

Simultaneous Compute and Graphics is a hardware implementation that supports running compute and graphics work simultaneously.

Tools such as Nsight Systems have not defined a TriageCompute configuration for all GPUs (currently only GH100). CUDA developers can use the graphics triage groups and ignore any graphics-centric metrics.

As I asked, I would like to know if TriageSCG.sm__throughput includes the throughput of Tensor Cores.

I listed the sub-metrics in my first reply. The TriageAC and TriageSCG do not contain tensor pipe throughput as this is not yet critical in graphics applications. In NSYS the tensor pipe metric is available in the General Metrics for NVIDIA configuration. The highest level tensor pipe active metric is sm__pipe_tensor_cycles_active_realtime.

I misunderstood your first reply. Thank you verrrrry much. This is very helpful for me to understand nsight systems.

Per the Nsight Systems user guide:

Tensor Core: If you run nsys profile --gpu-metrics-devices all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.

Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization since there are other overheads. In general, the more computation-intensive an operation is, the higher Tensor Core utilization rate the CUDA kernel can achieve.
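
As a rough sketch of how that command could be assembled from a script, the snippet below builds the argument list for the `nsys profile --gpu-metrics-devices all` invocation quoted above. The function name `build_nsys_command` and the target `./my_app` are hypothetical placeholders; only the `nsys profile --gpu-metrics-devices all` part comes from the user guide text.

```python
import shlex


def build_nsys_command(app_args, devices="all"):
    """Return the argv list that profiles `app_args` with GPU metrics sampling enabled."""
    return ["nsys", "profile", "--gpu-metrics-devices", devices, *app_args]


# "./my_app" is a hypothetical application; replace it with your own binary and arguments.
cmd = build_nsys_command(["./my_app"])
print(shlex.join(cmd))
```

On a machine with Nsight Systems installed, the list could be passed to `subprocess.run(cmd, check=True)`; the Tensor Active row then appears in the GUI when the resulting report is opened.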
