The namespace “TriageSCG” means that all metrics in the namespace were designed to be collected in a single pass. Due to hardware limitations on counter collection, a unit throughput is defined in each of two namespaces:
TriageSCG - Simultaneous Compute and Graphics
TriageAC - Asynchronous Compute
The two throughputs collect the same metrics; however, the xu_pipe is collected at a different level. The SM_{A,B,C} suffixes are hints about which performance monitor should be used. Defining a single-pass configuration is an art form due to hardware limitations in the performance monitor and in the unit PM signal definitions.
Is the difference between sm__throughput used in NCU and TriageSCG.sm__throughput used in NSYS that sm__throughput collects the metric over multiple passes, while TriageSCG.sm__throughput collects it in a single pass?
And does TriageSCG.sm__throughput include the Tensor Core throughput, or only the CUDA cores?
By the way, what is Asynchronous Compute? I didn’t find any details about this terminology. Can I simply understand it as: Simultaneous Compute and Graphics means graphics-related workloads, while Asynchronous Compute means CUDA compute workloads?
sm__throughput as used by NCU has many more sub-metrics; the full set cannot be collected in a single pass. TriageSCG and TriageAC collect a subset chosen to cover the most critical sub-metrics.
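To see why the full sub-metric set forces multiple passes while a curated triage group fits in one, here is a minimal sketch in Python. The per-pass capacity, counter names, and sub-metric count are illustrative assumptions, not the real hardware limits or real metric lists:

```python
# Hypothetical sketch: why a large metric set needs multiple passes.
# The per-pass capacity and counter names are illustrative assumptions,
# not the real hardware limits or real sub-metric lists.

def schedule_passes(counters, capacity):
    """Greedily pack counters into collection passes of fixed capacity."""
    passes = []
    for name in counters:
        if not passes or len(passes[-1]) == capacity:
            passes.append([])
        passes[-1].append(name)
    return passes

# A triage group is curated to fit within a single pass ...
triage = ["sm__inst_executed", "sm__cycles_active", "xu_pipe_active"]
print(len(schedule_passes(triage, capacity=8)))   # 1 pass

# ... while a full throughput breakdown would need several.
full = [f"sub_metric_{i}" for i in range(20)]
print(len(schedule_passes(full, capacity=8)))     # 3 passes
```

Multi-pass collection is why NCU replays kernels when collecting large metric sets, whereas single-pass triage groups can be sampled continuously by NSYS.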
Asynchronous Compute is a feature added in Graphics APIs that supports new asynchronous compute queues. Work submitted to asynchronous compute queues can run simultaneously with work submitted from the direct queue.
Simultaneous Compute and Graphics is hardware support for running compute and graphics work at the same time.
Tools such as Nsight Systems have not defined a TriageCompute configuration for all GPUs (currently only GH100). CUDA developers can use the graphics triage groups and ignore any graphics-centric metrics.
I listed the sub-metrics in my first reply. TriageAC and TriageSCG do not contain the tensor pipe throughput, as it is not yet critical in graphics applications. In NSYS the tensor pipe metric is available in the General Metrics for NVIDIA configuration. The highest-level tensor pipe active metric is sm__pipe_tensor_cycles_active_realtime.
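As a rough sketch of how a cycles-active metric like this reduces to a percentage, assume the raw counters are active cycles and elapsed cycles (the counter values below are invented for illustration):

```python
# Hypothetical sketch: turning raw cycle counters into a percentage,
# in the style of sm__pipe_tensor_cycles_active_realtime.
# The counter values below are invented for illustration.

def pct_active(cycles_active, cycles_elapsed):
    """Percent of elapsed cycles in which the tensor pipe was active."""
    return 100.0 * cycles_active / cycles_elapsed

print(pct_active(cycles_active=450_000, cycles_elapsed=1_000_000))  # 45.0
```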
Tensor Core: If you run nsys profile --gpu-metrics-devices all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.
Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization, since there are other overheads. In general, the more computation-intensive an operation is, the higher the Tensor Core utilization a CUDA kernel can achieve.
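One rough way to see why compute-intensive operations leave more headroom for the Tensor Cores is arithmetic intensity (FLOPs per byte moved). The sketch below uses illustrative GEMM shapes and assumes fp16 storage with each matrix moved to or from memory exactly once:

```python
# Hypothetical sketch: arithmetic intensity of an M x K @ K x N GEMM,
# assuming fp16 (2 bytes/element) and that each matrix crosses memory
# exactly once. Shapes are illustrative, not from any real workload.

def gemm_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                  # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Larger K means more computation per byte, hence more Tensor Core headroom.
print(round(gemm_intensity(1024, 1024, 64), 1))    # 56.9
print(round(gemm_intensity(1024, 1024, 4096), 1))  # 455.1
```

A kernel with low arithmetic intensity is memory-bound, so its Tensor Active percentage stays low no matter how well the math pipe is used.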