The namespace “TriageSCG” means that all metrics in the namespace were designed to be collected in a single pass. Due to hardware limitations on counter collection, a unit throughput is defined in each of two namespaces:
TriageSCG - Simultaneous Compute and Graphics
TriageAC - Asynchronous Compute
The two throughputs collect the same metrics; however, the xu_pipe is collected at a different level. The SM_{A,B,C} suffixes are hints about which performance monitor should be used. Defining a single-pass configuration is an art form due to hardware limitations in the performance monitor and in the unit PM signal definitions.
Is the difference between sm__throughput used in NCU and TriageSCG.sm__throughput used in NSYS that sm__throughput collects the metric over multiple passes, while TriageSCG.sm__throughput collects it in a single pass?
And does TriageSCG.sm__throughput include the Tensor Core throughput, or only the CUDA cores?
By the way, what is Asynchronous Compute? I didn’t find any details about this terminology. Can I simply understand it as: Simultaneous Compute and Graphics means graphics-related workloads, while Asynchronous Compute means CUDA compute workloads?
sm__throughput as used by NCU has many more sub-metrics; the full set cannot be collected in a single pass. TriageSCG and TriageAC collect a subset chosen to cover the most critical sub-metrics.
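To see why the full sub-metric set forces multiple passes while a curated triage group fits in one, here is a minimal sketch in Python. The per-pass capacity, counter names, and sub-metric count are illustrative assumptions, not the real hardware limits or real metric lists:

```python
# Hypothetical sketch: why a large metric set needs multiple passes.
# The per-pass capacity and counter names are illustrative assumptions,
# not the real hardware limits or real sub-metric lists.

def schedule_passes(counters, capacity):
    """Greedily pack counters into collection passes of fixed capacity."""
    passes = []
    for name in counters:
        if not passes or len(passes[-1]) == capacity:
            passes.append([])
        passes[-1].append(name)
    return passes

# A triage group is curated to fit within a single pass ...
triage = ["sm__inst_executed", "sm__cycles_active", "xu_pipe_active"]
print(len(schedule_passes(triage, capacity=8)))   # 1 pass

# ... while a full throughput breakdown would need several.
full = [f"sub_metric_{i}" for i in range(20)]
print(len(schedule_passes(full, capacity=8)))     # 3 passes
```

Multi-pass collection is why NCU replays kernels when collecting large metric sets, whereas single-pass triage groups can be sampled continuously by NSYS.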
Asynchronous Compute is a feature added in Graphics APIs that supports new asynchronous compute queues. Work submitted to asynchronous compute queues can run simultaneously with work submitted from the direct queue.
Simultaneous Compute and Graphics is hardware support for running compute and graphics work at the same time.
Tools such as Nsight Systems have not defined a TriageCompute configuration for all GPUs (currently only GH100). CUDA developers can use the graphics triage groups and ignore any graphics-centric metrics.
I listed the sub-metrics in my first reply. TriageAC and TriageSCG do not contain the tensor pipe throughput, as it is not yet critical in graphics applications. In NSYS the tensor pipe metric is available in the General Metrics for NVIDIA configuration. The highest-level tensor pipe active metric is sm__pipe_tensor_cycles_active_realtime.
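As a rough sketch of how a cycles-active metric like this reduces to a percentage, assume the raw counters are active cycles and elapsed cycles (the counter values below are invented for illustration):

```python
# Hypothetical sketch: turning raw cycle counters into a percentage,
# in the style of sm__pipe_tensor_cycles_active_realtime.
# The counter values below are invented for illustration.

def pct_active(cycles_active, cycles_elapsed):
    """Percent of elapsed cycles in which the tensor pipe was active."""
    return 100.0 * cycles_active / cycles_elapsed

print(pct_active(cycles_active=450_000, cycles_elapsed=1_000_000))  # 45.0
```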
Tensor Core: If you run nsys profile --gpu-metrics-devices all, the Tensor Core utilization can be found in the GUI under the SM instructions/Tensor Active row.
Please note that it is not practical to expect a CUDA kernel to reach 100% Tensor Core utilization, since there are other overheads. In general, the more computation-intensive an operation is, the higher the Tensor Core utilization a CUDA kernel can achieve.
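One rough way to see why compute-intensive operations leave more headroom for the Tensor Cores is arithmetic intensity (FLOPs per byte moved). The sketch below uses illustrative GEMM shapes and assumes fp16 storage with each matrix moved to or from memory exactly once:

```python
# Hypothetical sketch: arithmetic intensity of an M x K @ K x N GEMM,
# assuming fp16 (2 bytes/element) and that each matrix crosses memory
# exactly once. Shapes are illustrative, not from any real workload.

def gemm_intensity(m, n, k, bytes_per_elem=2):
    flops = 2 * m * n * k                                  # multiply + add
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Larger K means more computation per byte, hence more Tensor Core headroom.
print(round(gemm_intensity(1024, 1024, 64), 1))    # 56.9
print(round(gemm_intensity(1024, 1024, 4096), 1))  # 455.1
```

A kernel with low arithmetic intensity is memory-bound, so its Tensor Active percentage stays low no matter how well the math pipe is used.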