What is the different between “SM: Pipe Tc Cycles Active [%]” and “SM: Pipe Tensor Cycles Active [%]” in nsight compute

I see the description in nsight compute:

1) for “SM: Pipe Tc Cycles Active [%]”:

SM: Pipe Tc Cycles Active
sm__pipe_tc_cycles_active.avg.pct_of_peak_sustained_elapsed

tc: Tensor Core.
The TC pipeline executes UTCBAR, UTCCP, UTCMMA, UTCSHIFT and UTCSWS instructions.
It is different from the Tensor pipeline.

2) for “SM: Pipe Tensor Cycles Active [%]”:

SM: Pipe Tensor Cycles Active
sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed

tensor: The Tensor pipeline executes various MMA instructions.
It is different from the Tensor Core pipeline.

This description is so confused. What is the actual diff?

I have a ncu report:

Instruction:

ncu_report: bf16_9_1024_16_128_false.ncu-rep.zip (1.8 MB)

In this report, it have not MMA inst, but only have UTC MMA inst. Why Pipe Tensor(Metric: “SM: Pipe Tensor Cycles Active [%]”) is very busy?

Thanks!

The SM has three types of Tensor instructions:

  1. wmma warp level tensor instructions ({BIHQ}MMA)
  2. GH100 - wgmma warp group level tensor instructions
  3. GB100 - tcgen05.mma CTA and CTA pair tensor instructions (UTC{BIHQ}MMA)

In all three cases the MMA unit (Tensor Cycles) is in the SM sub-partitions.

wmma warp level the instructions execute like a FFMA instruction. The pipeline reads the register file and dispatches to the pipeline.

wgmma warp group level instructions have additional instructions to specify that all dependencies for issuing the warp group instructions have been resolved and the instruction is issued simultaneously by all 4 SM sub-partitions allowing sharing of input.

In tcgen05 there is a new controlling unit in the SM MIO called the Tensor Core unit or TC unit (overloaded term). Instructions are dispatched from a SM sub-partition to an instruction queue. If the instruction is one CTA the TC issues the instruction on all 4 sub-partitions of the SM. If the instruction is CTA pair then the TC issues the instruction on both SMs in the TPC covering 8 sub-partitions.

A tcgen05 instruction will result in both sm__pipe_tc_cycles_active (singleton per SM) and sm[sp]__pipe_tensor_cycles_active (per SMSP) counters being updated. Please note I used the syntax sm[sp] as you can collect pipe_tensor_cycles_active at either the SM level or the SMSP (sub-partition level). For WMMA instructions the per SMSP value could differ. For WGMMA and tcgen05 the pipe_tensor_cycle_active should be the same across all 4 sub-partitions in the SM.

The TC unit is also responsible for some of the other tcgen05 (not .mma) instructions so activity of TC could exceed Tensor Cycles Active. In practice these tend to be very close. In NCU these will likely be collected in different passes so there can be some variance run to run in the % as the denominator can change and real-time stalls in the TC can change.