GH100 - wgmma warp group level tensor instructions
GB100 - tcgen05.mma CTA and CTA pair tensor instructions (UTC{BIHQ}MMA)
In all three cases the MMA unit (Tensor Cycles) is in the SM sub-partitions.
wmma warp level the instructions execute like a FFMA instruction. The pipeline reads the register file and dispatches to the pipeline.
wgmma warp group level instructions have additional instructions to specify that all dependencies for issuing the warp group instructions have been resolved and the instruction is issued simultaneously by all 4 SM sub-partitions allowing sharing of input.
In tcgen05 there is a new controlling unit in the SM MIO called the Tensor Core unit or TC unit (overloaded term). Instructions are dispatched from a SM sub-partition to an instruction queue. If the instruction is one CTA the TC issues the instruction on all 4 sub-partitions of the SM. If the instruction is CTA pair then the TC issues the instruction on both SMs in the TPC covering 8 sub-partitions.
A tcgen05 instruction will result in both sm__pipe_tc_cycles_active (singleton per SM) and sm[sp]__pipe_tensor_cycles_active (per SMSP) counters being updated. Please note I used the syntax sm[sp] as you can collect pipe_tensor_cycles_active at either the SM level or the SMSP (sub-partition level). For WMMA instructions the per SMSP value could differ. For WGMMA and tcgen05 the pipe_tensor_cycle_active should be the same across all 4 sub-partitions in the SM.
The TC unit is also responsible for some of the other tcgen05 (not .mma) instructions so activity of TC could exceed Tensor Cycles Active. In practice these tend to be very close. In NCU these will likely be collected in different passes so there can be some variance run to run in the % as the denominator can change and real-time stalls in the TC can change.