NCU L2 metric understanding

Hi everyone,

I’m studying the interference between concurrent NCCL communication kernels and GEMM compute kernels, and I’m trying to understand how they contend for the L2 cache.

Using Nsight Compute, I came across the following metrics:

lts__average_t_sector_srcnode_gpc
lts__average_t_sector_srcnode_fbp
lts__average_t_sector_srcnode_hub

From the names, my current understanding is:

  • GPC: traffic originating from SM-side units (SMs, L1/TEX, etc.)
  • FBP: traffic from the frame buffer partition (HBM ↔ L2 path)
  • HUB: traffic from external links (NVLink / PCIe ↔ L2)
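
For reference, these can be collected directly with the NCU command line. A minimal sketch (`./app` is a placeholder for the workload, and exact metric names and rollups vary by architecture, so it's worth querying first):

```shell
# List the srcnode breakdown metrics exposed on this GPU
ncu --query-metrics | grep "lts__.*srcnode"

# Collect the per-srcnode L2 sector metrics for the kernels of interest
ncu --metrics lts__average_t_sector_srcnode_gpc,lts__average_t_sector_srcnode_fbp,lts__average_t_sector_srcnode_hub ./app
```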

However, I’m confused about the finer-grained breakdowns of these metrics.

For example, what exactly does lts__average_t_sector_srcnode_fbp_op_write represent?

My current interpretation is: this corresponds to write operations that miss in L2 and are serviced by HBM, so they are attributed to the FBP source node.

Is this interpretation correct? Or does this metric represent something more specific (e.g., writebacks vs. fills)?

Any clarification would be greatly appreciated. Thanks!

FBP - The Frame Buffer Partition contains the L2 cache slices (lts) and the memory controllers (dram).

  • On GH100 and GB100 chips the memory subsystem is divided into two primary partitions, each servicing 4 GPCs. When a GPC in one partition reads from memory attached to the other partition, the request is sent to a local L2 cache slice (the local caching node) and then directly over the LTCFABRIC to an L2 cache slice in the remote memory partition (the point of coherence).
  • On older GPUs the Raster Operation (ROP) units were in the FBP.

srcnode_fbp on GH100 and GB100 should equal srcunit_ltcfabric (its only child).

The metrics you have listed are observed at the L2 slice tag stage (lts__t). All hits/misses are attributed to the requesting srcnode or srcunit, so a miss from the L1TEX is attributed to GPC, not to FBP. L2 misses resulting in a fill are not double-counted at the tag stage, since the location of the cache line is already known.
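
The srcnode_fbp/srcunit_ltcfabric relationship can be checked empirically; a hedged sketch, assuming both breakdowns are exposed by your NCU build on GH100/GB100 (`./app` is a placeholder):

```shell
# On GH100/GB100 the two values below should match, since
# srcunit_ltcfabric is the only child unit of srcnode_fbp
ncu --metrics lts__average_t_sector_srcnode_fbp,lts__average_t_sector_srcunit_ltcfabric ./app
```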


@Greg Thank you so much for the detailed explanation, it really clarified the role of FBP for me.

Could you elaborate a bit more on the HUB source node? Is it accurate to think of HUB as representing L2 traffic originating from external links (e.g., NVLink / PCIe)?

If so, would lts__average_t_sector_srcnode_hub_op_write correspond to remote writes arriving over external links and being serviced by the local L2?

And on the sender side, is there a corresponding metric that attributes those remote writes on the source GPU other than nvltx__*? My understanding is that remote writes typically bypass L1 and are injected directly into the fabric, but I’d love to confirm whether there’s a metric that captures this from the transmitting side.

srcnode_hub will include operations to GPU memory by

  • copy engines
  • video engines
  • graphics/compute front end
  • hardware performance monitor
  • display controller
  • memory management unit
  • host over PCIe or C2C
  • remote target over PCIe or NVLink

There are no good sender-side metrics for understanding traffic over NVLink. SM-to-NVLink and copy-engine-to-NVLink traffic bypasses L2, so nvl{rx,tx}__bytes_packet_{request,response}_* can provide slightly more information than just nvl{rx,tx}__bytes[_data*], but there is no breakdown by sender.
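
Those counters can be collected alongside the byte totals; a sketch (metric availability depends on the GPU and NCU version, and `./app` is a placeholder):

```shell
# Sender/receiver NVLink byte totals (no per-sender breakdown exists)
ncu --metrics nvltx__bytes,nvlrx__bytes ./app

# The packet request/response counters give slightly more detail;
# query first to see which variants this chip exposes
ncu --query-metrics | grep "nvl[rt]x__bytes_packet"
```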