NCU L2 metric understanding

Hi everyone,

I’m studying the interference between concurrent NCCL communication kernels and GEMM compute kernels, and I’m trying to understand how they contend for the L2 cache.

Using Nsight Compute, I came across the following metrics:

lts__average_t_sector_srcnode_gpc
lts__average_t_sector_srcnode_fbp
lts__average_t_sector_srcnode_hub

From the names, my current understanding is:

  • GPC: traffic originating from SM-side units (SMs, L1/TEX, etc.)
  • FBP: traffic from the frame buffer partition (HBM ↔ L2 path)
  • HUB: traffic from external links (NVLink / PCIe ↔ L2)
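
For reference, these can be collected directly with the NCU command line. A minimal sketch (`./app` is a placeholder for the workload, and exact metric names and rollups vary by architecture, so it's worth querying first):

```shell
# List the srcnode breakdown metrics exposed on this GPU
ncu --query-metrics | grep "lts__.*srcnode"

# Collect the per-srcnode L2 sector metrics for the kernels of interest
ncu --metrics lts__average_t_sector_srcnode_gpc,lts__average_t_sector_srcnode_fbp,lts__average_t_sector_srcnode_hub ./app
```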

However, I’m confused about the finer-grained breakdowns of these metrics.

For example, what exactly does lts__average_t_sector_srcnode_fbp_op_write represent?

My current interpretation is: this corresponds to write operations that miss in L2 and are serviced by HBM, so they are attributed to the FBP source node.

Is this interpretation correct? Or does this metric represent something more specific (e.g., writebacks vs. fills)?

Any clarification would be greatly appreciated. Thanks!

FBP - The Frame Buffer Partition contains the L2 cache slices (lts) and the memory controllers (dram).

  • On GH100 and GB100 chips the memory subsystem is divided into two primary partitions, each servicing 4 GPCs. When a GPC in one partition reads from memory attached to the other partition, the request is sent to a local L2 cache slice (the local caching node) and then directly over the LTCFABRIC to an L2 cache slice in the remote memory partition (the point of coherence).
  • On older GPUs the Raster Operation (ROP) units were in the FBP.

srcnode_fbp on GH100 and GB100 should equal srcunit_ltcfabric (its only child).

The metrics you have listed are observed at the L2 slice tag stage (lts__t). All hits/misses are attributed to the requesting srcnode or srcunit, so a miss from the L1TEX is attributed to GPC, not to FBP. L2 misses resulting in a fill are not double-counted at the tag stage, since the location of the cache line is already known.
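
The srcnode_fbp/srcunit_ltcfabric relationship can be checked empirically; a hedged sketch, assuming both breakdowns are exposed by your NCU build on GH100/GB100 (`./app` is a placeholder):

```shell
# On GH100/GB100 the two values below should match, since
# srcunit_ltcfabric is the only child unit of srcnode_fbp
ncu --metrics lts__average_t_sector_srcnode_fbp,lts__average_t_sector_srcunit_ltcfabric ./app
```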


@Greg Thank you so much for the detailed explanation, it really clarified the role of FBP for me.

Could you elaborate a bit more on the HUB source node? Is it accurate to think of HUB as representing L2 traffic originating from external links (e.g., NVLink / PCIe)?

If so, would lts__average_t_sector_srcnode_hub_op_write correspond to remote writes arriving over external links and being serviced by the local L2?

And on the sender side, is there a corresponding metric that attributes those remote writes on the source GPU other than nvltx__*? My understanding is that remote writes typically bypass L1 and are injected directly into the fabric, but I’d love to confirm whether there’s a metric that captures this from the transmitting side.

srcnode_hub will include operations to GPU memory by

  • copy engines
  • video engines
  • graphics/compute front end
  • hardware performance monitor
  • display controller
  • memory management unit
  • host over PCIe or C2C
  • remote target over PCIe or NVLink

There are no good sender-side metrics for understanding traffic over NVLink. SM-to-NVLink and copy-engine-to-NVLink traffic bypasses L2, so nvl{rx,tx}__bytes_packet_{request,response}_* can provide slightly more information than just nvl{rx,tx}__bytes[_data*], but there is no breakdown by sender.
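
Those counters can be collected alongside the byte totals; a sketch (metric availability depends on the GPU and NCU version, and `./app` is a placeholder):

```shell
# Sender/receiver NVLink byte totals (no per-sender breakdown exists)
ncu --metrics nvltx__bytes,nvlrx__bytes ./app

# The packet request/response counters give slightly more detail;
# query first to see which variants this chip exposes
ncu --query-metrics | grep "nvl[rt]x__bytes_packet"
```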