The difference between FBP and FBPA

I have been trying to understand NVIDIA GPUs’ L2 / memory hierarchy, and one question keeps coming up:

What exactly are FBP and FBPA?

The public information I could find is very limited.

The A100 whitepaper explicitly says:

Eight 512 KB L2 slices are associated with each memory controller.

So on A100, one memory controller corresponds to 8 L2 slices.
The V100 and H100 whitepapers do not state this explicitly, but I assume they use a similar organization.

From the profiler documentation and my measurements, my current understanding is:

FBP = one memory controller + several L2 slices

What I measured is:

  • V100: device__attribute_fbp_count = 8, device__attribute_num_l2s_per_fbp = 8

  • A100: seemingly 8 FBPs, each with 10 L2 slices

So at least from the profiler’s point of view, FBP seems to be a partition consisting of one controller plus a group of L2 slices.

However, I cannot make sense of FBPA.

The profiler documentation says:

fbpa: The FrameBuffer Partition is a memory controller which sits between
the level 2 cache (LTC) and the DRAM.

I also see fbpa__* metrics. On V100, for example:

  • fbpa__dram_sectors.avg.peak_sustained = 2

  • fbpa__dram_sectors.sum.peak_sustained = 32

And since sum / avg = 16, this seems to imply there are 16 fbpa instances.
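
A minimal sketch of that inference, assuming the .avg rollup is the arithmetic mean over unit instances (so sum / avg gives the instance count). The helper name instance_count is hypothetical and is not part of any profiler API:

```python
def instance_count(sum_value: float, avg_value: float) -> int:
    """Infer the number of unit instances from a metric's .sum and .avg rollups.

    Assumption: .avg is the plain mean over instances, so sum / avg == count.
    """
    if avg_value == 0:
        raise ValueError("avg is zero; the instance count cannot be inferred")
    ratio = sum_value / avg_value
    count = round(ratio)
    if abs(ratio - count) > 1e-6:
        raise ValueError(f"sum / avg = {ratio} is not a whole number")
    return count


# V100 values quoted above for fbpa__dram_sectors.{sum,avg}.peak_sustained
v100_fbpa_instances = instance_count(32, 2)
print(v100_fbpa_instances)       # 16
print(v100_fbpa_instances // 8)  # 2 FBPAs per FBP, given fbp_count = 8
```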

That suggests something like:

  • 8 FBPs

  • 2 FBPAs per FBP

This seems consistent with older P100 documentation mentioning 1 FBP = 2 FBPAs, but I still do not know what FBPA actually corresponds to in hardware.

I also have a bandwidth-related question.

V100 uses 4 × HBM2, 32 channels. If we think about the path between DRAM and the memory controllers, then a theoretical peak of 32 sectors/cycle seems reasonable.
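
As a rough sanity check on that figure, the sketch below converts 32 sectors/cycle into bytes per second. It assumes 32-byte sectors and a memory clock of 877 MHz (the commonly quoted V100 SXM2 HBM2 clock, which I did not measure here); the function name is hypothetical:

```python
SECTOR_BYTES = 32              # assumption: DRAM traffic is counted in 32-byte sectors
PEAK_SECTORS_PER_CYCLE = 32    # fbpa__dram_sectors.sum.peak_sustained on V100
ASSUMED_MEM_CLOCK_HZ = 877e6   # assumption: V100 SXM2 memory clock, not measured here


def peak_bandwidth_gb_per_s(sectors_per_cycle: float, clock_hz: float,
                            sector_bytes: int = SECTOR_BYTES) -> float:
    """Convert a per-cycle sector peak into GB/s (1 GB = 1e9 bytes)."""
    return sectors_per_cycle * sector_bytes * clock_hz / 1e9


v100_peak = peak_bandwidth_gb_per_s(PEAK_SECTORS_PER_CYCLE, ASSUMED_MEM_CLOCK_HZ)
print(f"{v100_peak:.1f} GB/s")
```

Under those assumptions the result lands close to the roughly 900 GB/s advertised for V100, which is what I would expect if the 32 sectors/cycle peak is counted in memory-clock cycles.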

But if fbpa is an on-chip entity, I would expect it to run in a higher clock domain than DRAM. In that case, why does fbpa__dram_sectors.sum.peak_sustained also show a peak of exactly 32 sectors/cycle? That part does not make sense to me.

My questions are:

  1. What exactly are FBP and FBPA?

  2. Is it reasonable to interpret FBP as one memory controller plus several L2 slices?

  3. What is the relationship between FBPA, FBP, and the memory controller?

  4. Why does fbpa__dram_sectors on V100 also show a peak limit of 32 sectors/cycle?

Please!! I need help, thank you guys!

Please refer to the “Metrics Guide -> Units” section of the Profiling Guide in the Nsight Compute 13.2 documentation:

fbpa: The FrameBuffer Partition is a memory controller which sits between the level 2 cache (LTC) and the DRAM. The number of FBPAs varies across GPUs.

The FBP, Frame Buffer Partition, is one of the primary building blocks of NVIDIA GPUs, along with the GPC, General (or Graphics) Processing Cluster. The FBP contains:

  • LTCs - L2 cache clusters, which contain
    • LTSs - L2 cache slices (the primary unit observed for L2)
  • FBPAs - Frame Buffer Partitions (memory controller side)
    • DRAM - the memory controllers, which cover the channels and pseudo channels
  • LTCs connect to FBPAs
  • LTSs connect to FBSPs

Wider memory systems such as HBM memory have more FBPA and DRAM units to handle the number of channels and pseudo channels.

The FBP is not in the same clock domain as the GDDR or HBM memory. This is key to understanding the metrics: some GPUs do not support single-pass collection (observing all unit instances) at the FBSP (memory controller) level, so metrics are collected on the interface between LTS and FBPA.

Metrics prefixed with dram__ are collected at the memory controller, and dram__cycles_elapsed is in terms of the memory frequency (or 1/2 the memory frequency). Where possible, use dram__. If dram__ metrics are not available, then dramc__ metrics are the best option; however, these are likely in the wrong clock domain. They will show the correct sectors but may not show a correct .avg.pct_of_peak_sustained_elapsed, because dramc (really part of the FBPA, not the actual memory controller) runs on a different clock that may not be a multiple of the memory clock. The result is a lower pct than would probably be measured in the memory clock.
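
To illustrate the clock-domain effect, here is a simplified sketch. It assumes the percentage is sectors divided by (elapsed cycles × per-cycle peak) and that the same per-cycle peak is applied in both domains; this is a toy model, not Nsight Compute's actual computation, and every number in it is hypothetical:

```python
def pct_of_peak(sectors: float, cycles_elapsed: float,
                peak_sectors_per_cycle: float) -> float:
    """Toy model: achieved sectors as a percentage of (cycles * per-cycle peak)."""
    return 100.0 * sectors / (cycles_elapsed * peak_sectors_per_cycle)


# Hypothetical values, chosen only to show the direction of the effect.
elapsed_seconds = 1e-3
mem_clock_hz = 877e6       # hypothetical memory clock
other_clock_hz = 1000e6    # hypothetical faster, non-multiple clock domain
peak_per_cycle = 32

# A workload that keeps DRAM 80% busy when measured in memory-clock cycles.
sectors = 0.8 * peak_per_cycle * mem_clock_hz * elapsed_seconds

pct_mem = pct_of_peak(sectors, mem_clock_hz * elapsed_seconds, peak_per_cycle)
pct_other = pct_of_peak(sectors, other_clock_hz * elapsed_seconds, peak_per_cycle)

print(f"memory clock domain: {pct_mem:.2f}%")
print(f"other clock domain:  {pct_other:.2f}%")
```

The sector count is identical in both cases; only the cycle count in the denominator changes, so the faster domain reports the lower percentage.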