I have been trying to understand NVIDIA GPUs’ L2 / memory hierarchy, and one question keeps coming up:
What exactly are FBP and FBPA?
The public information I could find is very limited.
The A100 whitepaper explicitly says:
"Eight 512 KB L2 slices are associated with each memory controller."
So on A100, one memory controller corresponds to 8 L2 slices.
Although the V100 and H100 whitepapers do not state this explicitly, I assume they likely use a similar organization.
From the profiler documentation and my measurements, my current understanding is:
FBP = one memory controller + several L2 slices
What I measured is:
- V100: device__attribute_fbp_count = 8, device__attribute_num_l2s_per_fbp = 8
- A100: seemingly 8 FBPs, each with 10 L2 slices
So at least from the profiler’s point of view, FBP seems to be a partition consisting of one controller plus a group of L2 slices.
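As a quick consistency check on these counts, here is a minimal sketch that multiplies out the slice totals. The FBP and slice counts are the measured values above. The L2 capacities (6 MB on V100, 40 MB on A100) and the figure of 10 memory controllers on A100 are my reading of the public whitepapers, not something I measured, so treat them as assumptions.

```python
# Slice-count sanity check.
# Profiler counts come from the measurements above; the L2 sizes and the
# A100 controller count are ASSUMED from the public whitepapers.
KIB = 1024
MIB = 1024 * KIB


def total_slices(groups, slices_per_group):
    """Total L2 slices when every group owns the same number of slices."""
    return groups * slices_per_group


v100_slices = total_slices(8, 8)               # 8 FBPs x 8 slices (measured)
a100_slices_profiler = total_slices(8, 10)     # 8 FBPs x 10 slices (measured)
a100_slices_whitepaper = total_slices(10, 8)   # assumed 10 controllers x 8 slices

# Implied per-slice size on V100, assuming a 6 MB L2.
v100_slice_kib = (6 * MIB // v100_slices) // KIB

# Implied total L2 on A100, using the whitepaper's 512 KB per slice.
a100_l2_mib = (a100_slices_whitepaper * 512 * KIB) // MIB

print("V100 slices:", v100_slices, "implied KB per slice:", v100_slice_kib)
print("A100 slices (profiler view):", a100_slices_profiler)
print("A100 slices (whitepaper view):", a100_slices_whitepaper)
print("A100 implied L2 size in MB:", a100_l2_mib)
```

Both A100 groupings give 80 slices, so the totals agree even though 8 x 10 and 10 x 8 partition them differently. That may mean the profiler's FBP is not the same unit as the whitepaper's memory controller on A100, but I cannot confirm this.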
However, I cannot make sense of FBPA.
The profiler documentation says:
"fbpa: The FrameBuffer Partition is a memory controller which sits between the level 2 cache (LTC) and the DRAM."
I also see fbpa__* metrics. On V100, for example:
- fbpa__dram_sectors.avg.peak_sustained = 2
- fbpa__dram_sectors.sum.peak_sustained = 32
And since sum / avg = 16, this seems to imply there are 16 fbpa instances.
That suggests something like:
- 8 FBPs
- 2 FBPAs per FBP
This seems consistent with older P100 documentation mentioning 1 FBP = 2 FBPAs, but I still do not know what FBPA actually corresponds to in hardware.
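The arithmetic behind that inference can be written out as a small sketch. It assumes that, for a per-instance metric, .sum is the total over all unit instances and .avg is the mean per instance, so their ratio is the instance count. The helper name instance_count is made up for illustration.

```python
# Infer the number of FBPA instances from the V100 values quoted above.
# ASSUMPTION: .sum is the total over instances and .avg is the per-instance
# mean, so sum / avg equals the number of instances.
def instance_count(sum_value, avg_value):
    """Return sum / avg as an integer instance count (hypothetical helper)."""
    count, remainder = divmod(sum_value, avg_value)
    if remainder != 0:
        raise ValueError("sum is not an integer multiple of avg")
    return count


fbp_count = 8                            # device__attribute_fbp_count on V100
fbpa_instances = instance_count(32, 2)   # fbpa__dram_sectors sum / avg
fbpa_per_fbp = fbpa_instances // fbp_count

print("FBPA instances:", fbpa_instances)
print("FBPAs per FBP:", fbpa_per_fbp)
```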
I also have a bandwidth-related question.
V100 uses 4 HBM2 stacks with 32 channels in total. If we think about the path between DRAM and the memory controllers, then a theoretical peak of 32 sectors/cycle (one sector per channel per cycle) seems reasonable.
But if fbpa is an on-chip entity, I would expect it to run in a higher clock domain than DRAM. In that case, why does fbpa__dram_sectors.sum.peak_sustained also show a peak of exactly 32 sectors/cycle? That part does not make sense to me.
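To make the puzzle concrete, here is a rough calculation of what 32 sectors/cycle would mean in bytes per second. It assumes a 32-byte sector, a 4096-bit double-data-rate HBM2 interface, and an 877 MHz memory clock for V100. These figures are my reading of public specifications and are not confirmed by anything above.

```python
# What does 32 sectors/cycle imply, if "cycle" is a DRAM clock cycle?
# ASSUMPTIONS (not from the profiler output): 32-byte sectors, a 4096-bit
# HBM2 interface with double data rate, and an 877 MHz memory clock on V100.
SECTOR_BYTES = 32
PEAK_SECTORS_PER_CYCLE = 32        # fbpa__dram_sectors.sum.peak_sustained
BUS_WIDTH_BITS = 4096              # assumed: 4 stacks x 1024 bits
DRAM_CLOCK_HZ = 877e6              # assumed V100 HBM2 clock

# Bytes per cycle according to the profiler's peak.
bytes_per_cycle_metric = PEAK_SECTORS_PER_CYCLE * SECTOR_BYTES

# Bytes per DRAM clock according to the pin-level interface (2 transfers/clock).
bytes_per_cycle_pins = BUS_WIDTH_BITS // 8 * 2

# Bandwidth if the metric's cycle is the DRAM clock.
bandwidth_gb_s = bytes_per_cycle_metric * DRAM_CLOCK_HZ / 1e9

print("bytes/cycle from metric:", bytes_per_cycle_metric)
print("bytes/cycle from pins:  ", bytes_per_cycle_pins)
print("implied bandwidth (GB/s):", bandwidth_gb_s)
```

Under these assumptions the two per-cycle figures coincide, and the implied bandwidth is close to the roughly 900 GB/s usually quoted for V100. That would be consistent with the fbpa peak being expressed per DRAM clock cycle rather than per on-chip clock cycle, but this is only a guess on my part.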
My questions are:
- What exactly are FBP and FBPA?
- Is it reasonable to interpret FBP as one memory controller plus several L2 slices?
- What is the relationship between FBPA, FBP, and the memory controller?
- Why does fbpa__dram_sectors on V100 also show a peak limit of 32 sectors/cycle?