I’m using a 4060 laptop GPU, which has a theoretical DRAM bandwidth of 256 GB/s.
I now have a kernel that I’m analyzing using NCU.
In GPU Speed Of Light Throughput
dram__cycles_active.avg.pct_of_peak_sustained_elapsed is 98.03%
fbpa__dram_sectors.avg.pct_of_peak_sustained_elapsed is 62.79%
In Memory Workload Analysis
dram__bytes.sum.per_second is 250.75 GB/s, which is approximately 98% of the theoretical maximum bandwidth.
I want to know what the relationship is between these three indicators.
Why is fbpa__dram_sectors.avg.pct_of_peak_sustained_elapsed so small?
Which metric should be focused on during performance tuning to achieve maximum bandwidth?
dram__cycles_active.avg.pct_of_peak_sustained_elapsed is the preferred metric because its denominator is measured on the memory clock.
fbpa__dram_sectors.avg.pct_of_peak_sustained_elapsed is captured at an upstream unit that runs on a different clock, often at a higher frequency than the memory clock, which results in a lower % of SOL. The lower value does not indicate a problem: the L2 to memory controller interface is not the bottleneck.
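To illustrate the clock-domain effect, here is a minimal Python sketch. It assumes that both units count the same number of sectors and that each could accept at most one sector per cycle of its own clock. The counter values and the 1.5x clock ratio are hypothetical, chosen only to show the shape of the effect. They are not measurements from the 4060.

```python
def pct_of_peak(events, elapsed_cycles, peak_events_per_cycle=1.0):
    """Percentage of peak for a counter, relative to cycles elapsed on its own clock."""
    return 100.0 * events / (elapsed_cycles * peak_events_per_cycle)

# Hypothetical counter values for one kernel launch.
sectors = 980_000              # same sector count observed at both units
mem_clock_cycles = 1_000_000   # cycles elapsed on the memory clock
upstream_clock_ratio = 1.5     # assumed: upstream unit clock is 1.5x the memory clock
upstream_cycles = int(mem_clock_cycles * upstream_clock_ratio)

dram_pct = pct_of_peak(sectors, mem_clock_cycles)   # memory-clock denominator
fbpa_pct = pct_of_peak(sectors, upstream_cycles)    # faster-clock denominator

print(f"dram-side pct of peak: {dram_pct:.2f}%")
print(f"fbpa-side pct of peak: {fbpa_pct:.2f}%")

# Under the same assumptions, the ratio of the two percentages reported
# in the question would correspond to the ratio of the two clocks.
implied_ratio = 98.03 / 62.79
print(f"implied clock ratio: {implied_ratio:.3f}")
```

The same traffic yields a smaller percentage simply because the denominator grows with the faster clock, which is why the memory-clock metric is the one to watch.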
The sector_{read, write} count from both locations should match (± small error if realtime counters are used).
dram__bytes.sum == dram__sectors.sum * bytes_per_sector
dram__cycles_active.sum == dram__sectors.sum
dram__throughput = dram__cycles_active.sum / dram__cycles_elapsed * 100 (this varies on some chips that have an additional clock divider)
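The relations above can be checked with a short Python sketch. It assumes 32-byte sectors, which is the sector size on NVIDIA GPUs, and it ignores the extra clock divider mentioned for some chips. The raw counter values are hypothetical; only the 250.75 GB/s and 256 GB/s figures come from the question.

```python
BYTES_PER_SECTOR = 32  # assumed sector size in bytes

def dram_bytes(sectors):
    """dram__bytes.sum from dram__sectors.sum."""
    return sectors * BYTES_PER_SECTOR

def dram_throughput_pct(cycles_active, cycles_elapsed):
    """dram__throughput as a percentage, without any extra clock divider."""
    return 100.0 * cycles_active / cycles_elapsed

# Hypothetical raw counters for one kernel launch.
sectors = 1_000_000
cycles_elapsed = 1_020_000
cycles_active = sectors  # per the relation dram__cycles_active.sum == dram__sectors.sum

total_bytes = dram_bytes(sectors)
throughput = dram_throughput_pct(cycles_active, cycles_elapsed)
print(f"bytes moved: {total_bytes}")
print(f"dram throughput: {throughput:.2f}%")

# Cross-check using the numbers reported in the question.
measured_gb_per_s = 250.75
theoretical_gb_per_s = 256.0
pct_of_theoretical = 100.0 * measured_gb_per_s / theoretical_gb_per_s
print(f"measured bandwidth vs theoretical: {pct_of_theoretical:.2f}%")
```

The last figure lands close to the 98.03% reported for dram__cycles_active, which is consistent with treating that metric as the bandwidth utilization.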