So Nsight Compute reports something called SOL L1/TEX Cache [%], which seems to be the L1 cache throughput. But what does throughput mean for a cache? I’d think hit rate is a more common metric. How do we calculate a cache’s throughput? If it means the number of bytes per second (or per cycle), a higher cache throughput doesn’t seem to imply better performance: a program that rarely accesses the cache would simply show low throughput, right?
When you hover over the entry in the UI, the table shows the metric’s name and description, confirming that this is indeed the throughput of the L1/TEX cache. The underlying metric is l1tex__throughput.avg.pct_of_peak_sustained_active. This is different from the hit rate, which is shown in the Memory Workload Analysis section as L1/TEX Hit Rate (l1tex__t_sector_hit_rate.pct).
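To make the difference concrete: a hit rate is a ratio of outcomes (hits over lookups, counted per 32-byte sector), not a rate of work over time. Below is a minimal Python sketch under that simplified definition. The function name and the counts are hypothetical, and it does not claim which exact raw counters Nsight Compute divides to produce l1tex__t_sector_hit_rate.pct.

```python
def sector_hit_rate_pct(sector_hits: int, sector_lookups: int) -> float:
    """Share of L1/TEX sector lookups that hit, in percent (simplified model)."""
    if sector_lookups == 0:
        return 0.0  # no lookups at all: report 0 instead of dividing by zero
    return 100.0 * sector_hits / sector_lookups


# Hypothetical counts: 900 of 1,000 sector lookups hit.
print(sector_hit_rate_pct(900, 1_000))  # 90.0
```

Note that nothing in this ratio depends on time or cycles, so a kernel that touches L1 only a handful of times can show a 100% hit rate while the cache’s throughput utilization stays near zero.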
The throughput metrics in the GPU Speed Of Light section are not meant to indicate good or bad performance; they measure the unit’s utilization. Consider that multiple other HW clients access this cache: as utilization approaches the maximum, the cache’s ability to handle additional incoming requests without delays or stalls decreases.
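As a rough illustration of what the pct_of_peak_sustained_active suffix expresses, here is a minimal Python sketch assuming a simplified definition: work the unit completed, divided by the most work it could have sustained over the cycles it was active. The function name, the peak rate, and the cycle counts are hypothetical, not Nsight Compute output; the real .avg rollup additionally averages over unit instances.

```python
def pct_of_peak_sustained_active(work_done: float, peak_per_cycle: float,
                                 active_cycles: int) -> float:
    """Percent of the peak sustained rate achieved while the unit was active.

    Simplified model (assumption): a single unit instance with a fixed
    peak rate per cycle and a known number of active cycles.
    """
    if active_cycles <= 0:
        return 0.0
    return 100.0 * work_done / (peak_per_cycle * active_cycles)


# Hypothetical numbers: a sub-unit that can accept 1 request per cycle,
# active for 10,000 cycles, handled 903 requests.
print(pct_of_peak_sustained_active(903, 1, 10_000))  # about 9.03
```

Under this model, 100% would mean the unit accepted work at its peak sustained rate on every active cycle, leaving no headroom for other clients; a low value only says there was spare capacity.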
As with most throughput metrics, l1tex__throughput is the maximum of a list of contributing sub-metrics. You can see (and collect) the breakdown of contributing metrics with
ncu --metrics breakdown::l1tex__throughput.avg.pct_of_peak_sustained_active <app>
[...]
l1tex__data_bank_reads.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_bank_writes.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 4.86
l1tex__data_pipe_tex_wavefronts.avg.pct_of_peak_sustained_active % 0
l1tex__f_wavefronts.avg.pct_of_peak_sustained_active % 0.15
l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active % 3.39
l1tex__lsuin_requests.avg.pct_of_peak_sustained_active % 9.03
l1tex__m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 5.99
l1tex__m_xbar2l1tex_read_sectors.avg.pct_of_peak_sustained_active % 9.02
l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active % 0
l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_active % 0.15
tpc__l1tex_data_bank_reads.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_bank_writes.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 5.28
tpc__l1tex_m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 6.50
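Since the overall value is the maximum of its contributors, finding the limiting sub-unit amounts to taking the max over the breakdown. Here is a small Python sketch that parses the listed lines and does exactly that. The helper names are hypothetical, and because the output was elided ("[...]"), it only covers the contributors shown.

```python
# Contributor lines from the breakdown output in the post ("[...]" was elided).
BREAKDOWN_TEXT = """
[...]
l1tex__data_bank_reads.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_bank_writes.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 4.86
l1tex__data_pipe_tex_wavefronts.avg.pct_of_peak_sustained_active % 0
l1tex__f_wavefronts.avg.pct_of_peak_sustained_active % 0.15
l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active % 3.39
l1tex__lsuin_requests.avg.pct_of_peak_sustained_active % 9.03
l1tex__m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 5.99
l1tex__m_xbar2l1tex_read_sectors.avg.pct_of_peak_sustained_active % 9.02
l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active % 0
l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_active % 0.15
tpc__l1tex_data_bank_reads.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_bank_writes.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 5.28
tpc__l1tex_m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 6.50
"""


def parse_breakdown(text: str) -> dict[str, float]:
    """Parse 'metric_name unit value' lines into {metric_name: value}."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and the elision marker "[...]"
        name, _unit, value = parts
        values[name] = float(value)
    return values


def limiting_submetric(values: dict[str, float]) -> tuple[str, float]:
    """Return the contributor with the highest utilization."""
    name = max(values, key=values.get)
    return name, values[name]


name, pct = limiting_submetric(parse_breakdown(BREAKDOWN_TEXT))
print(name, pct)
```

Among the listed lines, l1tex__lsuin_requests (9.03%) is the highest, with l1tex__m_xbar2l1tex_read_sectors (9.02%) close behind. This also suggests how to read 100% for l1tex__throughput: at least one contributing sub-unit would be running at its peak sustained rate, which is not the same as reading data from L1 on every cycle.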
I see. Thanks for the reply. I didn’t know that l1tex__t_sector_hit_rate.pct represents the hit rate, as I didn’t know what a “sector” means. Would you mind elaborating on that?
I also have a better understanding of the throughput metric now, and was wondering what 100% throughput means. Does it mean that we read data from the L1 cache every cycle?
You can find an explanation of sector and many other terms in the Kernel Profiling Guide.
sector: Aligned 32-byte chunk of memory in a cache line or device memory. An L1 or L2 cache line is four sectors, i.e. 128 bytes. Sector accesses are classified as hits if the tag is present and the sector-data is present within the cache line. Tag-misses and tag-hit-data-misses are all classified as misses.
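To see how the 32-byte sector and 128-byte line sizes from that definition play out, here is a small Python sketch that counts the distinct sectors and cache lines touched by a set of byte addresses, such as the addresses accessed by the 32 threads of a warp. The helper names are hypothetical, and this is only address arithmetic: it does not model how the hardware actually splits requests or decides hits and misses.

```python
SECTOR_BYTES = 32   # a sector is an aligned 32-byte chunk
LINE_BYTES = 128    # an L1/L2 cache line is four sectors
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES


def sectors_touched(addresses: list[int], access_bytes: int) -> set[int]:
    """Indices of all 32-byte sectors covered by the given accesses."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return sectors


def lines_touched(sectors: set[int]) -> set[int]:
    """Indices of the 128-byte cache lines containing those sectors."""
    return {s // SECTORS_PER_LINE for s in sectors}


# 32 threads each reading one 4-byte float.
contiguous = [4 * i for i in range(32)]          # 128 bytes, aligned
misaligned = [16 + 4 * i for i in range(32)]     # same size, shifted by 16 bytes
strided = [32 * i for i in range(32)]            # one float per 32 bytes

for name, addrs in [("contiguous", contiguous), ("misaligned", misaligned),
                    ("strided", strided)]:
    s = sectors_touched(addrs, 4)
    print(name, len(s), "sectors,", len(lines_touched(s)), "lines")
```

With these assumptions, the contiguous pattern covers 4 sectors in 1 line, shifting it by 16 bytes spills into 5 sectors across 2 lines, and the 32-byte stride needs 32 sectors across 8 lines for the same 128 bytes of useful data.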
I also recommend watching this GTC video for an excellent overview of the GPU caches and the associated terms and metrics in Nsight Compute.
Thank you.