Understanding cache throughput in Nsight

boringboringarsenal · July 29, 2021, 4:15pm

So Nsight reports something called SOL L1/TEX Cache [%], which seems to be the L1 cache throughput. But what does throughput mean for a cache? I’d think hit rate is the more common metric. How is a cache’s throughput calculated? If it means the number of bytes per second or per cycle, a higher throughput doesn’t seem to imply better performance: a program that rarely accesses the cache would show low throughput regardless of how well it runs.

felix_dt · July 30, 2021, 3:15pm

You can see the description and name of the metric in the table when hovering over the entry in the UI, which verifies that this is indeed the throughput of the L1/TEX cache. The underlying metric name is l1tex__throughput.avg.pct_of_peak_sustained_active. This is indeed different from the hit rate, which is shown in the Memory Workload Analysis section as L1/TEX Hit Rate (l1tex__t_sector_hit_rate.pct).

The idea of the throughput metrics in the GPU Speed Of Light section is not so much to show good or bad performance as to measure the unit’s utilization. Consider that multiple other HW clients access this cache, and as its utilization approaches the maximum, its ability to handle additional incoming requests without delays/stalls decreases.
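To make the utilization reading concrete, here is a minimal sketch of how a value with the `.pct_of_peak_sustained_active` suffix can be interpreted: the work a unit performed, relative to the most it could sustain over the cycles in which it was active. The function name, the counter values, and the peak rate of one event per cycle are all hypothetical and chosen only for illustration; they are not taken from any real profile.

```python
def pct_of_peak_sustained_active(events, peak_per_cycle, active_cycles):
    """Share of the peak sustained rate reached while the unit was active, in percent.

    events         -- number of events the unit processed (hypothetical counter)
    peak_per_cycle -- most events the unit can sustain per cycle (assumed value)
    active_cycles  -- cycles in which the unit was active
    """
    if active_cycles == 0 or peak_per_cycle == 0:
        return 0.0
    return 100.0 * events / (peak_per_cycle * active_cycles)


# Hypothetical numbers: a heavily used unit and a lightly used one.
busy = pct_of_peak_sustained_active(events=900, peak_per_cycle=1, active_cycles=1000)
light = pct_of_peak_sustained_active(events=50, peak_per_cycle=1, active_cycles=1000)

print(f"busy unit:  {busy:.1f}% of peak")
print(f"light unit: {light:.1f}% of peak")
```

Under this reading, a low percentage only says the unit has headroom, and a value near 100% says the unit has little room left for additional requests. Neither says by itself whether the kernel performs well.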

Like most throughput metrics, l1tex__throughput represents the maximum of a list of contributing sub-metrics. You can see (and collect) the breakdown of contributor metrics with:

ncu --metrics breakdown::l1tex__throughput.avg.pct_of_peak_sustained_active <app>
[...]
l1tex__data_bank_reads.avg.pct_of_peak_sustained_active                              %                           1.69
l1tex__data_bank_writes.avg.pct_of_peak_sustained_active                             %                           1.69
l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active                     %                           4.86
l1tex__data_pipe_tex_wavefronts.avg.pct_of_peak_sustained_active                     %                              0
l1tex__f_wavefronts.avg.pct_of_peak_sustained_active                                 %                           0.15
l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active                         %                           3.39
l1tex__lsuin_requests.avg.pct_of_peak_sustained_active                               %                           9.03
l1tex__m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active               %                           5.99
l1tex__m_xbar2l1tex_read_sectors.avg.pct_of_peak_sustained_active                    %                           9.02
l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active                         %                              0
l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_active               %                           0.15
tpc__l1tex_data_bank_reads.avg.pct_of_peak_sustained_active                          %                           1.22
tpc__l1tex_data_bank_writes.avg.pct_of_peak_sustained_active                         %                           1.22
tpc__l1tex_data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active                 %                           5.28
tpc__l1tex_m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active           %                           6.50
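The "maximum of the contributing sub-metrics" rule can be sketched in a few lines of Python. This is only an illustration that parses the lines quoted above and picks the largest value. The helper name `parse_breakdown` is made up for this example, and since the quoted output is truncated (`[...]`), the listed lines may not be the complete set of contributors.

```python
# Breakdown lines copied from the ncu output above (column padding collapsed).
breakdown_text = """\
l1tex__data_bank_reads.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_bank_writes.avg.pct_of_peak_sustained_active % 1.69
l1tex__data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 4.86
l1tex__data_pipe_tex_wavefronts.avg.pct_of_peak_sustained_active % 0
l1tex__f_wavefronts.avg.pct_of_peak_sustained_active % 0.15
l1tex__lsu_writeback_active.avg.pct_of_peak_sustained_active % 3.39
l1tex__lsuin_requests.avg.pct_of_peak_sustained_active % 9.03
l1tex__m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 5.99
l1tex__m_xbar2l1tex_read_sectors.avg.pct_of_peak_sustained_active % 9.02
l1tex__tex_writeback_active.avg.pct_of_peak_sustained_active % 0
l1tex__texin_sm2tex_req_cycles_active.avg.pct_of_peak_sustained_active % 0.15
tpc__l1tex_data_bank_reads.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_bank_writes.avg.pct_of_peak_sustained_active % 1.22
tpc__l1tex_data_pipe_lsu_wavefronts.avg.pct_of_peak_sustained_active % 5.28
tpc__l1tex_m_l1tex2xbar_req_cycles_active.avg.pct_of_peak_sustained_active % 6.50
"""


def parse_breakdown(text):
    """Return {metric_name: percent} for lines shaped like '<name> % <value>'."""
    metrics = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1] != "%":
            continue  # skip anything that is not a metric row
        metrics[parts[0]] = float(parts[2])
    return metrics


metrics = parse_breakdown(breakdown_text)

# The throughput metric is described as the maximum over its contributors.
limiter, value = max(metrics.items(), key=lambda item: item[1])
print(f"largest contributor: {limiter} = {value}%")
```

For the lines shown, the largest contributor is l1tex__lsuin_requests at 9.03%, so among the quoted sub-metrics that is the one that would set the reported throughput.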

boringboringarsenal · July 30, 2021, 3:22pm

I see. Thanks for the reply. I didn’t know that l1tex__t_sector_hit_rate.pct represents the hit rate, as I didn’t know what a “sector” means. Would you mind elaborating on that?

I also have a better understanding of the throughput metric now, and was wondering what 100% throughput means. Does it mean that we read data from the L1 cache every cycle?

felix_dt · July 30, 2021, 3:33pm

You can find an explanation of sector and many other terms in the Kernel Profiling Guide.

sector: Aligned 32-byte chunk of memory in a cache line or device memory. An L1 or L2 cache line is four sectors, i.e. 128 bytes. Sector accesses are classified as hits if the tag is present and the sector data is present within the cache line. Tag-misses and tag-hit-data-misses are all classified as misses.
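The arithmetic behind that definition can be sketched as follows. The 32-byte sector and 128-byte line sizes come from the definition above; everything else (the function names, the two example access patterns with 32 threads and 4-byte elements, and the hit/miss counts) is hypothetical and only meant to show how addresses map to sectors and how a sector-based hit rate would be computed, assuming the hit rate is hit sectors divided by all sector accesses.

```python
SECTOR_BYTES = 32                                # from the definition above
LINE_BYTES = 128                                 # four sectors per cache line
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES


def locate(address):
    """Return (cache line index, sector index within that line) for a byte address."""
    return address // LINE_BYTES, (address % LINE_BYTES) // SECTOR_BYTES


def sectors_touched(addresses, access_bytes=4):
    """Return the set of distinct global sector indices covered by the accesses."""
    sectors = set()
    for address in addresses:
        first = address // SECTOR_BYTES
        last = (address + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return sectors


def sector_hit_rate_pct(hit_sectors, miss_sectors):
    """Hit rate in percent, counting tag misses and data misses alike as misses."""
    total = hit_sectors + miss_sectors
    return 100.0 * hit_sectors / total if total else 0.0


# Hypothetical access patterns: 32 threads, each reading one 4-byte element.
consecutive = [4 * i for i in range(32)]         # neighbouring elements
strided = [LINE_BYTES * i for i in range(32)]    # one element per cache line

print(locate(100))                               # byte 100 -> line 0, sector 3
print(len(sectors_touched(consecutive)))         # 128 contiguous bytes -> 4 sectors
print(len(sectors_touched(strided)))             # 32 separate sectors
print(sector_hit_rate_pct(hit_sectors=3, miss_sectors=1))
```

The two patterns request the same amount of useful data, but the strided one touches eight times as many sectors, which is why sector counts are a more informative unit than request counts.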

I also recommend watching this GTC video for an excellent overview of the GPU caches and the associated terms and metrics in Nsight Compute.

boringboringarsenal · July 30, 2021, 3:41pm

Thank you.
