Average L1, L2 Cache Hit Rate Across All Kernels

Hi,
In my design I currently have 5 kernels, and I can check their L1 and L2 hit rates individually in the memory chart by selecting each kernel from the dropdown menu. I would like to get the application-wide average of the L1 and L2 hit rates. Is this possible?

There is a topic about this from 3 years ago, and I was wondering whether this has been added as a feature since then, and if not, how I could get this data in an alternative way?

I found one way: exporting the CSV of l1tex__t_sector_hit_rate and lts__t_sector_hit_rate from the “Raw” tab and then taking the average of all values with a simple Python script.

Is there a less cumbersome way?

You can copy/export the values from the UI, you can pass --csv to the ncu command line to print the results in CSV format, or you can use the Python Report Interface to access metric values from an existing report file. Finally, the UI’s Raw page also shows the aggregate of the selected values in the bottom status bar, if that is helpful.
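For the Python Report Interface, a minimal sketch could look like the following. The report file name profile.ncu-rep and the .pct submetric suffix are assumptions (adjust them to match your report); the ncu_report module ships with Nsight Compute under extras/python.

import ncu_report

# Load an existing Nsight Compute report (file name is an assumption).
report = ncu_report.load_report("profile.ncu-rep")

for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        # Each action is one profiled kernel launch.
        kernel = profiled_range.action_by_idx(action_idx)
        # Metric names should match the Raw page; the .pct suffix is an
        # assumption, adjust it to whatever your report actually shows.
        l1_hit = kernel.metric_by_name("l1tex__t_sector_hit_rate.pct")
        l2_hit = kernel.metric_by_name("lts__t_sector_hit_rate.pct")
        if l1_hit is not None and l2_hit is not None:
            print(kernel.name(), l1_hit.as_double(), l2_hit.as_double())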

It is not recommended to take the average of the per-kernel averages, as the magnitude of each kernel’s traffic is lost.

The first formula is the one used in NCU. The second is the one recommended if you want the correct value across kernels. You can collect the additional metrics used in the MULTI_KERNEL formula with --metrics.

l1tex__t_sector_hit_rate                = l1tex__t_sectors_lookup_hit.avg / l1tex__t_sectors.avg
MULTI_KERNEL_l1tex__t_sector_hit_rate   = SUM_KERNELS(l1tex__t_sectors_lookup_hit.sum) / SUM_KERNELS(l1tex__t_sectors.sum)

--metrics=l1tex__t_sectors_lookup_hit.sum,l1tex__t_sectors.sum

lts__t_sector_hit_rate                  = lts__t_sectors_lookup_hit.avg / lts__t_sectors.avg
MULTI_KERNEL_lts__t_sector_hit_rate     = SUM_KERNELS(lts__t_sectors_lookup_hit.sum) / SUM_KERNELS(lts__t_sectors.sum)

--metrics=lts__t_sectors_lookup_hit.sum,lts__t_sectors.sum
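As an unofficial sketch, the MULTI_KERNEL formulas above can be evaluated over an existing report with the Python Report Interface. The report file name is an assumption, and the report is assumed to contain the four .sum metrics listed above for every kernel.

import ncu_report

report = ncu_report.load_report("profile.ncu-rep")

# Sum the raw sector counters over every profiled kernel launch.
totals = {
    "l1tex__t_sectors_lookup_hit.sum": 0.0,
    "l1tex__t_sectors.sum": 0.0,
    "lts__t_sectors_lookup_hit.sum": 0.0,
    "lts__t_sectors.sum": 0.0,
}

for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        kernel = profiled_range.action_by_idx(action_idx)
        for name in totals:
            metric = kernel.metric_by_name(name)
            if metric is not None:
                totals[name] += metric.as_double()

# MULTI_KERNEL hit rates per the formulas above.
l1_hit_rate = totals["l1tex__t_sectors_lookup_hit.sum"] / totals["l1tex__t_sectors.sum"]
l2_hit_rate = totals["lts__t_sectors_lookup_hit.sum"] / totals["lts__t_sectors.sum"]
print("multi-kernel L1 hit rate: {:.2%}".format(l1_hit_rate))
print("multi-kernel L2 hit rate: {:.2%}".format(l2_hit_rate))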

Hi @Greg,

Thank you for this information. How about the throughput data for L1, L2, and DRAM?

My current strategy is:

  • For DRAM throughput:
AVERAGE_KERNELS(gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed)
  • For L2 ↔ Device Memory throughput:
AVERAGE_KERNELS(dram__bytes.sum.per_second + dram__bytes_read.sum.per_second)
  • For L1 ↔ L2 Cache throughput:
AVERAGE_KERNELS(lts__t_sectors_srcunit_tex.sum.per_second)

I highly recommend not using an average of averages.

DRAM_THROUGHPUT = KERNEL_SUM(dram__bytes.sum) / KERNEL_SUM(gpu__time_duration.sum)

// all throughput to/from L2
L2_THROUGHPUT_TOTAL = KERNEL_SUM(lts__t_bytes.sum) / KERNEL_SUM(gpu__time_duration.sum)

// all throughput to/from L2 for DEVICEMEM - a hit to aperture_device may not show up in DRAM_THROUGHPUT
L2_THROUGHPUT_DEVICEMEM = KERNEL_SUM(lts__t_sectors_aperture_device.sum) * 32 / KERNEL_SUM(gpu__time_duration.sum)

The L2_THROUGHPUT can be adjusted for different apertures, srcunits, and ops.
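A similar sketch for the throughput formulas, again using the Python Report Interface against an existing report. The report file name is an assumption, the report is assumed to contain the listed metrics for every kernel, and gpu__time_duration.sum is assumed to be reported in nanoseconds (check the unit shown for it in your report), so that bytes per nanosecond equals GB/s.

import ncu_report

report = ncu_report.load_report("profile.ncu-rep")

names = [
    "dram__bytes.sum",
    "lts__t_bytes.sum",
    "lts__t_sectors_aperture_device.sum",
    "gpu__time_duration.sum",
]
totals = {name: 0.0 for name in names}

# KERNEL_SUM: accumulate each metric over every profiled kernel launch.
for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        kernel = profiled_range.action_by_idx(action_idx)
        for name in names:
            metric = kernel.metric_by_name(name)
            if metric is not None:
                totals[name] += metric.as_double()

# Assumes gpu__time_duration.sum is in nanoseconds, so bytes/ns == GB/s.
duration_ns = totals["gpu__time_duration.sum"]
dram_throughput_gbps = totals["dram__bytes.sum"] / duration_ns
l2_total_gbps = totals["lts__t_bytes.sum"] / duration_ns
l2_devicemem_gbps = totals["lts__t_sectors_aperture_device.sum"] * 32 / duration_ns

print("DRAM_THROUGHPUT:         {:.2f} GB/s".format(dram_throughput_gbps))
print("L2_THROUGHPUT_TOTAL:     {:.2f} GB/s".format(l2_total_gbps))
print("L2_THROUGHPUT_DEVICEMEM: {:.2f} GB/s".format(l2_devicemem_gbps))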


Hi @Greg, one quick question: gpu__time_duration.sum is roughly half of the duration I capture with std::chrono::steady_clock::now(). Which one would be the more reliable one to use for the bandwidth calculation?

I assume you mean a std::chrono measurement on the host.

auto start = std::chrono::steady_clock::now();
foo<<<blocks,threads_per_block,...>>>(...);
cudaDeviceSynchronize();
auto end = std::chrono::steady_clock::now();

This timing will include all of the driver (and, on Windows, potentially kernel-mode) overhead. This can add 8-10 microseconds, if not more.

gpu__time_duration.sum is measured from when the front end issues the launch to when the launch has completed all thread blocks and completed the membar on any in-flight information. If you are optimizing a kernel, you want to use the device time. If you are optimizing the application, for example because you don’t have sufficient work queued to the GPU, then you may want to use host timing. If possible, the application should queue up work to the GPU and minimize synchronization that limits GPU activity.

