Average L1, L2 Cache Hit Rate Across All Kernels

Hi,
In my design I currently have 5 kernels, and I can check their L1 and L2 hit rates individually in the memory chart by selecting each kernel from the dropdown menu. I would like to get the application-wide average of the L1 and L2 hit rates. Is this possible?

There is a topic about this from 3 years ago, and I was wondering whether this has been added as a feature since then, and if not, how I could get this data in an alternative way?

I found one way: exporting the CSV of l1tex__t_sector_hit_rate and lts__t_sector_hit_rate from the “Raw” tab and then taking the average of all values with a simple Python script.

Is there a less cumbersome way?

You can copy/export the values from the UI, you can pass --csv to the ncu command line to print the results in CSV format, or you can use the Python Report Interface to access metric values from an existing report file. Finally, the UI’s Raw page also shows the aggregate of the selected values in the bottom status bar, if that is helpful.
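For the Python Report Interface, a minimal sketch could look like the following. The report file name profile.ncu-rep and the .pct submetric suffix are assumptions (adjust them to match your report); the ncu_report module ships with Nsight Compute under extras/python.

import ncu_report

# Load an existing Nsight Compute report (file name is an assumption).
report = ncu_report.load_report("profile.ncu-rep")

for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        # Each action is one profiled kernel launch.
        kernel = profiled_range.action_by_idx(action_idx)
        # Metric names should match the Raw page; the .pct suffix is an
        # assumption, adjust it to whatever your report actually shows.
        l1_hit = kernel.metric_by_name("l1tex__t_sector_hit_rate.pct")
        l2_hit = kernel.metric_by_name("lts__t_sector_hit_rate.pct")
        if l1_hit is not None and l2_hit is not None:
            print(kernel.name(), l1_hit.as_double(), l2_hit.as_double())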

It is not recommended to take the average of the per-kernel averages, as the magnitude of each kernel’s traffic is lost.

The first formula is the one used in NCU. The second is the one recommended if you want the correct value across kernels. You can collect the additional metrics used in the MULTI_KERNEL formula with --metrics.

l1tex__t_sector_hit_rate                = l1tex__t_sectors_lookup_hit.avg / l1tex__t_sectors.avg
MULTI_KERNEL_l1tex__t_sector_hit_rate   = SUM_KERNELS(l1tex__t_sectors_lookup_hit.sum) / SUM_KERNELS(l1tex__t_sectors.sum)

--metrics=l1tex__t_sectors_lookup_hit.sum,l1tex__t_sectors.sum

lts__t_sector_hit_rate                  = lts__t_sectors_lookup_hit.avg / lts__t_sectors.avg
MULTI_KERNEL_lts__t_sector_hit_rate     = SUM_KERNELS(lts__t_sectors_lookup_hit.sum) / SUM_KERNELS(lts__t_sectors.sum)

--metrics=lts__t_sectors_lookup_hit.sum,lts__t_sectors.sum
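As an unofficial sketch, the MULTI_KERNEL formulas above can be evaluated over an existing report with the Python Report Interface. The report file name is an assumption, and the report is assumed to contain the four .sum metrics listed above for every kernel.

import ncu_report

report = ncu_report.load_report("profile.ncu-rep")

# Sum the raw sector counters over every profiled kernel launch.
totals = {
    "l1tex__t_sectors_lookup_hit.sum": 0.0,
    "l1tex__t_sectors.sum": 0.0,
    "lts__t_sectors_lookup_hit.sum": 0.0,
    "lts__t_sectors.sum": 0.0,
}

for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        kernel = profiled_range.action_by_idx(action_idx)
        for name in totals:
            metric = kernel.metric_by_name(name)
            if metric is not None:
                totals[name] += metric.as_double()

# MULTI_KERNEL hit rates per the formulas above.
l1_hit_rate = totals["l1tex__t_sectors_lookup_hit.sum"] / totals["l1tex__t_sectors.sum"]
l2_hit_rate = totals["lts__t_sectors_lookup_hit.sum"] / totals["lts__t_sectors.sum"]
print("multi-kernel L1 hit rate: {:.2%}".format(l1_hit_rate))
print("multi-kernel L2 hit rate: {:.2%}".format(l2_hit_rate))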

Hi @Greg,

Thank you for this information. How about the throughput data for L1, L2, and DRAM?

My current strategy is:

  • For DRAM throughput:
AVERAGE_KERNELS(gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed)
  • For L2 ↔ Device Memory throughput:
AVERAGE_KERNELS(dram__bytes.sum.per_second + dram__bytes_read.sum.per_second)
  • For L1 ↔ L2 Cache throughput:
AVERAGE_KERNELS(lts__t_sectors_srcunit_tex.sum.per_second)

I highly recommend not using an average of averages.

DRAM_THROUGHPUT = KERNEL_SUM(dram__bytes.sum) / KERNEL_SUM(gpu__time_duration.sum)

// all throughput to/from L2
L2_THROUGHPUT_TOTAL = KERNEL_SUM(lts__t_bytes.sum) / KERNEL_SUM(gpu__time_duration.sum)

// all throughput to/from L2 for DEVICEMEM - a hit to aperture_device may not show up in DRAM_THROUGHPUT
L2_THROUGHPUT_DEVICEMEM = KERNEL_SUM(lts__t_sectors_aperture_device.sum) * 32 / KERNEL_SUM(gpu__time_duration.sum)

The L2_THROUGHPUT can be adjusted for different apertures, srcunits, and ops.
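A similar sketch for the throughput formulas, again using the Python Report Interface against an existing report. The report file name is an assumption, the report is assumed to contain the listed metrics for every kernel, and gpu__time_duration.sum is assumed to be reported in nanoseconds (check the unit shown for it in your report), so that bytes per nanosecond equals GB/s.

import ncu_report

report = ncu_report.load_report("profile.ncu-rep")

names = [
    "dram__bytes.sum",
    "lts__t_bytes.sum",
    "lts__t_sectors_aperture_device.sum",
    "gpu__time_duration.sum",
]
totals = {name: 0.0 for name in names}

# KERNEL_SUM: accumulate each metric over every profiled kernel launch.
for range_idx in range(report.num_ranges()):
    profiled_range = report.range_by_idx(range_idx)
    for action_idx in range(profiled_range.num_actions()):
        kernel = profiled_range.action_by_idx(action_idx)
        for name in names:
            metric = kernel.metric_by_name(name)
            if metric is not None:
                totals[name] += metric.as_double()

# Assumes gpu__time_duration.sum is in nanoseconds, so bytes/ns == GB/s.
duration_ns = totals["gpu__time_duration.sum"]
dram_throughput_gbps = totals["dram__bytes.sum"] / duration_ns
l2_total_gbps = totals["lts__t_bytes.sum"] / duration_ns
l2_devicemem_gbps = totals["lts__t_sectors_aperture_device.sum"] * 32 / duration_ns

print("DRAM_THROUGHPUT:         {:.2f} GB/s".format(dram_throughput_gbps))
print("L2_THROUGHPUT_TOTAL:     {:.2f} GB/s".format(l2_total_gbps))
print("L2_THROUGHPUT_DEVICEMEM: {:.2f} GB/s".format(l2_devicemem_gbps))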


Hi @Greg, one quick question: gpu__time_duration.sum is roughly half of the duration I capture with std::chrono::steady_clock::now(). Which one would be the more reliable one to use for the bandwidth calculation?

I assume you mean a std::chrono measurement on the host.

auto start = std::chrono::steady_clock::now();
foo<<<blocks,threads_per_block,...>>>(...);
cudaDeviceSynchronize();
auto end = std::chrono::steady_clock::now();

This timing will include all of the driver (and, on Windows, potentially kernel-mode) overhead. This can add 8-10 microseconds, if not more.

gpu__time_duration.sum is measured from when the front end issues the launch to when the launch has completed all thread blocks and completed the membar on any in-flight information. If you are optimizing a kernel, you want to use the device time. If you are optimizing the application, for example because you don’t have sufficient work queued to the GPU, then you may want to use host timing. If possible, the application should queue up work to the GPU and minimize synchronization that limits GPU activity.

