In particular, the hit ratio is introduced as a way to avoid cache thrashing (i.e. useful cache lines being evicted when the access window exceeds the persistence cache size and the hit ratio is 1). In the micro-benchmark from the first link, a hit ratio < 1 gives each memory access a random chance of being treated as either streaming or persisting. If the hit ratio is 0.5, then half of all memory accesses to the access window are treated as streaming, and that potentially useful data is the first to be evicted.
Given that the micro-benchmark code could be considered to make random memory accesses to the persistent cache region (i.e. no pattern), why does a lower hit ratio generate any performance boost if cache lines are still being evicted?
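For reference, a minimal sketch of the kind of setup being discussed (this is not the benchmark from the link; the window size, persistence cache size, pointer and stream names are illustrative):

```cpp
#include <cuda_runtime.h>

int main()
{
    const size_t window_bytes  = 2 * 1024 * 1024;  // 2MB access window
    const size_t persist_bytes = 1 * 1024 * 1024;  // 1MB of L2 set aside for persisting accesses

    void* data = nullptr;
    cudaMalloc(&data, window_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Carve out part of L2 for persisting accesses (requires compute capability 8.0+).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, persist_bytes);

    // Attach an access policy window to the stream: accesses to [data, data + window_bytes)
    // get the persisting property for the hitRatio fraction of the window, streaming otherwise.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = window_bytes;
    attr.accessPolicyWindow.hitRatio  = 0.5f;
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

    // ... launch kernels on `stream` that access `data` ...

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```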
The use of the word “random” there is approximately correct if we posit that the data access pattern is random. However, there is not really a random chance that a given access will be “persistence”, i.e. cached. There is a specific pattern based on the requested address; it is not random.
Think of it this way. For an access window of 2MB, a persistence cache size of 1MB, and a 0.5 hit ratio, the data will be cached in a pattern something like the following (each X or Y represents eight bytes):
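```
XYXYXYXYXYXYXYXYXYXYXYXYXYXYXYXY ... (repeating across the 2MB window)
```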
So your code is accessing this region. If the access falls on an area indicated by X, it will be cached in the persistence cache (“persistence”). If it falls on an area indicated by Y, it will not be cached in the persistence cache (“streaming”).
This is an example that I advance for understanding. Please do not assume that the pattern indicated above is the exact pattern you should expect; it may be different. But the general idea is that some portion of the footprint will be cached, some portion won’t, and the ratio of cached to uncached regions can be inferred from the supplied hit ratio.
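To make that concrete, here is a toy host-side sketch that classifies an offset within the window under the illustrative alternating pattern above. The 8-byte granularity and the even/odd rule are assumptions for this example only, not the actual hardware mapping:

```cpp
#include <cstddef>
#include <cstdio>

// Toy model only: assumes the illustrative alternating 8-byte X/Y layout above
// for a 0.5 hit ratio. The real hardware mapping is unspecified and may differ.
static bool is_persisting(std::size_t byte_offset_in_window)
{
    std::size_t chunk = byte_offset_in_window / 8; // which 8-byte chunk the access lands in
    return (chunk % 2) == 0;                       // even chunks = X (persistence), odd = Y (streaming)
}

int main()
{
    // Under this model, roughly half the 2MB window carries the persisting property.
    const std::size_t window_bytes = 2 * 1024 * 1024;
    std::size_t persisting = 0;
    for (std::size_t off = 0; off < window_bytes; off += 8)
        if (is_persisting(off))
            ++persisting;
    std::printf("persisting fraction: %.2f\n",
                static_cast<double>(persisting) * 8 / window_bytes);
    return 0;
}
```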