NCU profiling shows unexpected results

Hello, I have recently been studying the L2 cache and trying to tune its performance.
AFAIK, if we set the window's base pointer and size and set hitRatio to 1, accesses to that device memory region can only use a limited part of the L2 cache (referred to as the set-aside area in the NVIDIA documentation).
Below is the statement from the NVIDIA CUDA programming documentation:

  • With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window in the set-aside L2 cache area. Since the set-aside area is smaller than the window, cache lines will be evicted to keep the most recently used 16KB of the 32KB data in the set-aside portion of the L2 cache.

However, when I profile the kernel “implicit_convolve_sgemm” with Nsight Compute, I get the same L2 cache hit ratio whether or not L2 is limited, even though I expected a lower hit ratio in the limited case. For the limited run I set the set-aside L2 region to 1 MB (I’m currently using an RTX 3090, which has 6 MB of L2 cache) and put the entire workload (input, weight, output) in the window.

Below is the result.
With the L2 cache not limited, L2 hit ratio: 97.77%
With the L2 cache limited, L2 hit ratio: 97.78%

It seems that either the data in the window is still using the entire L2 cache, or Nsight Compute is not aware of the L2 cache limitation.
Can someone please explain?
Thank you in advance!

The L2 cache on most chips is much larger than the size you have specified. In addition, you are already at a 97.77% L2 hit rate, so there is not likely to be a reduction or a significant gain from the evict-last policy. I would recommend a test case where the data set far exceeds the L2 cache capacity, and where the persisted data (1) is accessed multiple times and (2) is no larger than the portion of the cache set aside for persisted data. Please note that non-persisted data (evict normal, evict streaming) can reside in the area of cache reserved for evict last if not enough reads have been performed to fill that area with evict-last cache lines.
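
A minimal sketch of such a test case might look like the following. It assumes a device that supports L2 persistence; the names `fitHitRatio` and `reuseKernel`, the 1 MB set-aside, and the buffer sizes are illustrative choices, not taken from the original workload. The kernel re-reads a small "hot" buffer between chunks of a much larger streaming buffer, and `__ldcg` is used so that the loads are served from L2 rather than L1.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,                 \
                         cudaGetErrorString(err_));                        \
            std::exit(1);                                                  \
        }                                                                  \
    } while (0)

// Hypothetical helper: hit ratio such that the persisting part of the
// window is no larger than the set-aside region (clamped to 1.0).
static float fitHitRatio(size_t setAsideBytes, size_t windowBytes) {
    if (windowBytes == 0) return 0.0f;
    return std::min(1.0f, static_cast<float>(setAsideBytes) /
                              static_cast<float>(windowBytes));
}

// Hypothetical kernel: before each streaming chunk, re-read the hot buffer.
// Each chunk is larger than L2, so hot lines must survive it to hit again.
__global__ void reuseKernel(const float* hot, size_t hotN,
                            const float* stream, size_t streamN,
                            size_t chunkN, float* out) {
    size_t tid = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    float acc = 0.0f;
    for (size_t base = 0; base < streamN; base += chunkN) {
        for (size_t i = tid; i < hotN; i += stride)
            acc += __ldcg(&hot[i]);        // persisting candidate, L2 only
        size_t end = (base + chunkN < streamN) ? base + chunkN : streamN;
        for (size_t i = base + tid; i < end; i += stride)
            acc += __ldcg(&stream[i]);     // streamed once, L2 only
    }
    out[tid] = acc;
}

int main(int argc, char** argv) {
    (void)argv;
    const bool useWindow = (argc < 2);  // pass any argument for the baseline

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (prop.persistingL2CacheMaxSize == 0) {
        std::printf("Device does not support L2 persistence.\n");
        return 0;
    }
    // Assumed sizes: 1 MB set-aside (as in the question), hot data that
    // fits in it, and a streaming buffer of 16x the L2 size.
    const size_t setAside =
        std::min<size_t>(1 << 20, prop.persistingL2CacheMaxSize);
    const size_t hotBytes =
        std::min<size_t>(setAside, prop.accessPolicyMaxWindowSize);
    const size_t chunkBytes = 2 * static_cast<size_t>(prop.l2CacheSize);
    const size_t streamBytes = 8 * chunkBytes;
    const int threads = 256;
    const int blocks = 4 * prop.multiProcessorCount;

    float *hot, *stream, *out;
    CHECK(cudaMalloc(&hot, hotBytes));
    CHECK(cudaMalloc(&stream, streamBytes));
    CHECK(cudaMalloc(&out, sizeof(float) * blocks * threads));
    CHECK(cudaMemset(hot, 0, hotBytes));
    CHECK(cudaMemset(stream, 0, streamBytes));

    cudaStream_t s;
    CHECK(cudaStreamCreate(&s));
    if (useWindow) {
        CHECK(cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside));
        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr = hot;
        attr.accessPolicyWindow.num_bytes = hotBytes;
        attr.accessPolicyWindow.hitRatio = fitHitRatio(setAside, hotBytes);
        attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
        attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
        CHECK(cudaStreamSetAttribute(
            s, cudaStreamAttributeAccessPolicyWindow, &attr));
    }

    reuseKernel<<<blocks, threads, 0, s>>>(
        hot, hotBytes / sizeof(float), stream, streamBytes / sizeof(float),
        chunkBytes / sizeof(float), out);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(s));

    CHECK(cudaCtxResetPersistingL2Cache());  // release persisting lines
    CHECK(cudaStreamDestroy(s));
    CHECK(cudaFree(hot));
    CHECK(cudaFree(stream));
    CHECK(cudaFree(out));
    return 0;
}
```

Profiling `reuseKernel` with Nsight Compute once with the window and once without it (by passing any command-line argument) would let the two L2 hit rates be compared. The size of any difference depends on the GPU and its replacement behaviour, so no particular result is implied here.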