Hello, I have recently been studying the L2 cache and trying to make better use of it.
AFAIK, if we set the base pointer and window size and set hitRatio to 1, accesses to that device memory region can only use a limited L2 cache area (referred to as the set-aside area in the NVIDIA documentation).
Below is the statement from the NVIDIA CUDA programming documentation:
- "With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window in the set-aside L2 cache area. Since the set-aside area is smaller than the window, cache lines will be evicted to keep the most recently used 16KB of the 32KB data in the set-aside portion of the L2 cache."
However, when I profile the kernel "implicit_convolve_sgemm" with Nsight Compute, I get the same L2 cache hit ratio whether or not the L2 cache is limited. For the limited case I set the set-aside L2 area to 1MB (I'm currently using an RTX 3090, which has 6MB of L2 cache) and put the entire workload (input, weight, output) in the window, so I expected a lower L2 hit ratio.
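For reference, the kind of setup described above can be sketched roughly as follows. This is only a minimal sketch under assumptions, not the exact code used here: it assumes a single contiguous buffer, and `d_buf`, `bufBytes` (8MB placeholder), and the `check` helper are hypothetical names. The kernel of interest would have to be launched in `stream` for the window to apply.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a CUDA error and signal failure.
static bool check(cudaError_t e, const char* what) {
    if (e != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(e));
        return false;
    }
    return true;
}

int main() {
    const size_t setAsideBytes = 1u << 20;  // 1MB set-aside, as in the test above
    const size_t bufBytes      = 8u << 20;  // placeholder workload size (assumption)

    cudaDeviceProp prop{};
    if (!check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) return 1;

    // Reserve part of L2 for persisting accesses, clamped to the device maximum.
    size_t setAside = std::min(setAsideBytes,
                               static_cast<size_t>(prop.persistingL2CacheMaxSize));
    if (!check(cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside),
               "cudaDeviceSetLimit")) return 1;

    void* d_buf = nullptr;  // stands in for input + weight + output (assumption)
    if (!check(cudaMalloc(&d_buf, bufBytes), "cudaMalloc")) return 1;

    cudaStream_t stream;
    if (!check(cudaStreamCreate(&stream), "cudaStreamCreate")) return 1;

    // Describe the access policy window: base pointer, size, and hit ratio.
    cudaStreamAttrValue attr{};
    attr.accessPolicyWindow.base_ptr  = d_buf;
    attr.accessPolicyWindow.num_bytes =
        std::min(bufBytes, static_cast<size_t>(prop.accessPolicyMaxWindowSize));
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // try to persist the whole window
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    if (!check(cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                                      &attr),
               "cudaStreamSetAttribute")) return 1;

    // Kernels launched in `stream` from here on are subject to the window.

    std::printf("L2 total: %d bytes, set-aside requested: %zu bytes\n",
                prop.l2CacheSize, setAside);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    return 0;
}
```

Note that the sketch clamps both the set-aside size and the window size to the limits the device reports (`persistingL2CacheMaxSize` and `accessPolicyMaxWindowSize`), so the values actually applied may be smaller than the ones requested.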
Below are the results.
When the L2 cache is not limited, L2 hit ratio: 97.77%
When the L2 cache is limited, L2 hit ratio: 97.78%
It seems that either accesses to the data in the window are using the entire L2 cache, or Nsight Compute is not aware of the L2 cache limitation.
Can someone please explain?
Thank you in advance!