Set persisting area on L2 cache

Hello, I have recently been studying ways to control the L2 cache on a GPU, in particular how to set the “persisting area” of the L2 cache, also known as the “set-aside” portion. AFAIK, data that falls within the access policy window in global memory can only use the set-aside portion of the L2 cache. A statement in the NVIDIA documentation [cuda-c-programming-guide] reinforced this understanding.

The statement:
"With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window in the set-aside L2 cache area. Since the set-aside area is smaller than the window, cache lines will be evicted to keep the most recently used 16KB of the 32KB data in the set-aside portion of the L2 cache."
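For reference, the scenario in the quoted statement (a 32 KB window, a 16 KB set-aside area, hitRatio 1.0) can be configured roughly as in the sketch below. The helper names `clampBytes` and `configurePersistingL2` are made up for illustration. This is not the code used in the experiment, and it assumes a device of compute capability 8.0 or higher. Note that a stream attribute holds a single access policy window, so the sketch covers one contiguous address range only.

```cuda
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: clamp a requested byte count to a device limit.
static size_t clampBytes(size_t requested, size_t limit) {
    return std::min(requested, limit);
}

// Hypothetical helper: reserve a set-aside region of L2 and mark
// [ptr, ptr + bytes) as persisting for kernels launched into `stream`.
static cudaError_t configurePersistingL2(cudaStream_t stream, void* ptr,
                                         size_t bytes, size_t setAsideBytes,
                                         float hitRatio) {
    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    // The set-aside size is bounded by persistingL2CacheMaxSize.
    size_t setAside = clampBytes(
        setAsideBytes, static_cast<size_t>(prop.persistingL2CacheMaxSize));
    err = cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, setAside);
    if (err != cudaSuccess) return err;

    // The window size is bounded by accessPolicyMaxWindowSize.
    cudaStreamAttrValue attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = clampBytes(
        bytes, static_cast<size_t>(prop.accessPolicyMaxWindowSize));
    attr.accessPolicyWindow.hitRatio  = hitRatio;  // fraction of the window treated as hitProp
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    return cudaStreamSetAttribute(
        stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const size_t windowBytes   = 32 * 1024;  // 32 KB window, as in the quoted statement
    const size_t setAsideBytes = 16 * 1024;  // 16 KB set-aside, smaller than the window
    void* data = nullptr;
    cudaMalloc(&data, windowBytes);

    cudaError_t err =
        configurePersistingL2(stream, data, windowBytes, setAsideBytes, 1.0f);
    std::printf("configure: %s\n", cudaGetErrorString(err));

    // Kernels that read `data` would be launched into `stream` here.

    // Cleanup: return persisting lines to normal status and free resources.
    cudaCtxResetPersistingL2Cache();
    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Only kernels launched into the stream that carries the attribute are affected; work submitted to other streams keeps the default caching behaviour.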

However, the experiment I ran suggested otherwise. I implemented one of the convolution kernels from DenseNet121. My expectation was that if I placed the input, weight, and output matrices inside the access policy window and made the set-aside size very small, the kernel duration would increase. The results, shown below, do not show that.

I'm using an RTX 3090, which has 6 MB of L2 cache, and I set the input batch size to 32 so that the L2 cache and the SMs on the device are fully saturated.

Kernel duration (without L2 cache control): 3.169 ms
Kernel duration (with set-aside size 3 MB): 3.188 ms
Kernel duration (with set-aside size 0.005 MB): 3.177 ms

I expected the kernel duration to be lowest without L2 cache control, since that configuration can use the whole L2 cache, and second lowest with a 3 MB set-aside size. However, there seems to be no real difference between the three cases. I would really appreciate any help.

Thank you in advance

The differences between the kernel durations are very small (about 0.6% at most). Perhaps the kernel is compute-bound rather than memory-bound?
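One rough way to check this on paper is a roofline-style estimate: compare the kernel's arithmetic intensity (FLOPs per byte of DRAM traffic) with the device's ridge point (peak FLOP/s divided by peak bandwidth). The sketch below is host-only code and rests on several assumptions. The thread does not say which DenseNet121 convolution was measured, so the layer shape is a placeholder (a 3x3 convolution, 128 input channels, 32 output channels, 56x56 feature map, batch 32). The peak figures are nominal RTX 3090 spec values, not measurements. The byte count is a lower bound that touches each FP32 tensor exactly once.

```cuda
#include <cstdio>

// Placeholder layer description (assumed, not taken from the thread).
struct ConvShape {
    long long n, cin, cout, h, w, k;  // batch, channels in/out, output H/W, kernel size
};

// FLOPs of a direct convolution, counting one multiply-add as 2 FLOPs.
static double convFlops(const ConvShape& s) {
    return 2.0 * s.n * s.h * s.w * s.cout * s.cin * s.k * s.k;
}

// Minimum DRAM traffic in bytes if every FP32 tensor is touched exactly once.
// Assumes stride 1 with "same" padding, so input and output share H and W.
static double convMinBytes(const ConvShape& s) {
    double input  = static_cast<double>(s.n) * s.cin * s.h * s.w;
    double weight = static_cast<double>(s.cout) * s.cin * s.k * s.k;
    double output = static_cast<double>(s.n) * s.cout * s.h * s.w;
    return 4.0 * (input + weight + output);
}

// FLOPs per byte; because the byte count is a lower bound, this is an upper bound.
static double arithmeticIntensity(const ConvShape& s) {
    return convFlops(s) / convMinBytes(s);
}

// Intensity above which a kernel can be limited by compute rather than bandwidth.
static double ridgePoint(double peakFlopsPerSec, double peakBytesPerSec) {
    return peakFlopsPerSec / peakBytesPerSec;
}

int main() {
    const ConvShape shape{32, 128, 32, 56, 56, 3};  // assumed layer
    const double peakFlops = 35.6e12;  // nominal FP32 peak, RTX 3090 (assumed)
    const double peakBw    = 936e9;    // nominal DRAM bandwidth, RTX 3090 (assumed)

    const double ai    = arithmeticIntensity(shape);
    const double ridge = ridgePoint(peakFlops, peakBw);
    std::printf("arithmetic intensity: %.1f FLOP/byte\n", ai);
    std::printf("ridge point:          %.1f FLOP/byte\n", ridge);
    std::printf("estimate: %s\n", ai > ridge ? "compute-bound" : "memory-bound");
    return 0;
}
```

Under these assumptions the intensity comes out well above the ridge point, which would be consistent with L2 settings having little effect. The estimate is optimistic, since real kernels move more data than the minimum, so a profiler such as Nsight Compute is the more reliable way to see whether memory or compute throughput is the limiter.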
