Hey, I was testing the L2 cache hit rate on my A800 80GB, but I find I can't get a ~100% hit rate.
My test code is as follows:
#include <stdio.h>
#include <iostream>

// Grid-stride read over the first `ds` elements, repeated `loops` times.
// `volatile` keeps the compiler from optimizing the loads away.
template <typename T>
__global__ void k(volatile T * __restrict__ d1, const int loops, const int ds){
  T tmp0;
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if(i&1) tmp0 = d1[j];  // loads are issued only on odd iterations
}

// 1 GiB allocation (size in bytes)
const int dsize = 1048576*1024;
const int iter = 1024;

int main(){
  int *d;
  cudaMalloc(&d, dsize);
  // case 2: 5Mi ints * 4 B = 20 MiB working set, should fit in L2 cache on A800
  int csize = 5*1048576;
  k<<<1024, 1024>>>(d, iter, csize);
  cudaDeviceSynchronize();
}
The L2 cache size is 40 MiB on the A800 80GB, and the kernel repeatedly reads the same 20 MiB of data.
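As a sanity check on the ~100% expectation, the arithmetic can be sketched in host-side C++ (it should also compile unchanged inside a .cu file). This is only a rough model, under two assumptions: each 32-byte L2 sector misses exactly once (the cold miss on the first reading pass) and hits on every later pass, and nothing else competes for L2. The helper names (`working_set_bytes`, `sectors_per_pass`, `reading_passes`, `ideal_hit_rate_pct`) are made up for illustration and are not part of the test program.

```cpp
#include <cstdint>

// L2 sector size in bytes; lts__t_sector_hit_rate.pct is counted per sector.
constexpr std::uint64_t kSectorBytes = 32;

// Bytes touched by one full pass over `elems` elements of `elem_size` bytes.
constexpr std::uint64_t working_set_bytes(std::uint64_t elems, std::uint64_t elem_size) {
    return elems * elem_size;
}

// Number of 32-byte sectors covered by one pass over `bytes` bytes
// (assumes `bytes` is a multiple of the sector size).
constexpr std::uint64_t sectors_per_pass(std::uint64_t bytes) {
    return bytes / kSectorBytes;
}

// Outer-loop iterations i in [0, loops) with (i & 1) set, i.e. the passes
// in which the kernel actually issues loads.
constexpr std::uint64_t reading_passes(std::uint64_t loops) {
    return loops / 2;
}

// Best-case sector hit rate in percent, ASSUMING only the first reading
// pass misses and all later passes hit. Requires loops >= 2.
constexpr double ideal_hit_rate_pct(std::uint64_t loops) {
    return 100.0 * (reading_passes(loops) - 1) / reading_passes(loops);
}
```

With the values from the test (5*1048576 ints of 4 bytes, 1024 loops), this gives a 20 MiB working set (half of the 40 MiB L2), 655360 sectors per pass, 512 reading passes, and a best-case hit rate of about 99.8% under the stated assumptions.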
Then I used ncu to analyze the L2 cache hit rate. The output is as follows:
==PROF== Profiling "k" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 969498
[969498] a.out@127.0.0.1
void k<int>(volatile T1 *, int, int) (1024, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
Section: Command line profiler metrics
-------------------------- ----------- ------------
Metric Name                Metric Unit Metric Value
-------------------------- ----------- ------------
lts__t_sector_hit_rate.pct           %        72.46
-------------------------- ----------- ------------
I repeated the test on a 4060 Ti GPU with a 32 MiB L2 cache, and there the L2 cache hit rate reached up to 99%.
My question: I repeatedly read 20 MiB of data, which is half of the L2 cache size, so why is the L2 hit rate only 72.46%? Does the A800 GPU have something special about its L2 cache architecture?