Problem with L2 cache hit rate on A800

Hey, I was testing the L2 cache hit rate on my A800 80GB, but I find I can't get a ~100% hit rate.
My test code is as follows:

#include <stdio.h>
#include <iostream>

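template <typename T>
__global__ void k(volatile T * __restrict__ d1, const int loops, const int ds){

  T tmp0;  // receives each load; never used afterwards
  for (int i = 0; i < loops; i++)
    // grid-stride loop over the first ds elements
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if(i&1) tmp0 = d1[j];  // loads happen only on odd iterations of i
}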
// 1 GiB (allocation size in bytes)
const int dsize = 1048576*1024;
const int iter = 1024;
int main(){

  int *d;
  cudaMalloc(&d, dsize);

  // case 2: 5M*4B = 20MiB read, should fit in L2 cache on A800

  int csize = 5*1048576;
  k<<<1024, 1024>>>(d, iter, csize);
  cudaDeviceSynchronize();
}

The L2 cache size is 40 MiB on the A800 80GB, and the kernel repeatedly reads 20 MiB of data.
Then I used ncu to analyze the L2 cache hit rate. The output is as follows:

==PROF== Profiling "k" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 969498
[969498] a.out@127.0.0.1
  void k<int>(volatile T1 *, int, int) (1024, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
    Section: Command line profiler metrics
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    lts__t_sector_hit_rate.pct           %        72.46
    -------------------------- ----------- ------------

I repeated the test on a 4060 Ti GPU with 32 MiB of L2 cache, and the L2 hit rate went up to 99%.

My question: I repeatedly read 20 MiB of data, which is half the L2 cache size, so why is the L2 hit rate only 72.46%? Does the A800 have something special in its L2 cache architecture?

The A800 (similar to the A100) has an L2 cache divided into two halves.

IIRC, if far memory (memory from the other half) is accessed, the cache entry is copied first into one L2 half, then into the other.

So depending on the memory access pattern, you effectively could have only 20 MiB of L2 cache.
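To put numbers on that, here is a minimal sketch that compares the working set from the question with the L2 size the runtime reports. The halving is only the assumption stated in this post, not something the code verifies, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  // Device index 0 is an assumption; adjust for multi-GPU systems.
  cudaError_t err = cudaGetDeviceProperties(&prop, 0);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  const double MiB = 1024.0 * 1024.0;
  const double l2_total = prop.l2CacheSize / MiB;  // l2CacheSize is reported in bytes
  const double l2_half = l2_total / 2.0;           // assumption from the post: two halves
  const double working_set = 5.0 * 1048576 * sizeof(int) / MiB;  // csize from the question

  std::printf("%s: L2 total %.1f MiB, assumed effective %.1f MiB, working set %.1f MiB\n",
              prop.name, l2_total, l2_half, working_set);
  std::printf("working set %s the assumed effective capacity\n",
              working_set < l2_half ? "is below" : "reaches or exceeds");
  return 0;
}
```

If the reported L2 size is 40 MiB, a 20 MiB working set sits exactly at the assumed 20 MiB limit, so it would not fit with any margin.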

Hey, I changed the data size from 20 MiB to 10 MiB, but the L2 cache hit rate is still ~73%. It only went up by one percentage point.

Could you tell me where you got that information?

You are reporting only the L2 hit percentage, not the overall amount of memory that was read through it.
The kernel you showed writes the value it reads into a local variable and never uses it afterwards.

Could it be that the optimizer removes most/all of the accesses and what remains is just a few KiB being read (e.g. the program code, the program parameters)?

You could add up all the values read in the for loop and write back the result. Or compare the sum to a function parameter and write only if they are identical. (Then there is no write in practice, but the optimizer still cannot remove the calculation.)
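A minimal sketch of that suggestion follows. The names `k_sum`, `out`, and `expected` are hypothetical and not from the original code. The buffer is zeroed so every per-thread sum is 0, and `expected` is set to 1 so the store never actually happens. Unlike the original kernel, this variant loads on every iteration of `i`, and `volatile` is kept only to match the original signature.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of k: every load feeds a running sum, so its value is used.
template <typename T>
__global__ void k_sum(volatile T *d1, T *out, T expected, int loops, int ds) {
  T sum = 0;
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x + blockDim.x * blockIdx.x; j < ds; j += gridDim.x * blockDim.x)
      sum += d1[j];
  // 'expected' is only known at run time, so the compiler must keep the loads.
  if (sum == expected) *out = sum;
}

int main() {
  const int csize = 5 * 1048576;  // 5 Mi elements * 4 B = 20 MiB, as in the question
  const int iter = 1024;
  unsigned int *d = nullptr, *out = nullptr;

  if (cudaMalloc(&d, csize * sizeof(unsigned int)) != cudaSuccess ||
      cudaMalloc(&out, sizeof(unsigned int)) != cudaSuccess) {
    std::printf("cudaMalloc failed\n");
    return 1;
  }
  cudaMemset(d, 0, csize * sizeof(unsigned int));  // all zeros, so every sum is 0
  cudaMemset(out, 0, sizeof(unsigned int));

  // expected = 1 never matches a sum of 0, so no thread writes to out.
  k_sum<unsigned int><<<1024, 1024>>>(d, out, 1u, iter, csize);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Bytes the kernel requests in total: iter passes over the csize-element buffer.
  const double gib = double(iter) * csize * sizeof(unsigned int) / (1024.0 * 1024.0 * 1024.0);
  std::printf("requested reads: %.1f GiB\n", gib);

  cudaFree(d);
  cudaFree(out);
  return 0;
}
```

The printed total can be compared with a read-sector count from ncu (for example `lts__t_sectors_op_read.sum`, assuming your ncu version exposes it, at 32 bytes per sector) to check whether the loads really reach L2. Note that `cudaMemset` touches the buffer before the kernel runs, which may leave part of it in L2 already.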