Problem about L2 cache hit rate in A800

Rookie_programmer · May 14, 2024, 4:52am

Hey, I was testing the L2 cache hit rate on my A800 80GB, but I find I can't get a ~100% hit rate.
My test code is as follows:

#include <stdio.h>
#include <iostream>

template <typename T>

__global__ void k(volatile T * __restrict__ d1, const int loops, const int ds){

  T tmp0;
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x+blockDim.x*blockIdx.x; j < ds; j += gridDim.x*blockDim.x)
      if(i&1) tmp0 = d1[j];
}
// 1G
const int dsize = 1048576*1024;
const int iter = 1024;
int main(){

  int *d;
  cudaMalloc(&d, dsize);

  // case 2: 5M*4B = 20MB copy, should fit in L2 cache on A800

  int csize = 5*1048576;
  k<<<1024, 1024>>>(d, iter, csize);
  cudaDeviceSynchronize();
}

The L2 cache size is 40 MiB on the A800 80GB, and I tried to repeatedly read 20 MiB of data in the kernel.
Then I used the ncu command to analyze the L2 cache hit rate. The output is as follows:

==PROF== Profiling "k" - 0: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 969498
[969498] a.out@127.0.0.1
  void k<int>(volatile T1 *, int, int) (1024, 1, 1)x(1024, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
    Warning: Data collection happened without fixed GPU frequencies. Profiling results may be inconsistent.
    Section: Command line profiler metrics
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    lts__t_sector_hit_rate.pct           %        72.46
    -------------------------- ----------- ------------

I repeated my test on a 4060 Ti GPU with a 32 MiB L2 cache, and there the L2 cache hit rate goes up to 99%.

My question is: I repeatedly read 20 MiB of data, which is half of the L2 cache size, so why is the L2 hit rate only 72.46%? Does the A800 GPU have something special about its L2 cache architecture?

Curefab · May 14, 2024, 10:06am

The A800 (similar to the A100) has an L2 cache divided into two halves.

IIRC, if far memory (memory belonging to the other half) is accessed, the cache entry is copied first into one L2 half, then into the other.

So depending on the memory access pattern, you effectively could have only 20 MiB of L2 cache.
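
As a rough sanity check of that arithmetic, the sketch below reads the L2 size the CUDA runtime reports (`cudaDeviceProp::l2CacheSize`) and compares a working set against half of it. The halving is only the assumption made in this reply, not something the API reports, and the helper names `l2_bytes` and `fits_in_half_l2` are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Returns the L2 cache size in bytes reported by the runtime for `device`,
// or 0 if the query fails. (Hypothetical helper name.)
size_t l2_bytes(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
  return static_cast<size_t>(prop.l2CacheSize);
}

// Assumption taken from the reply above: in the worst case only half of the
// reported L2 is effectively usable. This is not reported by any API.
bool fits_in_half_l2(size_t working_set_bytes, size_t l2_total_bytes) {
  return working_set_bytes <= l2_total_bytes / 2;
}
```

Under that assumption, a 20 MiB working set on a 40 MiB L2 sits exactly at the limit, so `fits_in_half_l2(20u << 20, 40u << 20)` is true with no headroom left.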

Rookie_programmer · May 14, 2024, 10:30am

Hey, I changed the data size from 20 MiB to 10 MiB, but the L2 cache hit rate is still ~73%. It only went up by one percentage point.

Could you tell me where you got that information?

Curefab · May 14, 2024, 10:49am

You are reporting only the L2 hit percentage, not the overall amount of memory that was read through it.
The kernel you showed writes the value it reads into a local variable without using it further.
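
To judge whether the reads actually happened, it can help to know how much traffic the posted kernel should generate and compare that with the L2 read traffic the profiler reports. The host-only sketch below does that arithmetic, assuming 32-byte L2 sectors and assuming every load reaches L2. The function names are hypothetical.

```cuda
#include <cstdint>

// The posted kernel only loads when (i & 1) is true, i.e. for odd i in
// [0, loops), which is loops / 2 iterations.
constexpr std::uint64_t reading_iters(std::uint64_t loops) { return loops / 2; }

// Bytes requested over the whole launch if no load is optimized away.
constexpr std::uint64_t expected_bytes(std::uint64_t elems,
                                       std::uint64_t elem_size,
                                       std::uint64_t loops) {
  return elems * elem_size * reading_iters(loops);
}

// Assumption: L2 traffic is counted in 32-byte sectors.
constexpr std::uint64_t expected_sectors(std::uint64_t bytes) {
  return bytes / 32;
}
```

For the values in the question (5 * 1048576 elements of 4 bytes, 1024 loops) this gives 10,737,418,240 bytes, or 335,544,320 sectors. A measured read volume far below that would support the idea that most accesses were removed.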

Could it be that the optimizer removes most/all of the accesses and what remains is just a few KiB being read (e.g. the program code, the program parameters)?

You could add up all the read values in the for loop and write back the result. Or compare the sum to a function parameter and only write if they are identical. (Then there is no write in practice, but the optimizer still cannot remove the calculation.)
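
A minimal sketch of that suggestion, based on the kernel from the question. The parameters `expected` and `out`, the kernel name `k_sum`, and the host helper `run_probe` are hypothetical additions for illustration. The `volatile` qualifier and the `i & 1` condition are kept from the original kernel, and the buffer is zero-filled so the per-thread sums are well defined.

```cuda
#include <cuda_runtime.h>

// Each thread accumulates every value it loads and stores the sum only if it
// equals `expected`, so the loads feed a result the compiler must keep.
template <typename T>
__global__ void k_sum(volatile T *d1, const int loops, const int ds,
                      const T expected, T *out) {
  T sum = 0;
  for (int i = 0; i < loops; i++)
    for (int j = threadIdx.x + blockDim.x * blockIdx.x; j < ds;
         j += gridDim.x * blockDim.x)
      if (i & 1) sum += d1[j];
  // With `expected` chosen so that it never matches, this store never runs,
  // but the compiler cannot prove that at compile time.
  if (sum == expected) *out = sum;
}

// Hypothetical host helper: zero-fills `elems` ints, runs k_sum, and returns
// the value left in the output word, which starts out as `sentinel`.
int run_probe(int elems, int loops, int expected, int sentinel) {
  int *d = nullptr, *out = nullptr;
  cudaMalloc(&d, sizeof(int) * elems);
  cudaMemset(d, 0, sizeof(int) * elems);  // every per-thread sum becomes 0
  cudaMalloc(&out, sizeof(int));
  cudaMemcpy(out, &sentinel, sizeof(int), cudaMemcpyHostToDevice);
  k_sum<int><<<1024, 1024>>>(d, loops, elems, expected, out);
  cudaDeviceSynchronize();
  int result = 0;
  cudaMemcpy(&result, out, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d);
  cudaFree(out);
  return result;
}
```

Calling `run_probe(5 * 1048576, 1024, -1, 7)` from `main` reproduces the 20 MiB case: all sums are 0, so nothing matches -1 and the output word keeps its sentinel. The launch can then be profiled with ncu as before.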
