How to correctly write code to test A100 L2 bandwidth?

Naively:
Find out what the L2 size is on A100.
Divide that in two. Create two buffers, each of that size (half the L2 size, or maybe a little less).
Write a copy kernel that repeatedly copies one buffer to the other (say, for 100 loops/iterations). Use a grid-stride loop, with a grid size chosen to just fill the GPU.
Time it.

Something like this. I ran this on an H100, but it is more-or-less set up for A100 as well. I get about 6.5TB/s L2 bandwidth on H100, which exceeds what is reported here, FWIW. (I don’t happen to know offhand what the A100 or H100 L2 cache bandwidth is. Later: the A100 whitepaper calls out an expectation for L2 cache bandwidth as a ratio to V100 (2.3x), and also states an explicit 5120 bytes/clock.)

# cat l2.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// round down to the largest power of 2
unsigned flp2(unsigned x) {
  x = x | (x >> 1);
  x = x | (x >> 2);
  x = x | (x >> 4);
  x = x | (x >> 8);
  x = x | (x >> 16);
  return x - (x >> 1);
}

unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using mt = unsigned long long;
__global__ void k(mt *d, mt *d2, int len, int lps){
  // grid-stride copy d2 -> d; __ldcg loads bypass the L1 and go to the L2
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i<len; i+=gridDim.x*blockDim.x)
      d[i] = __ldcg(d2+i);
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  const int nTPSM = prop.maxThreadsPerMultiProcessor;
  const int nSM = prop.multiProcessorCount;
  const unsigned l2size = prop.l2CacheSize;
  unsigned sz = flp2(l2size)/2; // buffer size in bytes: approx 1/2 the L2
  sz = sz/sizeof(mt);           // convert to number of elements
  const int nTPB = 512; // block size
  const int nBLK = (nSM*nTPSM)/nTPB;
  const int loops = 100;
  mt *d, *d2;
  cudaMalloc(&d, sz*sizeof(mt));
  cudaMalloc(&d2, sz*sizeof(mt));
  k<<<nBLK, nTPB>>>(d, d2, sz, 1);  // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  k<<<nBLK, nTPB>>>(d, d2, sz, loops);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  std::cout << "bw: " << (sz*2*sizeof(mt)*loops)/(float)dt << "MB/s" << std::endl;
}
# nvcc -o l2 l2.cu
# ./l2
bw: 6.49022e+06MB/s
#

The Citadel group did an analysis of V100 and found 2.1TB/s of L2 cache bandwidth for loads only. You can read that paper to see what a carefully crafted kernel looks like.

The Citadel group also presented their A100 findings at GTC, where they reported a theoretical L2 bandwidth of ~7TB/s (e.g. slide 33). Being theoretical, I would not expect the 7TB/s number to be achievable in an actual test.

I’m not aware of similar work for H100. I would expect a load-only test to run faster than a load/store test like the one I have done, but I’m not sure by how much. Furthermore, on A100 at least, it may be necessary to pay attention to the partitioned A100 L2 cache to observe maximum numbers. I don’t have further details or suggestions to share about that.
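For illustration, a load-only variant of the kernel above might look like the sketch below. This is my own sketch, not from any of the cited work; the kernel name `k_load`, the `sink` pointer, and the guard condition are hypothetical choices. The accumulator plus the guarded store keeps the compiler from dead-code-eliminating the loads; the store itself is almost never taken, so the measured traffic is essentially loads only.

```cuda
using mt = unsigned long long;

// load-only sketch: read d2 through the L2 via __ldcg, fold into acc
__global__ void k_load(const mt *d2, mt *sink, int len, int lps){
  mt acc = 0;
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < len; i += gridDim.x*blockDim.x)
      acc += __ldcg(d2+i);
  // guard store: keeps the loads live; effectively never executes
  if (acc == (mt)-1) *sink = acc;
}
```

With this variant the bandwidth accounting counts only one direction, i.e. sz*sizeof(mt)*loops bytes rather than twice that.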