How to correctly write code to test A100 L2 bandwidth?

I have written code to test A100 L2 bandwidth: 128 threads per block, 10240 blocks in total, each thread reads a float4, and all threads together access 20MB. The cache is already warmed up.
However, Nsight Compute reports only 2.1TB/s from L2 to SM.
I wonder how to write correct code to test A100 L2 bandwidth, or where I can find a routine.

Find out what the L2 size is on A100.
Divide that in two. Create two buffers each of which is that size (half the size of the L2, or maybe a little less).
Write a copy kernel that repeatedly copies one buffer to the other (maybe for 100 loops/iterations, or something like that). Use a grid-stride loop, with a grid size chosen to just fill the GPU.
Time it.

Something like this. I ran this on a H100, but it is more-or-less set up for A100 also. I get about 6.5TB/s L2 bw on H100, and that exceeds what is reported here FWIW. (I don’t happen to know what the A100 or H100 L2 cache bandwidth is offhand.)

# cat
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// find the largest power of 2 not greater than x
unsigned flp2(unsigned x) {
  x = x | (x>>1);
  x = x | (x>>2);
  x = x | (x>>4);
  x = x | (x>>8);
  x = x | (x>>16);
  return x - (x>>1);
}

// elapsed microseconds since `start`
unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using mt = unsigned long long;

// grid-stride copy kernel; __ldcg loads bypass the L1 cache
__global__ void k(mt *d, mt *d2, int len, int lps){
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < len; i += gridDim.x*blockDim.x)
      d[i] = __ldcg(d2+i);
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  const int nTPSM = prop.maxThreadsPerMultiProcessor;
  const int nSM = prop.multiProcessorCount;
  const unsigned l2size = prop.l2CacheSize;
  unsigned sz = flp2(l2size)/2;
  sz = sz/sizeof(mt);  // each buffer is approx. 1/2 the size of the L2
  const int nTPB = 512; // block size
  const int nBLK = (nSM*nTPSM)/nTPB; // grid size chosen to just fill the GPU
  const int loops = 100;
  mt *d, *d2;
  cudaMalloc(&d, sz*sizeof(mt));
  cudaMalloc(&d2, sz*sizeof(mt));
  k<<<nBLK, nTPB>>>(d, d2, sz, 1);  // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  k<<<nBLK, nTPB>>>(d, d2, sz, loops);
  cudaDeviceSynchronize();  // launches are asynchronous; wait before stopping the timer
  dt = dtime_usec(dt);
  // 2x: each element is read once and written once per loop
  std::cout << "bw: " << (sz*2*sizeof(mt)*loops)/(float)dt << "MB/s" << std::endl;
  return 0;
}
# nvcc -o l2
# ./l2
bw: 6.49022e+06MB/s

The Citadel group did an analysis of V100 and found 2.1TB/s of L2 cache bandwidth for loads only. You can read that paper to see what a carefully crafted kernel looks like.

The Citadel group also presented at GTC their findings on A100, where they reported a theoretical bandwidth on L2 of ~7TB/s (e.g. slide 33). I would not expect the 7TB number, being theoretical, to be achievable in an actual test.

I’m not aware of similar work done for H100. I would expect a load-only test to run faster than a load/store test like I have done, but I’m not sure by how much. Furthermore, on A100 at least, it may be necessary to pay attention to the partitioned A100 L2 cache to observe maximum numbers. I don’t have further details or suggestions to share about that.
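To illustrate what a load-only variant might look like, here is an untested sketch based on the copy kernel above (this is my own construction, not the Citadel kernel; `k_load` and the `sink` pointer are hypothetical names). The bandwidth arithmetic for such a test would count each byte once (sz*sizeof(mt)*loops) rather than twice.

```cuda
// Untested sketch of a load-only L2 test kernel (assumed names, not from any paper).
// Each thread accumulates the values it reads instead of storing them, so the
// L2 traffic consists of reads only. The conditional store of the accumulator
// keeps the compiler from eliminating the loads as dead code.
using mt = unsigned long long;
__global__ void k_load(const mt *d2, mt *sink, int len, int lps){
  mt acc = 0;
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < len; i += gridDim.x*blockDim.x)
      acc += __ldcg(d2+i);  // L1-bypassing load, as in the copy kernel
  if (acc == (mt)-1) *sink = acc;  // effectively never taken; defeats dead-code elimination
}
```

The same grid-stride launch configuration and timing harness as the copy test should apply unchanged.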

This is my result, oops, only 2.7TB/s. I just copied the code and ran it. The device is an A100-SXM4. I ran nvidia-smi and made sure no other process was using the GPU.