How to correctly write code to test A100 L2 bandwidth?

I have written code to test A100 L2 bandwidth: 128 threads per block, 10240 blocks in total, and each thread reads one float4, so all threads together access 20MB. The cache is already warmed up.
However, Nsight Compute reports only 2.1TB/s from L2 to SM.
How do I write correct code to test A100 L2 bandwidth, or where can I find a routine that does this?
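
Roughly, the kernel looks like this (a simplified sketch of what I described; the buffer names and the guard at the end, which only exists to keep the compiler from removing the loads, are illustrative):

// sketch of the test described above: 10240 blocks x 128 threads,
// each thread reads one float4, so 10240*128*16 bytes = 20MB in total
__global__ void l2_read_test(const float4 *in, float *out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float4 v = in[i];                 // one 16-byte load per thread
  float s = v.x + v.y + v.z + v.w;
  if (s == -1.0f) *out = s;         // never true for my data; defeats dead-code elimination
}

// launched (after a warm-up pass) as:
// l2_read_test<<<10240, 128>>>(d_in, d_out);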

naively:
Find out what the L2 size is on A100.
Divide that in two. Create two buffers, each of that size (half the size of the L2, or maybe a little less).
Write a copy kernel that repeatedly copies one buffer to the other (maybe for 100 loops/iterations, or something like that). Use a grid-stride loop, with a grid size chosen to just fill the GPU.
Time it.

Something like this. I ran this on an H100, but it is more or less set up for A100 also. I get about 6.5TB/s L2 bandwidth on H100, which exceeds what is reported here, FWIW. (I don’t happen to know what the A100 or H100 L2 cache bandwidth is offhand. Later: the A100 white paper calls out expectations for the L2 cache bandwidth as a ratio to V100 (2.3x), as well as stating an explicit 5120 bytes/clock.)
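
As a rough sanity check on that 5120 bytes/clock figure (assuming the L2 runs at something close to the 1410 MHz boost clock, which may not be exactly the right clock to use):

5120 bytes/clock × ~1.41 GHz ≈ 7.2 TB/s

which is in the same ballpark as the ~7TB/s theoretical number mentioned further down.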

# cat l2.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

// round x down to the largest power of 2 that is <= x
unsigned flp2(unsigned x) {
  x = x | (x>>1);
  x = x | (x>>2);
  x = x | (x>>4);
  x = x | (x>>8);
  x = x | (x>>16);
  return x - (x>>1);
}

// wall-clock microseconds elapsed since 'start' (pass 0 to get a timestamp)
unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

using mt = unsigned long long;
// grid-stride copy d2 -> d, repeated lps times; __ldcg bypasses the L1 on loads,
// so reads of the (L2-resident) buffer are served by the L2
__global__ void k(mt *d, mt *d2, int len, int lps){

  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i<len; i+=gridDim.x*blockDim.x)
      d[i] = __ldcg(d2+i);
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  const int nTPSM = prop.maxThreadsPerMultiProcessor;
  const int nSM = prop.multiProcessorCount;
  const unsigned l2size = prop.l2CacheSize;
  unsigned sz = flp2(l2size)/2;
  sz = sz/sizeof(mt);  // approx 1/2 the size of the L2
  const int nTPB = 512; // block size
  const int nBLK = (nSM*nTPSM)/nTPB;  // enough blocks to just fill the GPU
  const int loops = 100;
  mt *d, *d2;
  cudaMalloc(&d, sz*sizeof(mt));
  cudaMalloc(&d2, sz*sizeof(mt));
  k<<<nBLK, nTPB>>>(d, d2, sz, 1);  // warm-up
  cudaDeviceSynchronize();
  unsigned long long dt = dtime_usec(0);
  k<<<nBLK, nTPB>>>(d, d2, sz, loops);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  // sz elements read + sz elements written, each sizeof(mt) bytes, loops times;
  // dt is in microseconds, so bytes/microsecond == MB/s
  std::cout << "bw: " << (sz*2*sizeof(mt)*loops)/(float)dt << "MB/s" << std::endl;
}
# nvcc -o l2 l2.cu
# ./l2
bw: 6.49022e+06MB/s
#
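
FWIW, if you want to cross-check the number with Nsight Compute, something along these lines should report the L2-slice (LTS) read/write sector traffic for the kernel. I’m quoting the metric names from memory, so verify them against ncu --query-metrics on your install:

# ncu --kernel-name k --metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum ./l2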

The Citadel group did an analysis of V100 and found 2.1TB/s L2 cache bandwidth for loads only. You can read that paper to see what a carefully crafted kernel looks like.

The Citadel group also presented at GTC their findings on A100, where they reported a theoretical bandwidth on L2 of ~7TB/s (e.g. slide 33). I would not expect the 7TB number, being theoretical, to be achievable in an actual test.

I’m not aware of similar work done for H100. I would expect a load-only test to run faster than a load/store test like I have done, but I’m not sure by how much. Furthermore, on A100 at least, it may be necessary to pay attention to the partitioned A100 L2 cache to observe maximum numbers. I don’t have further details or suggestions to share about that.
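
If someone wants to try the load-only direction, here is a sketch of how the kernel above might be modified (untested; the single store per thread at the end is only there so the compiler keeps the loads, and sink needs at least grid-size × block-size elements):

// load-only variant of k(): __ldcg reads of the L2-resident buffer,
// accumulated and stored once per thread so the loads stay live
__global__ void k_load_only(mt *sink, const mt *d2, int len, int lps){
  mt acc = 0;
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x+blockDim.x*blockIdx.x; i < len; i += gridDim.x*blockDim.x)
      acc += __ldcg(d2+i);
  sink[threadIdx.x+blockDim.x*blockIdx.x] = acc;
}

The bandwidth calculation would then count only sz*sizeof(mt)*loops bytes, since the store traffic is negligible.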

This is my result, oops, only 2.7TB/s. I just copied the code and ran it. The device is an A100-SXM4. I ran nvidia-smi and made sure nothing else was running on the GPU.


Did you find any solution for this? I ran into this problem too, but only on A100.
I think it may be related to the A100 L2 architecture, where the cache is divided into two partitions, but then it is strange that the test works as expected on H100. I wanted to check the L2 read and write throughput, but I don’t have access to CUPTI on the server I’m using, so I can’t really check.
Do you have any clue? @Robert_Crovella
Actually, I see that A100 L2 bandwidth should be around 2.3x that of V100, so around 5.5 TB/s (what I measure with this test is around half of that).
I also tried it on an H100 (PCIe) and got around 5.8-6 TB/s, which I think is correct, but let me know what you think.

That’s my guess also. For clarity, to match the usage in the A100 white paper, I would refer to these not as two partitions but two banks. Yes, the L2 cache is partitioned. But in A100 it also seems to have another hierarchical breakdown referred to as banks, called out separately from the partitioning and crossbar.

If the observation is related to the L2 banks in A100, then it does not seem weird to me. I’ve read the V100, A100, and H100 whitepapers, and the only one I see that refers to having two L2 banks is the A100.

I’m not sure what that means, exactly, but if you can’t use a profiler, then I would say your ability to do certain kinds of exploration in CUDA may be limited.

Thinking that the banks might be significant, and noting that inter-bank bandwidth utilization is reported for A100 by the Nsight Compute profiler, I tried to refactor the above “naive” code to avoid transferring data between banks, in case that happens to be a bottleneck of sorts. (I don’t know that it is, just guessing here.) My first cut at the problem was to refactor the code so that each block did its own little copy operation, with each block working in a different patch of L2. I tried addressing arrangements per SM that were ordered by SM ID, as well as an even/odd arrangement.

That was basically unsuccessful - no improvement in bandwidth. So my next step, if time permits, will be to try experiments running block-level copies on each SM, play with the addressing patterns, and see if I can use the profiler (perhaps via a metric) to tell me how much traffic is flowing through the “bank connector”. In doing so, I may discover addressing patterns that allow each block to do a copy without sending traffic over the “bank connector”. If that doesn’t pay off, then I am out of ideas. Time is also a factor in what I can do here, so if I have not responded after this, it means I haven’t made further meaningful progress, and I will probably not respond to additional queries along the lines of “any clue?” or “any update?”.
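
For reference, the kind of block-level experiment I have in mind looks something like this sketch (untested, and the SM-ID-based addressing is just one of the arrangements mentioned above, not a known-good recipe), using the same mt type as the code above:

// each block asks the hardware which SM it is running on
__device__ unsigned my_smid(){
  unsigned id;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
  return id;
}

// block-level copy: launched with one block per SM (e.g. <<<nSM, 1024>>>),
// each block copies only its own patch of the L2-resident buffers,
// with the patch location derived from the SM ID ("ordered by SM ID")
__global__ void k_per_sm(mt *d, mt *d2, int patch_len, int lps){
  const size_t base = (size_t)my_smid() * patch_len;
  for (int l = 0; l < lps; l++)
    for (int i = threadIdx.x; i < patch_len; i += blockDim.x)
      d[base+i] = __ldcg(d2+base+i);
}

An even/odd variant would just compute base differently, e.g. ((smid>>1) + (smid&1)*(nSM/2)) * (size_t)patch_len with patch_len = sz/nSM, and the idea would then be to watch the inter-bank traffic reported by the profiler while varying the mapping.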


Can you tell me where you found this information? I was checking the whitepapers too, but I only found this term used in the MIG section of both the A100 and H100 papers, so I am a bit confused; they didn’t mention that there are just two banks, though. I only saw references to two partitions, physically separated, in both A100 and H100 (you can also see them in the figure showing the full GPU with its logical components).

Anyway, thank you so much for your help and time

In the A100 white paper, the following text appears:

The L2 cache is divided into two partitions to enable higher bandwidth and lower latency memory access. Each L2 partition localizes and caches data for memory accesses from SMs in the GPCs directly connected to the partition.

There is similar text in the H100 whitepaper, although it doesn’t mention the word two.

So I may be off-base here. Just to be clear, I don’t have a recipe for the A100 and I cannot explain observed differences in V100, A100, H100 L2 bandwidth measurements.
