Tesla K40 L2 bandwidth

I have an application where L2 reuse can potentially speed up the execution. However, I see that the code runs faster when the L2 hit rate is lower (I have a knob that can vary the L2 hit rate).

So I decided to run a microbenchmark measuring the L2 and device memory bandwidth on my Tesla K40 GPU.

Here's the small test:

#include <stdio.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>

#define SIZE (1024*1024)

__global__ void withl2(int *a, int *b) {
	int tid = threadIdx.x;
	int tid2 = blockIdx.x*blockDim.x + threadIdx.x;
	a[tid2] = b[tid];
}

__global__ void withoutl2(int *a, int *b) {
	int tid = blockIdx.x*blockDim.x + threadIdx.x;
	a[tid] = b[tid];
}

int main() {
	int *a, *b;
	checkCudaErrors(cudaMalloc(&a, SIZE*sizeof(int)));
	checkCudaErrors(cudaMalloc(&b, SIZE*sizeof(int)));
	int threads = 128;
	int blocks = SIZE/128;
		
	withl2<<<blocks,threads>>>(a, b);
	checkCudaErrors(cudaDeviceSynchronize());
	withoutl2<<<blocks,threads>>>(a, b);
	checkCudaErrors(cudaDeviceSynchronize());
		
	return 0;
}

In the kernel withl2, different threadblocks all access the same section of array b (elements 0 to 127). Hence, for almost all of the blocks, every access to b should be found in the L2 cache.

In the kernel withoutl2, each threadblock accesses a different section of the array. Hence, there should be no L2 hits at all.

I ensured that the total number of accesses is large by choosing a very high total thread count. The L1 cache is not involved, since on Kepler global loads are not cached in L1 by default. ECC is disabled for the GPU, which gives the maximum possible device memory bandwidth.
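For reference, this is roughly how the ECC state can be checked and changed with nvidia-smi (assuming GPU index 0; the new mode only takes effect after a GPU reset or reboot):

> nvidia-smi -i 0 -q -d ECC      # query the current and pending ECC mode
> nvidia-smi -i 0 -e 0           # request ECC disabled (pending until the next reset/reboot)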

Here are the results:

Kernel                      withl2                     withoutl2

L2 hit rate (reads)         99.99%                     0%
Execution time (us)         45                         38
Bandwidth achieved (read)   83 GB/s (L2 throughput)    99.66 GB/s (device memory throughput)

This is really surprising: I would expect the withl2 kernel to achieve higher bandwidth, since all of its accesses hit in the L2. Can someone please provide an explanation?

Also, is there a reference quoting the L2 bandwidth numbers for the K40 architecture?

Thanks in advance!

It doesn’t quite answer your question, but if you want to see the effect of L1 on loads instead of just L2, then you might want to read up on “-Xptxas=-dlcm=ca” and the other cache modifier options.

Here’s your benchmark run 1000 times on a Maxwell GM204:

> nvcc -arch sm_52 l2.cu
> nvprof --print-gpu-summary a
Time(%)      Time     Calls       Avg       Min       Max  Name
 47.02%  55.448ms      1000  55.447us  55.360us  55.872us  withL2(int*, int*)    <-- FAST
 35.45%  41.801ms      1000  41.801us  40.737us  43.616us  withoutL2(int*, int*) <-- FASTER
 17.53%  20.676ms      1000  20.676us  20.160us  21.376us  withLDG(int*, int*)   <-- FASTEST (L1)

… and now with “-dlcm=ca”:

> nvcc -Xptxas=-dlcm=ca -arch sm_52 l2.cu
> nvprof --print-gpu-summary a
Time(%)      Time     Calls       Avg       Min       Max  Name
 50.23%  41.787ms      1000  41.787us  40.640us  43.488us  withoutL2(int*, int*) <-- FAST
 24.89%  20.705ms      1000  20.705us  20.192us  21.504us  withLDG(int*, int*)   <-- FASTEST (L1)
 24.88%  20.694ms      1000  20.693us  20.224us  21.345us  withL2(int*, int*)    <-- FASTEST (L1)

You might see the expected L2 benefit if your microbenchmark performed more than one load and/or widened the span of the redundant load.

FWIW, increasing the grid size and widening the span of the redundant loads shows that withL2() is slightly faster than withoutL2(), but… microbenchmarking is tricky, so I’m not claiming these changes reveal anything useful. :)

Thanks!

Although you are using a different architecture (Maxwell), you seem to get the same result, where the L2 looks slower than device memory.

About the L1 cache: I understand the results would be better if I turned it on, and I did try that for the microbenchmark. However, in my real application, the L1 benefit disappears because of a lot of other memory accesses, as well as accesses that aren't 128-byte aligned.

So I still need some reasoning as to why the L2 bandwidth comes out lower.

The L2 has an unpublished (AFAIK) internal structure that does not provide peak bandwidth from any given cell/line. In order to observe peak bandwidth from the L2, the line requests must be distributed, to some degree.

In your posted code, change 128 to 512 in these lines:

int threads = 128;
int blocks = SIZE/128;

and rerun your tests. I believe the results will be instructive. Alternatively, change this line in your withl2 kernel:

a[tid2] = b[tid];

to this:

a[tid2] = b[0];

to witness movement in “the other direction”.

Further exposure of the underlying structure could be arrived at through microbenchmarking, and it could possibly vary from GPU type to GPU type. Such microbenchmarking may already be available in the literature; I don't know for sure.
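To make that concrete, here is a minimal sketch of such a microbenchmark (the kernel and names are mine, not from the code above): it sweeps the span of the redundant read at run time and times each span with CUDA events, so footprints that fit in the L2 can be compared against ones that overflow it without recompiling.

#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE (1024*1024)

// Every thread writes its own element of a, but the read from b is wrapped
// to a span of (mask+1) ints, so the read footprint is (mask+1)*4 bytes.
__global__ void spanRead(int *a, int *b, int mask) {
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  a[tid] = b[tid & mask];
}

int main() {
  int *a, *b;
  cudaMalloc(&a, SIZE*sizeof(int));
  cudaMalloc(&b, SIZE*sizeof(int));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int shift = 7; shift <= 19; shift++) {  // 512 B ... 2 MB read footprint
    int mask = (1 << shift) - 1;
    spanRead<<<SIZE/128, 128>>>(a, b, mask);   // untimed warm-up launch
    cudaEventRecord(start);
    for (int i = 0; i < 100; i++)
      spanRead<<<SIZE/128, 128>>>(a, b, mask);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("span %8d bytes : %7.3f ms per 100 launches\n",
           (1 << shift) * (int)sizeof(int), ms);
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(a);
  cudaFree(b);
  return 0;
}

On a K40 one would expect the time per launch to step up once the read footprint exceeds the 1.5 MB L2.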

@rasputin456, unfortunately I don’t have a Kepler GPU in my rig.

The Kepler and Maxwell Tuning Guides describe the L1/L2 default behavior and how it differs from Fermi.

My GM204 device reports 2MB of L2.

As @txbob hints, a 128-int span (512 bytes, i.e. only four 128-byte cache lines) in L2 is not very distributed.

512 bytes will fit in L1 though.

You can increase SPAN from (1<<7) to (1<<19) to see how the runtime varies (using nvprof --print-gpu-summary):

#include <stdio.h>
#include <cuda_runtime.h>

#define SIZE    (1*1024*1024)
#define THREADS 128
#define SPAN    (1<<18) // 2^18 ints * 4 bytes = 1 MB (half of the 2 MB L2)

__global__ void withoutL2(int *a, int *b) {
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  for (int ii=0; ii<10; ii++)
    a[tid] = b[tid];
}

__global__ void withL2(int *a, int *b) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int ii=0; ii<10; ii++)
    a[tid] = b[tid & (SPAN-1)];
}

__global__ void withLDG(int *a, int *b) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  for (int ii=0; ii<10; ii++)
    a[tid] = __ldg(&b[tid & (SPAN-1)]);
}

int main() {
  int *a, *b;
  cudaMalloc(&a, SIZE*sizeof(int));
  cudaMalloc(&b, SIZE*sizeof(int));

  int threads = THREADS;
  int blocks  = SIZE/THREADS;
		
  for (int ii=0; ii<1000; ii++)
    withoutL2<<<blocks,threads>>>(a, b);
  cudaDeviceSynchronize();

  for (int ii=0; ii<1000; ii++)
    withL2<<<blocks,threads>>>(a, b);
  cudaDeviceSynchronize();

  for (int ii=0; ii<1000; ii++)
    withLDG<<<blocks,threads>>>(a, b);
  cudaDeviceSynchronize();

  cudaDeviceReset();

  return 0;
}

Thanks txbob! Choosing a threadblock size of 512 started showing improved L2 bandwidth.

@txbob, allanmac - I am wondering how, even if I understand the internal structure through microbenchmarking, I could make effective use of it in my program. In my application, multiple threadblocks read from the same large array. This array is too large to fit in L1, but it can fit in the L2. However, the order in which the threadblocks read from this array is not predetermined. How can I obtain a good distribution in the L2 then?

The L2 on a GK110 is 1.5 MB or 384K 32-bit words. That’s really small.

If the array is less than 1.5 MB then it’s relatively small/fast compared to the 288 GB/sec. of bandwidth you have available.

In another thread in this forum we were micro-benchmarking the impact of the TLB on the global memory load path.

The L2 was found to be very fast:

Metric Name          Metric Description      Min         Max         Avg
l2_read_throughput   L2 Throughput (Reads)   288.25GB/s  520.31GB/s  419.66GB/s

For example, if you randomly load 400 MB of segments all within the same ~2MB region you reach 390 GB/sec on a GM204. If you read 4GB of segments the load throughput rises to 500 GB/sec. It seems to top out at around 521 GB/sec on my GTX 980.

That’s ~2x the throughput of device memory.

But… this was a micro-benchmark and I think there are few real-world kernels that could exploit the L2 cache to an observable extreme (anyone?).

Another suggestion: if the array in your app is read-only for the lifetime of the kernel, then you might want to try using the LDG read-only cache fast path and see if that is performant enough for your use case.
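In case it helps, here is a minimal illustrative sketch (not your kernels) of the two usual ways to get onto that path on sm_35 and later: calling __ldg() explicitly, or qualifying the pointers with const and __restrict__ so the compiler can generate the read-only loads on its own.

// Explicit: route the read through the read-only (texture/LDG) cache.
__global__ void copyExplicitLDG(int *out, const int *in, int n) {
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < n)
    out[tid] = __ldg(&in[tid]);
}

// Implicit: const + __restrict__ lets the compiler prove the data is
// read-only and not aliased, so it can emit LDG loads by itself.
__global__ void copyConstRestrict(int * __restrict__ out,
                                  const int * __restrict__ in, int n) {
  int tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < n)
    out[tid] = in[tid];
}

Both variants need to be compiled for at least -arch=sm_35.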

Also, note that your modified test is already exercising the L2:

> nvprof -m l2_read_throughput a.exe
Invocations                         Metric Name    Metric Description     Min         Max         Avg
Device "GeForce GTX 980 (0)"
        Kernel: withLDG(int*, int*)
       1000                   l2_read_throughput   L2 Throughput (Reads)  39.008GB/s  42.154GB/s  41.536GB/s
        Kernel: withL2(int*, int*)
       1000                   l2_read_throughput   L2 Throughput (Reads)  371.50GB/s  401.03GB/s  396.46GB/s
        Kernel: withoutL2(int*, int*)
       1000                   l2_read_throughput   L2 Throughput (Reads)  366.04GB/s  391.64GB/s  387.62GB/s

The withLDG() kernel still “wins” even though it doesn’t generate much L2 activity:

Time(%)      Time     Calls       Avg       Min       Max  Name
 34.62%  102.21ms      1000  102.21us  101.41us  103.23us  withoutL2(int*, int*)
 33.98%  100.32ms      1000  100.32us  99.457us  101.67us  withL2(int*, int*)
 31.39%  92.679ms      1000  92.679us  91.521us  94.241us  withLDG(int*, int*) <-- FASTEST

But the difference between all three is under 5%. :)

@allanmac : If you drop the inner loop (the ii loop) in the kernels, the bandwidth drops by 3x and the hit rate goes to 0%. My application has different threadblocks accessing overlapping memory locations, all of which can fit in the L2 cache. Hence, I am testing a microbenchmark where the locality being exploited is across blocks. In the example you provided, that's not the case.

I am already using the LDG path for reading a different array, and reading this array through LDG as well leads to a slowdown.

Ah, I see your point. 0% is troubling.

Maybe some of these can help you figure it out:

> nvprof --devices 0 --query-metrics | grep l2
            l2_read_transactions:  Memory read transactions seen at L2 cache for all read requests
           l2_write_transactions:  Memory write transactions seen at L2 cache for all write requests
            l2_tex_read_hit_rate:  Hit rate at L2 cache for all read requests from texture cache
           l2_tex_write_hit_rate:  Hit Rate at L2 cache for all write requests from texture cache
          l2_tex_read_throughput:  Memory read throughput seen at L2 cache for read requests from the texture cache
         l2_tex_write_throughput:  Memory write throughput seen at L2 cache for write requests from the texture cache
        l2_tex_read_transactions:  Memory read transactions seen at L2 cache for read requests from the texture cache
       l2_tex_write_transactions:  Memory write transactions seen at L2 cache for write requests from the texture cache
              l2_read_throughput:  Memory read throughput seen at L2 cache for all read requests
             l2_write_throughput:  Memory write throughput seen at L2 cache for all write requests
                  l2_utilization:  The utilization level of the L2 cache relative to the peak utilization
            l2_atomic_throughput:  Memory read throughput seen at L2 cache for atomic and reduction requests
          l2_atomic_transactions:  Memory read transactions seen at L2 cache for atomic and reduction requests

One other thought… if your grid is not larger than ~500-1000 warps, then I wonder if there is only a limited opportunity to prime the L2 on a large GPU like the K40.

An initial wave of 15 multiprocessors, each executing up to 64 resident warps, could immediately place 500-1000 128-byte loads in flight.

Without knowing exactly how the GK110B L2 is implemented, I can envision most of the transactions in that first wave missing the L2 and being enqueued in the six memory controllers.

Good point, allanmac: regardless of the internal architecture, all warps of the initial wave will suffer the latency of the global memory accesses that populate the cache. So the problem size is too small to show the benefit of the L2 on a large GPU like the K40.

On a related note, in the benchmark as posted the second kernel will benefit from the cache already populated by the first. Warming the cache or increasing the problem size would address that as well.
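A minimal way to do that warm-up in a harness like the one posted above (sketch only, using the names from that code) is to launch each kernel once, untimed, before its measured loop:

  withL2<<<blocks,threads>>>(a, b);   // untimed launch to populate the L2
  cudaDeviceSynchronize();
  for (int ii=0; ii<1000; ii++)       // the measured launches now start from a warm cache
    withL2<<<blocks,threads>>>(a, b);
  cudaDeviceSynchronize();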

The smallest grid I run has 240 warps. However, each warp executes a significantly large workload, and in each of its 30 iterations it accesses this large array that can fit in the L2 cache. Hence, I think cache warming will automatically be taking place.

I am currently microbenchmarking to find out the internal structure of L2. I haven’t learnt much though :)

I think allanmac is on to something, and your performance may be limited by the way such a small grid interacts with the memory controllers rather than the cache. If memory serves, conventional wisdom says you would want to launch at least twenty “waves” of thread blocks of four warps each to maximize the performance of the memory subsystem. So a grid would ideally be no smaller than #SMs x 4 warps x 20 for optimal memory performance.
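For concreteness, assuming the K40's 15 SMXs, that rule of thumb works out to roughly:

15 SMs x 4 warps x 20 waves  = 1200 warps
1200 warps x 32 threads/warp = 38,400 threads (e.g. 300 blocks of 128 threads)

which is considerably more than the 240-warp grid mentioned above.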

It is possible that on the Maxwell architecture there is better memory subsystem efficiency for small grids, but I have yet to try it. The general design space for a dual-level cache hierarchy is large, so without some solid hints as to the basic approach chosen for Kepler I do not see how running a few microbenchmarks is going to clarify the behavior (but I’d be happy to be proven wrong). Unfortunately NVIDIA is not very forthcoming with relevant information, preferring to keep a tight lid on the microarchitecture of their GPUs.