Fermi L2 cache

How fast is the L2 cache? How do I access it?

I need to be able to transfer data between different blocks, but without using global memory because it’s way too slow. On NVIDIA Fermi cards there’s a new L2 cache that spans the whole GPU.
How many clock cycles does it take to access (global memory takes ~400-600 clock cycles)? How do I use this cache?

I’m also interested in whether it’s possible to use this cache in CUDA Fortran with the PGI compiler. I’ve posted a similar question on the PGI user forum. If it’s not possible in Fortran, that’s fine; I can write it in CUDA C and interface it with my existing Fortran program.

Thanks

I’ve not done any measurements myself, but from the documentation I believe the L2 cache has latency similar to global memory and exists to improve bandwidth, not to reduce latency. I might be wrong on this though.

You don’t need any language-specific features; all local and global memory accesses go through the L2 cache by default.

Well, according to the Fermi white paper, the “L2 provides efficient, high speed data sharing across the GPU”, but it doesn’t specify exactly how fast it is, only how big it is (768 KB). I need to figure out how fast it is so I can include this in my research thesis.

So if I declare a variable with the global memory attribute, then by default it will use the L2 cache? Does this mean that if the 768 KB limit is reached, the global memory will spill over to the slower DRAM? What I want to do is use the DRAM for large global memory arrays, and use the L2 cache for small global memory arrays that are accessed more regularly.

The L2 cache is (almost) transparent. The only control you have is through cache operation modifiers in PTX instructions, which allow you to mark the touched cache lines as “evict first”. I don’t know about the PGI compiler, but even Nvidia’s CUDA C does not support this in any other way than through inline PTX.
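
For illustration, a minimal sketch (my naming, not from the posts above) of what such a load looks like from CUDA C via inline PTX; the .cs (“cache streaming”) operator marks the fetched line as evict-first. This assumes a 64-bit build for the “l” pointer constraint:

// Sketch only: load a float with the evict-first (.cs) cache operator.
__device__ float loadEvictFirst(const float *ptr)
{
    float val;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(val) : "l"(ptr));
    return val;
}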

But if I stay within the 768KB limit, then does that mean that only the L2 cache is used?

It depends on the locality of your memory accesses.

The L2 cache serves ALL data accesses.

Suppose you have multiple arrays, A1, A2, A3, …, and you just want to bind A1 to the L2 cache: this is impossible, because the L2 also serves the other arrays, i.e. any access to another array can evict A1’s data from the L2 cache.

I would recommend the texture cache (read-only) if you don’t write data to A1.
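
For example, roughly like this (a sketch using the Fermi-era texture-reference API; texA1, d_A1 and readA1 are illustrative names, not anything from the posts above):

// Read A1 through the read-only texture cache.
texture<float, 1, cudaReadModeElementType> texA1;

__global__ void readA1(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texA1, i);   // fetch served by the texture cache
}

// Host side, before the launch:
//   cudaBindTexture(NULL, texA1, d_A1, n * sizeof(float));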

Well, say I have one large 2-dimensional array that I want to keep in the L2 cache, with read/write access to different parts of it multiple times. Is this impossible to do?

The L2 caching is automatic, so there is nothing you have to do beyond reading and writing the array as you normally would.

Since the L2 is for the entire chip, it is automatically coherent from the perspective of all threads. However, the L1 is per-multiprocessor and almost certainly not coherent (i.e., a multiprocessor won’t snoop another’s L1 cache to find changed memory locations), but if you are writing CUDA code without race conditions, this won’t be a problem. Atomic operations bypass the L1 and go directly to the L2. At the end of the kernel, all the caches are guaranteed to write their contents back to device memory. This means that you can write a location in one kernel and read the same location in the next kernel from a different block.
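
As a minimal sketch of those two points (illustrative names, not anyone’s posted code): the atomic update is resolved in the L2 and is coherent across all blocks, and a second kernel can safely read what the first one wrote because everything is written back when the first kernel ends.

__global__ void countPositives(const float *data, int n, unsigned int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.0f)
        atomicAdd(counter, 1u);          // bypasses L1, resolved in the L2
}

__global__ void useCount(const unsigned int *counter, float *out)
{
    // the caches were flushed when countPositives finished, so every block
    // sees the final counter value here
    out[blockIdx.x] = (float)*counter;
}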

If you are doing something other than that, then you will need to worry more about race conditions, beyond cache consistency. There are memory fence functions that will guarantee writes have been flushed up to different levels in the memory hierarchy, but they are not synchronization tools, so be careful.
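
As a hedged sketch of what a fence gives you (again with made-up names): __threadfence() only orders this thread’s writes as seen by the rest of the device; it does not make any other block wait, so it is ordering, not synchronization.

__device__ float payload;
__device__ volatile int ready = 0;

__global__ void producer()
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        payload = 42.0f;
        __threadfence();   // make payload visible device-wide...
        ready = 1;         // ...before the flag that other blocks may poll
    }
}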

Hello, just in case anyone wants to know: I looked online for the performance of L2 and did not really find anything. I read in some places that it was much faster and in others that it was about the same. So I spent the last half of today playing around with the performance aspects of the L1 and L2 caches (mostly L2). I did about 5 different experiments, and I think the last one (below) is the most accurate and best represents the performance. It might not be exactly right, but hopefully it is a good indication and helpful to some people. Here are my results:

Clock ticks to access…
1184 non-cached (L1 enabled but not used)
1184 non-cached (L1 disabled)
528 L2 cached (L1 disabled)
252 L1 cached (L1 enabled and used)
164 (overhead for timing codes - nothing processed)

same values as above but with the overhead (164) subtracted out…
1020 non-cached (L1 enabled but not used)
1020 non-cached (L1 disabled)
365 L2 cached (L1 disabled)
88 L1 cached (L1 enabled and used)

So, in summary, with my testing it looks like L2 is about 3 times quicker than global memory. Global memory (for my test case on my card) gets me about 1000 clock ticks and L2 gets me about 350 clock ticks.

Notes:

  • Tested on a GTX 450.
  • Each of the times is a little high (50-60 ticks) because the times include a store to shared memory and an address calculation.
  • All tests were done in release mode.
  • For the above tests I used 128 threads X 1 Block.
  • To disable L1 I used “-Xptxas -dlcm=cg”.
  • To test L2 cache I basically turned off L1 cache and pre-read the data and then timed the second read.
  • To test L1 I did the same but enabled L1.
  • To test no cache I timed the first access to the memory (having L1 enabled or disabled made no difference, as expected).

here is the code I used…

__global__ static void timerFunction(float *global, float *output, clock_t *timer)
{
    extern __shared__ float shared[];
    const int tid = threadIdx.x;

    // uncomment these to pre-read the data so the timed read below hits the cache
    //shared[tid] = global[tid];
    //shared[tid] = global[tid];

    __syncthreads(); if (tid == 0) timer[0] = clock(); __syncthreads(); // start timer

    shared[tid] = global[tid];

    __syncthreads(); if (tid == 0) timer[1] = clock(); __syncthreads(); // stop timer

    output[tid] = shared[tid];
}

I made some benchmarks a few months ago. I will share my results soon.

But I have some questions for sunsetquest:

Why did you setup 128 threads? Why not 32 (number of cores / SM) or the total number of cores?
Why did you measure only the first thread’s clocks?
Why did you write back the result to global memory from shared memory? It is enough to store it in shared memory.

Hi Mallee, thank you for your questions. I mostly wanted to determine whether L2 offered a speedup and, if so, roughly how much. I could not find any other tests or info online on whether there was a speedup at all or whether L2 was just there to minimize data on the bus.

Your questions…
Why did you setup 128 threads? Why not 32 (number of cores / SM) or the total number of cores? That is a good point. I cannot remember why I chose this number. It probably would have been better to run it with 32 threads like you suggested, so that there is nothing else happening on the SM besides the one running warp. I re-ran the test and it changed the results slightly: it was just a couple percent slower when running 1024 threads vs 32 threads. BTW – I chose 1024 threads because it occupies 2/3 of an SM. For some reason it would not let me run 1536 threads – I did not investigate this.

Why did you measure only the first thread’s clocks? I really just wanted to time a particular warp. Since all the threads in a warp execute in parallel (0-31, 32-63, …), I don’t think it matters much which thread we time. I was not sure this was correct, so I double-checked it on some different threads (with 1024 threads running) and received similar scores.

Why did you write back the result to global memory from shared memory? It is enough to store it in shared memory. The compiler is smart. It knows that if nothing from shared memory is ultimately saved back to global memory then nothing in shared memory really matters and it optimizes it out by deleting it.

When I re-ran the test I noticed that I was off on the overhead by ~50 clock cycles. This lowers the L2 access time to about 1/4 of the global memory time (instead of 1/3). I corrected it above.

I would be interested in seeing your results so we can compare them.

As I have been interested in the cache access times too, I wanted to check your measurements.

So here are my results, measured on a GeForce GTX 560 Ti and a Quadro 6000, which both gave the same numbers (in cycles, with the overhead subtracted out):

1060 non-cached
248 L2
18 L1

I also tried to access volatile data, which seems to use the L2 cache.

I furthermore measured the time for atomicAdd, which also seems to use the L2 cache.

For this test I increased the number of atomic operations on the same data word by using a modulo operation on the thread index:

1390 atomic non-cached
580 atomic L2
1380 atomic non-cached, 2x conflict
570 atomic L2, 2x conflict
1490 atomic non-cached, 4x conflict
600 atomic L2, 4x conflict
1900 atomic non-cached, 8x conflict
1145 atomic L2, 8x conflict
2980 atomic non-cached, 16x conflict
2198 atomic L2, 16x conflict
4680 atomic non-cached, 32x conflict
3910 atomic L2, 32x conflict

Here is the code if you want to compare it:

__global__ void timerFunction(float *global, clock_t *timings)
{
  __shared__ volatile float shared[32];
  shared[threadIdx.x] = 0;
  __syncthreads();

  clock_t t[2*RUNS];
  for (uint i = 0; i < RUNS; ++i)
  {
    __syncthreads();
    t[2*i] = clock();
    shared[threadIdx.x] = global[threadIdx.x];
    //shared[threadIdx.x] = threadIdx.x; // computing the overhead
    //shared[threadIdx.x] = atomicAdd(global + (threadIdx.x % CONFLICTMODULO), 1.0f); // atomic test
    t[2*i+1] = clock();
    __syncthreads();
  }
  __syncthreads();

  for (uint i = 0; i < 2*RUNS; ++i)
    timings[threadIdx.x*2*RUNS + i] = t[i];
}

greets