Fermi Cache Architecture: cache, write policy, read policy, architecture

  1. Some general questions that we couldn’t find answers for in the documentation:

    • What are the replacement policies in the Fermi L1 and L2 Caches?

    • What are the write policies? If we change a global value in the L1 cache, does it change in L2 and global memory as well, or is the value only marked dirty and the writes flushed later?

    • Is the cache hierarchy inclusive (data in L1 is ALWAYS present in L2), or exclusive as in a unified L1/L2 cache (data in L1 is NEVER in L2)?

  2. We ran the following kernels, which are very simple, because we were interested in the cache hits/misses and read requests.

__global__ void ReadWriteTest1(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id] + 1;
}

__global__ void ReadWriteTest2(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 2] + 1;
}

__global__ void ReadWriteTest3(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 32] + 1;
}

__global__ void ReadWriteTest4(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 4] + results[id * 4 + 1] + 1;
}

__global__ void ReadWriteTest5(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id + 7] + 1;
}
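
A trivial host program along these lines is enough to drive the kernels (a sketch rather than our exact code; shown with a single block of 32 threads, i.e. one warp):

#include <cuda_runtime.h>

int main()
{
  const int threads = 32;                     // one warp
  const int blocks  = 1;
  const int n       = threads * blocks * 32;  // large enough for the id * 32 access in ReadWriteTest3

  int* d_results = 0;
  cudaMalloc((void**)&d_results, n * sizeof(int));
  cudaMemset(d_results, 0, n * sizeof(int));

  // One launch per kernel, synchronizing in between so the profiler
  // attributes its counters to the right kernel.
  ReadWriteTest1<<<blocks, threads>>>(d_results);
  cudaDeviceSynchronize();
  ReadWriteTest2<<<blocks, threads>>>(d_results);
  cudaDeviceSynchronize();
  // ... ReadWriteTest3, ReadWriteTest4 and ReadWriteTest5 are launched the same way.

  cudaFree(d_results);
  return 0;
}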

The results from the profiler were:

(I have attached a PNG of the results below since I couldn’t create a table here)

Kernel         | L1 global load misses | L2 read requests | L2 read misses
ReadWriteTest1 |  1 |   4 | 0
ReadWriteTest2 |  2 |   8 | 4
ReadWriteTest3 | 32 | 128 | 4
ReadWriteTest4 |  4 |  16 | 5
ReadWriteTest5 |  2 |   8 | 8

As you can see, the L1 global load misses are as expected, but:

  • Why are the L2 read misses in Test1 equal to 0? Shouldn't there be a miss when the threads look for the values in L2 after the very first L1 miss?

  • Given that Test2 has to load two 128-byte cache lines, shouldn't the L2 read misses be 8, since there are two cache lines of four 32-byte reads each?

  • Why do we only get 4 misses on Test3 even though we had to load 32 cache lines? (Our line-count arithmetic is sketched after this list.)
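
For reference, here is the cache-line arithmetic we assumed (one warp of 32 threads, 4-byte ints, 128-byte L1/L2 lines, L2 serviced in 32-byte sectors):

Test1: results[id],          indices 0..31  -> 1 line   -> 4 sectors
Test2: results[id * 2],      indices 0..62  -> 2 lines  -> 8 sectors
Test3: results[id * 32],     indices 0..992 -> 32 lines -> 128 sectors
Test4: results[id * 4 (+1)], indices 0..125 -> 4 lines (the second load hits the same lines) -> 16 sectors
Test5: results[id + 7],      indices 7..38  -> 2 lines  -> 8 sectors

This matches the L1 global load miss and L2 read request columns above; it is the L2 read miss column that we cannot explain.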

Your help is greatly appreciated

NVIDIA doesn't share much information on that, so you'll have to find out the replacement policy on your own. Good luck.

I tend to believe L1 writes are buffered, given the presence of the write-through modifier for the st instruction.
What is in L1 should be in L2 as well, because I'm under the impression that L1 and L2 can be non-coherent. I may be wrong.
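
For example, one could compare an ordinary store against an explicit write-through store and look at the timing or the generated SASS. A rough sketch with inline PTX (untested; the kernel and buffer layout are made up, and it assumes a 64-bit build):

__global__ void StorePolicyProbe(int* p, int v)
{
  int i = threadIdx.x;

  // Ordinary store: PTX st with no cache operator, which defaults to .wb (write-back).
  p[i] = v;

  // The same store with the .wt (write-through) cache operator, issued via inline PTX.
  asm volatile("st.global.wt.u32 [%0], %1;"
               :
               : "l"(p + i + 32), "r"(v)
               : "memory");
}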

btw, if you’re going to check the replacement policy on your own, don’t rely on the profiler. Use %clock to measure the latency to see if it’s a miss or a hit. I believe the profiler is not so reliable. It gives funny numbers from time to time.
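
A minimal sketch of that approach (untested, names made up): walk a chain of dependent loads, time the whole walk with clock(), and divide by the iteration count. A single thread is enough, and the dependency between loads keeps them from overlapping:

__global__ void ChaseLatency(const unsigned int* chain, int iterations,
                             unsigned int* avgLatency, unsigned int* sink)
{
  unsigned int j = 0;
  clock_t start = clock();
  for (int i = 0; i < iterations; ++i)
    j = chain[j];                          // each load depends on the previous one
  clock_t stop = clock();

  *sink = j;                               // keep the loads from being optimized away
  *avgLatency = (unsigned int)(stop - start) / iterations;
}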

As hyqneuron wrote, don't rely on the profiler. It samples only a subset of the SMs and memory controllers. And as the memory addresses are hashed (to prevent partition camping), you cannot know which memory controller a memory access goes to (unless you decipher the hash first…).

Timing instructions via %clock is the only reliable source of information I know about. I assume you’ve already read Demystifying GPU Microarchitecture through Microbenchmarking, which demonstrates how much info you can deduce from timing.
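
The host-side setup for that kind of microbenchmark is just a strided pointer chain; sweeping the array size and stride and watching where the average latency jumps reveals cache sizes, line size and so on. A sketch (array contents only, allocation and copy omitted):

// Element i points to element (i + stride) % size, so a kernel that keeps
// doing j = chain[j] walks the array with a fixed stride.
void initChain(unsigned int* hChain, unsigned int size, unsigned int stride)
{
  for (unsigned int i = 0; i < size; ++i)
    hChain[i] = (i + stride) % size;
}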

Thank you so much for the fast reply and the link. We will certainly look at the paper.

Hi,
I am using a Quadro 600 (compute capability 2.1), and performed experiments using ReadWriteTest2, 3, and 4 as mentioned in this topic.

__global__ void ReadWriteTest2(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 2] + 1;
}

__global__ void ReadWriteTest3(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 32] + 1;
}

__global__ void ReadWriteTest4(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 4] + results[id * 4 + 1] + 1;
}

My block size was 256 and the number of CTAs varied as 1, 2, 3, and so on…


ReadWriteTest2:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 8 | 16 | 128 | 121 | 64
2 | 16 | 32 | 256 | 185 | 88
3 | 24 | 48 | 384 | 273 | 140
4 | 32 | 64 | 512 | 305 | 148

ReadWriteTest3:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 8 | 256 | 2048 | 2104 | 1052
2 | 16 | 512 | 4096 | 4120 | 2060
3 | 24 | 768 | 6144 | 6144 | 3068
4 | 32 | 1024 | 8192 | 8200 | 4100

ReadWriteTest4:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 16 | 32 | 256 | 272 | 135
2 | 32 | 64 | 512 | 472 | 231
3 | 48 | 96 | 768 | 636 | 315
4 | 64 | 128 | 1024 | 864 | 431

From the above tables I have the following queries. Can anyone please help in this regard?

Q.1) The line size in L1 and L2 is the same, 128 bytes, so every L2 request should miss: consecutive requests to L2 are 128 bytes apart, so no L2 hit should be found. Why is the L2 miss count nearly equal to the L2 request count, but not exactly equal?

Q.2) Why is the DRAM read count nearly half of the L2 read miss count?

Q.3) What are the latencies of L1 and L2?

You’ll be able to dig out everything on your own, if you are willing to get into all the low-level details. I wrote an assembler for the Fermi ISA which should be enough for you to find out most of the things, as long as you already have some knowledge of the GPU hardware. The link is in my signature.

Of course, if you don’t want to go into the details, you can do some simpler, high-level tests first. For that the paper tera mentioned is a good starting point.

Still, if your interest lasts and you are patient enough, you can wait until a few other collaborators and I get back to the assembler project next January. Then we will reveal as much as we can, and the things mentioned in this thread are certainly on the list.

thanks a lot…

I'm exploring your assembler…

The L2 misses get halved at the DRAM. Is this because each 128-byte transaction gets converted into 64-byte transactions? I captured all of the results with the CUDA Compute Profiler. Is this a reliable source of data?