Fermi Cache Architecture: cache, write policy, read policy, architecture

  1. Some general questions that we couldn’t find answers for in the documentation:

    • What are the replacement policies in the Fermi L1 and L2 Caches?

    • What are the write policies? If we change a global value in the L1 cache, does it change in L2 and global memory as well, or is the value only marked dirty and the writes flushed later?

    • Is the cache hierarchy inclusive (data in L1 is ALWAYS present in L2), or exclusive as in a unified L1/L2 cache (data in L1 is NEVER in L2)?

  2. We ran the following kernels, which are very simple, because we were interested in the cache hits/misses and read requests.

__global__ void ReadWriteTest1(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id] + 1;
}

__global__ void ReadWriteTest2(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 2] + 1;
}

__global__ void ReadWriteTest3(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 32] + 1;
}

__global__ void ReadWriteTest4(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 4] + results[id * 4 + 1] + 1;
}

__global__ void ReadWriteTest5(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id + 7] + 1;
}
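
A trivial host program along these lines is enough to drive the kernels (a sketch rather than our exact code; shown with a single block of 32 threads, i.e. one warp):

#include <cuda_runtime.h>

int main()
{
  const int threads = 32;                     // one warp
  const int blocks  = 1;
  const int n       = threads * blocks * 32;  // large enough for the id * 32 access in ReadWriteTest3

  int* d_results = 0;
  cudaMalloc((void**)&d_results, n * sizeof(int));
  cudaMemset(d_results, 0, n * sizeof(int));

  // One launch per kernel, synchronizing in between so the profiler
  // attributes its counters to the right kernel.
  ReadWriteTest1<<<blocks, threads>>>(d_results);
  cudaDeviceSynchronize();
  ReadWriteTest2<<<blocks, threads>>>(d_results);
  cudaDeviceSynchronize();
  // ... ReadWriteTest3, ReadWriteTest4 and ReadWriteTest5 are launched the same way.

  cudaFree(d_results);
  return 0;
}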

The results from the profiler were:

(I have attached a PNG of the results below since I couldn’t create a table here)

Kernel         | L1 global load misses | L2 read requests | L2 read misses
ReadWriteTest1 |  1 |   4 | 0
ReadWriteTest2 |  2 |   8 | 4
ReadWriteTest3 | 32 | 128 | 4
ReadWriteTest4 |  4 |  16 | 5
ReadWriteTest5 |  2 |   8 | 8

As you can see, the L1 global load misses are as expected, but:

  • Why are the L2 read misses in Test1 equal to 0? Shouldn't there be a miss when the threads look for the values in L2 after the very first L1 miss?

  • Given that Test2 has to load two 128-byte cache lines, shouldn't the L2 read misses be 8, since there are two cache lines of four 32-byte reads each?

  • Why do we only get 4 misses on Test3 even though we had to load 32 cache lines? (Our line-count arithmetic is sketched after this list.)
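
For reference, here is the cache-line arithmetic we assumed (one warp of 32 threads, 4-byte ints, 128-byte L1/L2 lines, L2 serviced in 32-byte sectors):

Test1: results[id],          indices 0..31  -> 1 line   -> 4 sectors
Test2: results[id * 2],      indices 0..62  -> 2 lines  -> 8 sectors
Test3: results[id * 32],     indices 0..992 -> 32 lines -> 128 sectors
Test4: results[id * 4 (+1)], indices 0..125 -> 4 lines (the second load hits the same lines) -> 16 sectors
Test5: results[id + 7],      indices 7..38  -> 2 lines  -> 8 sectors

This matches the L1 global load miss and L2 read request columns above; it is the L2 read miss column that we cannot explain.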

Your help is greatly appreciated

NVIDIA doesn't share much information on that, so you'll have to find out the replacement policy on your own. Good luck.

I tend to believe L1 writes are buffered, given the presence of the write-through modifier for the st instruction.
What is in L1 should be in L2 as well, because I'm under the impression that L1 and L2 can be non-coherent. I may be wrong.
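
For example, one could compare an ordinary store against an explicit write-through store and look at the timing or the generated SASS. A rough sketch with inline PTX (untested; the kernel and buffer layout are made up, and it assumes a 64-bit build):

__global__ void StorePolicyProbe(int* p, int v)
{
  int i = threadIdx.x;

  // Ordinary store: PTX st with no cache operator, which defaults to .wb (write-back).
  p[i] = v;

  // The same store with the .wt (write-through) cache operator, issued via inline PTX.
  asm volatile("st.global.wt.u32 [%0], %1;"
               :
               : "l"(p + i + 32), "r"(v)
               : "memory");
}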

btw, if you’re going to check the replacement policy on your own, don’t rely on the profiler. Use %clock to measure the latency to see if it’s a miss or a hit. I believe the profiler is not so reliable. It gives funny numbers from time to time.
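
A minimal sketch of that approach (untested, names made up): walk a chain of dependent loads, time the whole walk with clock(), and divide by the iteration count. A single thread is enough, and the dependency between loads keeps them from overlapping:

__global__ void ChaseLatency(const unsigned int* chain, int iterations,
                             unsigned int* avgLatency, unsigned int* sink)
{
  unsigned int j = 0;
  clock_t start = clock();
  for (int i = 0; i < iterations; ++i)
    j = chain[j];                          // each load depends on the previous one
  clock_t stop = clock();

  *sink = j;                               // keep the loads from being optimized away
  *avgLatency = (unsigned int)(stop - start) / iterations;
}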

As hyqneuron wrote, don't rely on the profiler. It samples only a subset of the SMs and memory controllers. And as the memory addresses are hashed (to prevent partition camping), you cannot know which memory controller a memory access goes to (unless you decipher the hash first…).

Timing instructions via %clock is the only reliable source of information I know about. I assume you’ve already read Demystifying GPU Microarchitecture through Microbenchmarking, which demonstrates how much info you can deduce from timing.
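
The host-side setup for that kind of microbenchmark is just a strided pointer chain; sweeping the array size and stride and watching where the average latency jumps reveals cache sizes, line size and so on. A sketch (array contents only, allocation and copy omitted):

// Element i points to element (i + stride) % size, so a kernel that keeps
// doing j = chain[j] walks the array with a fixed stride.
void initChain(unsigned int* hChain, unsigned int size, unsigned int stride)
{
  for (unsigned int i = 0; i < size; ++i)
    hChain[i] = (i + stride) % size;
}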

Thank you so much for the fast reply and the link. We will certainly look at the paper.

Hi,
I am using a Quadro 600 (compute capability 2.1), and performed experiments using ReadWriteTest2, 3, and 4 as mentioned in this topic.

__global__ void ReadWriteTest2(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 2] + 1;
}

__global__ void ReadWriteTest3(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 32] + 1;
}

__global__ void ReadWriteTest4(int* results)
{
  int id = threadIdx.x + blockDim.x * blockIdx.x;
  results[id] = results[id * 4] + results[id * 4 + 1] + 1;
}

My block size was 256 and the number of CTAs varied as 1, 2, 3, and so on…


ReadWriteTest2:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 8 | 16 | 128 | 121 | 64
2 | 16 | 32 | 256 | 185 | 88
3 | 24 | 48 | 384 | 273 | 140
4 | 32 | 64 | 512 | 305 | 148

ReadWriteTest3:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 8 | 256 | 2048 | 2104 | 1052
2 | 16 | 512 | 4096 | 4120 | 2060
3 | 24 | 768 | 6144 | 6144 | 3068
4 | 32 | 1024 | 8192 | 8200 | 4100

ReadWriteTest4:

CTA (SM) | gld req (SM) | L1 gld miss (SM) | L2 read req (FB) | L2 read miss (FB) | DRAM read (FB)
1 | 16 | 32 | 256 | 272 | 135
2 | 32 | 64 | 512 | 472 | 231
3 | 48 | 96 | 768 | 636 | 315
4 | 64 | 128 | 1024 | 864 | 431

From the above tables I have the following queries. Can anyone please help in this regard?

Q.1) The line size in L1 and L2 is the same, 128 bytes, so every L2 request should miss: consecutive requests to L2 are 128 bytes apart, so no L2 hit should be found. Why is the L2 miss count nearly equal to the L2 request count, but not exactly equal?

Q.2) Why is the DRAM read count nearly half of the L2 read miss count?

Q.3) What are the latencies of L1 and L2?

You’ll be able to dig out everything on your own, if you are willing to get into all the low-level details. I wrote an assembler for the Fermi ISA which should be enough for you to find out most of the things, as long as you already have some knowledge of the GPU hardware. The link is in my signature.

Of course, if you don’t want to go into the details, you can do some simpler, high-level tests first. For that the paper tera mentioned is a good starting point.

Still, if your interest lasts and you are patient enough, you can wait until a few other collaborators and I get back to the assembler project next January. Then we will reveal as much as we can, and the things mentioned in this thread are certainly on the list.

thanks a lot…

I'm exploring your assembler…

The L2 misses get halved at the DRAM. Is this because each 128-byte transaction gets converted into 64-byte transactions? I captured all of the results with the CUDA Compute Profiler. Is this a reliable source of data?