Some general questions that we couldn’t find answers for in the documentation:
What are the replacement policies in the Fermi L1 and L2 Caches?
What are the write policies? If we change a global value in the L1 cache, is it updated in L2 and global memory right away, or is the line only marked dirty and the write flushed later?
Is the cache hierarchy inclusive (whatever is in L1 is ALWAYS present in L2) or exclusive (whatever is in L1 is NEVER in L2)?
We ran the following kernels, which are very simple, because we were interested in the cache hits/misses and read requests.
(I have attached a PNG of the results below since I couldn’t create a table here)
kernel            l1 global load miss   l2 read requests   l2 read misses
ReadWriteTest1                      1                  4                0
ReadWriteTest2                      2                  8                4
ReadWriteTest3                     32                128                4
ReadWriteTest4                      4                 16                5
ReadWriteTest5                      2                  8                8
As you can see, the l1 global load misses are as expected, but:
Why are the L2 read misses in Test1 equal to 0? Shouldn't there be a miss when the threads go looking for the values in L2 after the very first L1 miss?
Given that Test2 has to load two 128-byte cache lines, shouldn't the L2 read misses be 8, since there are two cache lines of four 32-byte reads each?
Why do we only get 4 misses in Test3 even though we had to load 32 cache lines?
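To make it concrete, the kernels are along these lines (a simplified sketch, not the exact code used to produce the numbers above):

// Hypothetical sketch of a ReadWriteTest-style kernel: each thread reads one
// float from global memory and writes it back out. With strideInFloats = 32,
// consecutive threads touch different 128-byte cache lines; with
// strideInFloats = 1, a whole warp (32 threads x 4 bytes) fits in one line.
__global__ void readWriteTestSketch(const float *in, float *out, int strideInFloats)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * strideInFloats];
}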
NVIDIA doesn't share much information on that, so you'll have to find out the replacement policy on your own. Good luck.
I tend to believe L1 writes are buffered, because of the presence of the write-through modifier for the st instruction.
What is in L1 should be in L2 as well, because I'm under the impression that L1 and L2 can be non-coherent. I may be wrong.
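The modifiers I mean are the PTX cache operators on ld/st, something like this (inline PTX for sm_20, assuming a 64-bit build; just an illustration of the operators, not a statement about what the hardware actually does with them):

// st.global.wb is the default write-back policy; st.global.wt requests write-through.
// ld.global.ca is the default (cache in L1 and L2); ld.global.cg caches in L2 only.
__device__ void storeWriteThrough(float *p, float v)
{
    asm volatile("st.global.wt.f32 [%0], %1;" :: "l"(p), "f"(v) : "memory");
}

__device__ float loadL2Only(const float *p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p) : "memory");
    return v;
}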
Btw, if you're going to check the replacement policy on your own, don't rely on the profiler. Use %clock to measure the latency to see whether an access is a hit or a miss. I believe the profiler is not so reliable; it gives funny numbers from time to time.
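Something along these lines, for example (a rough sketch of the usual pointer-chasing approach, launched with a single thread; next[] is filled on the host so that each element holds the index of the next element to load):

// Measure load latency with %clock (clock() in CUDA C reads %clock).
// Every load depends on the previous one, so the loads cannot overlap and
// (stop - start) / iters approximates the latency of one load: a large value
// means a miss, a small one a hit. A real microbenchmark would warm the cache
// first and check the generated SASS to make sure the compiler did not move
// the clock() reads.
__global__ void loadLatency(const unsigned int *next, unsigned int *out,
                            unsigned int *cyclesPerLoad, int iters)
{
    unsigned int j = 0;
    clock_t start = clock();
    for (int i = 0; i < iters; ++i)
        j = next[j];                      // dependent chain of loads
    clock_t stop = clock();
    *out = j;                             // keep the chain from being optimized away
    *cyclesPerLoad = (unsigned int)(stop - start) / iters;
}

The stride you use when filling next[] decides which lines and sets get touched, so varying it lets you probe line size, associativity and replacement.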
As hyqneuron wrote, don't rely on the profiler. It samples only a subset of the SMs and memory controllers. And as the memory addresses are hashed (to prevent partition camping), you cannot know which memory controller a memory access goes to (unless you decipher the hash first…).
From the above table I have the following queries. Can anyone please help in this regard?
Q.1) The line size in L1 and L2 is the same, 128 bytes. All L2 requests should miss, since consecutive requests to L2 are 128 bytes apart, so no L2 hit should be found. Why is the L2 miss count nearly equal to the L2 request count, but not exactly equal?
Q.2) Why is the DRAM read request count nearly half of the L2 read miss count?
Q.3) What are the latencies of L1 and L2?
readWriteTest.pdf (20.4 KB)
You’ll be able to dig out everything on your own, if you are willing to get into all the low-level details. I wrote an assembler for the Fermi ISA which should be enough for you to find out most of the things, as long as you already have some knowledge of the GPU hardware. The link is in my signature.
Of course, if you don't want to go into the details, you can do some simpler, high-level tests first. For that, the paper tera mentioned is a good starting point.
Still, if your interest lasts and you are patient enough, you can wait until a few collaborators and I get back to the assembler project next January. Then we will reveal as much as we can, and the things mentioned in this thread are certainly on the list.
The L2 misses get halved at DRAM. Is this because the 128-byte transactions get converted into 64-byte ones? I captured all the results with the CUDA Compute Profiler. Is it a reliable source of data?