But in Figure 9, it looks like the GPU L2 and CPU L3 are directly connected to the memory controllers, while the “4MB system cache” sits in parallel with the core caches and is also directly connected to the DRAM.
Could someone confirm what exactly the cache hierarchy is?
So the system cache should be an “L4” cache. Figure 2 is correct, and it would be great if Figure 9 could be improved so people like me don’t get confused.
Meanwhile, I did some profiling in Nsight Compute using the Rodinia-3.1 compute benchmark suite with various data sizes. I noticed that L2 writes always hit. I ran the same workloads on a Tesla V100, a 2060 and a 3070. None of them shows the same behavior. Could this potentially be related to the system cache?
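
To take Rodinia out of the picture, a minimal write-only kernel along these lines might be enough to check the same L2 write hit rate. This is only a sketch, not code from the benchmark suite: the kernel name `write_only`, the 256 MiB buffer size and the launch configuration are my own assumptions, with the buffer chosen to be much larger than both the GPU L2 and the 4MB system cache.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical write-only kernel: every thread only stores, it never loads.
// A grid-stride loop lets a fixed launch size cover the whole buffer.
__global__ void write_only(float *out, size_t n, float value) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = value;
    }
}

int main() {
    // Assumed size: 256 MiB, far larger than the GPU L2 and the 4MB system cache.
    const size_t bytes = (size_t)256 << 20;
    const size_t n = bytes / sizeof(float);

    float *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Assumed launch configuration; any reasonable grid should behave the same.
    write_only<<<1024, 256>>>(d_out, n, 1.0f);

    // Catch both launch errors and errors raised while the kernel runs.
    err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    cudaFree(d_out);
    printf("wrote %zu floats\n", n);
    return 0;
}
```

Building this with `nvcc` and profiling it in Nsight Compute on each GPU should show whether the L2 write hit rate in the Memory Workload Analysis section differs between devices even for a streaming store pattern with no reuse.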