K20 vs C2075 L1 cache

I have an odd problem and it is impacting performance.

I wrote a simple convolution kernel, and whenever I run it on a C2075 I get about a 93% L1 cache hit rate, which is great. But when I run it on my K20, that suddenly drops to 0%, which is not good at all. Is there any reason why this would be the case? In all the kernels that I have run on the K20, I have never seen any L1 usage, only L2. Is there some mechanism that I can use to control this behavior?

thnx

Also, I should say that I changed nothing other than the card…

Much to the surprise of many of us, this is how compute capability 3.x devices function. Global memory reads are no longer cached in L1:

https://devtalk.nvidia.com/default/topic/540044/cuda-programming-and-performance/question-on-the-l1-caching-of-the-gk-110/post/3785428/#3785428

On compute capability 3.5 (like the K20), you do have access to a read-only data cache, which you can use explicitly with the __ldg() function or implicitly by marking your pointers with the appropriate const and __restrict__ qualifiers. See the CUDA C Programming Guide ver. 5.5 for details.
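To make the two options concrete, here is a minimal sketch rather than the original poster's kernel. The 1D box filter, the RADIUS constant, and the names conv1d_restrict, conv1d_ldg and conv1d_max_error are all made up for illustration. It assumes a device of compute capability 3.5 or higher and a matching nvcc target, since __ldg() is not available below that. Whether the compiler actually emits read-only-cache loads for the qualifier-only variant is its decision, so check the profiler.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

#define RADIUS 3  // hypothetical filter half-width, chosen for illustration

// Variant 1 (implicit): const + __restrict__ tells the compiler the inputs are
// read-only and not aliased, so it MAY route the loads through the read-only cache.
__global__ void conv1d_restrict(const float* __restrict__ in,
                                const float* __restrict__ taps,
                                float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int j = min(max(i + k, 0), n - 1);  // clamp indices at the borders
        acc += in[j] * taps[k + RADIUS];
    }
    out[i] = acc;
}

// Variant 2 (explicit): __ldg() requests the read-only data cache for each load.
__global__ void conv1d_ldg(const float* in, const float* taps, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int j = min(max(i + k, 0), n - 1);
        acc += __ldg(&in[j]) * __ldg(&taps[k + RADIUS]);
    }
    out[i] = acc;
}

// Host helper: runs one variant and returns the largest absolute difference
// from a CPU reference, or -1 if a CUDA call reported an error.
float conv1d_max_error(int n, bool use_ldg)
{
    const int ntaps = 2 * RADIUS + 1;
    std::vector<float> h_in(n), h_out(n), h_taps(ntaps, 1.0f / ntaps);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 17);

    float *d_in = nullptr, *d_taps = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_taps, ntaps * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_taps, h_taps.data(), ntaps * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    if (use_ldg) conv1d_ldg<<<blocks, threads>>>(d_in, d_taps, d_out, n);
    else         conv1d_restrict<<<blocks, threads>>>(d_in, d_taps, d_out, n);

    cudaError_t launch_err = cudaGetLastError();  // catches launch failures
    cudaError_t copy_err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                      cudaMemcpyDeviceToHost);  // blocks until done
    cudaFree(d_in); cudaFree(d_taps); cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        float ref = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int j = std::min(std::max(i + k, 0), n - 1);
            ref += h_in[j] * h_taps[k + RADIUS];
        }
        max_err = std::max(max_err, std::fabs(ref - h_out[i]));
    }
    return max_err;
}
```

Calling conv1d_max_error(1 << 20, false) and conv1d_max_error(1 << 20, true) from main and profiling both runs should show whether each variant's input loads go through the read-only (texture) path on your card. Both hints are only valid if nothing writes to the input buffers while the kernel runs.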

That is odd. What was the rationale behind such a change? It seems to me that if you happened not to know about the const __restrict__ hint or __ldg(), you would essentially be missing out on hardware that previous versions used automatically.

I also found something odd: after adding const __restrict__ and re-running on the K20, bandwidth improved greatly (about 3x according to Nsight), but the L1 is still unused. The const __restrict__ data seems to have been routed through the texture cache instead.

I have not seen the rationale explained officially, but I could speculate…

The shift in design from the SM of Fermi to the SMX of Kepler changed the ratio of CUDA cores to registers quite dramatically. The number of CUDA cores per multiprocessor went up by a factor of 6, but the number of registers only went up by a factor of 2. Assuming you have to scale up the number of active threads on a multiprocessor proportional to the number of CUDA cores (not exactly true, but close enough), you will have a lot more pressure on the register file on Kepler. Achieving good occupancy in register-heavy kernels will require spilling registers to local memory more often than before.
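To put rough numbers on that ratio, here is a small host-side sketch. The per-multiprocessor figures are my assumption from NVIDIA's published specifications (32 CUDA cores and 32768 32-bit registers per Fermi SM, 192 cores and 65536 registers per Kepler SMX), and the SmSpec and registers_per_core names are made up for illustration.

```cuda
// Per-multiprocessor figures, assumed from NVIDIA's published specifications.
struct SmSpec {
    int cuda_cores;  // CUDA cores per multiprocessor
    int registers;   // 32-bit registers per multiprocessor
};

static const SmSpec kFermiSM   = {32, 32768};   // e.g. C2075
static const SmSpec kKeplerSMX = {192, 65536};  // e.g. K20

// Registers available per CUDA core (integer division, rounds down).
static int registers_per_core(const SmSpec& s)
{
    return s.registers / s.cuda_cores;
}
```

With these figures, registers_per_core(kFermiSM) is 1024 and registers_per_core(kKeplerSMX) is 341, so each core on Kepler has about a third of the register file it had on Fermi.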

On Fermi, these local memory spills went into the same L1 cache as the global memory accesses, and I would not be surprised if there was some “cache thrashing” as these two different kinds of memory access fought over cache lines. One way to solve this problem is to split the two uses into different caches. The texture cache was already there, and using it for read-only arrays was already an old CUDA trick. It looks like compute capability 3.5 simply expanded it to a full 48 kB per multiprocessor (it used to be more like 6-8 kB) and exposed a way to use it directly for normal memory loads.

Note that I have no benchmarks or quantitative analysis that demonstrates this tradeoff in the usage of L1 was a good idea for common kernels. I can only try to connect the dots based on NVIDIA’s published choices.