I think one of my kernels evicts cached data that is shared by multiple cores, because it loads a large amount of data that each thread reads only once (and no other thread touches). How can I stop it? Through PTX, or a compiler option, or some kernel parameter?
Even though the device has only 256 kB of L2 cache, some ordering (z-order?) support should make it good enough. Unfortunately, some other variables that are used only once are loaded through L1 (and L2?), and they total around 800 MB or so, which probably makes the caches unhelpful. I know I need a better algorithm, but that will take time; for now I'm just wondering about this ability to bypass L1/L2.
The L2 activity is unavoidable. All data retrieved from device memory flows through the L2. You cannot bypass it.
L1 should not actually be enabled for global loads on a Kepler Quadro K420.
The L1 can be bypassed with a non-caching load. There are plenty of descriptions of this available on the web if you care to search for them.
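A minimal sketch of one such non-caching load, assuming the Kepler-era behavior: the PTX `.cg` cache operator caches a load in L2 only, skipping L1. The function name `load_nocache` and the kernel around it are illustrative; compiling the whole file with `-Xptxas -dlcm=cg` achieves the same for all global loads without source changes.

```cuda
// Sketch: per-load L1 bypass via inline PTX (ld.global.cg caches in
// L2 only). Names here are illustrative, not from the original post.
__device__ float load_nocache(const float *p)
{
    float v;
    // .cg = "cache global": fill L2 but skip L1
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void sum_streamed(const float *in, float *out, int n)
{
    float acc = 0.0f;
    // each thread streams its elements exactly once, so there is
    // no reuse to justify polluting L1
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n;
         i += blockDim.x * gridDim.x)
        acc += load_nocache(in + i);
    atomicAdd(out, acc);
}
```

Note this only controls L1; as stated above, the L2 traffic remains.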
Kepler Tuning Guide :: CUDA Toolkit Documentation
On cc 3.5 and higher devices (yours is cc 3.0), the usual advice for read-once data is to load it through the read-only cache. This still impacts L2, however.
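For reference, a sketch of the cc 3.5+ route (so not usable on the K420 in question): the `__ldg()` intrinsic forces a load through the read-only data cache. The kernel and its parameters are illustrative.

```cuda
// Sketch for cc 3.5 and higher only: __ldg() loads through the
// read-only (texture) cache. It still fills L2 on a miss.
__global__ void scale(float *out, const float *in, float s, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = s * __ldg(in + i);  // read-only cache load
}
```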
For the read-only cache, I should use textures and constant memory, right? Just putting const (and __restrict__) on a parameter didn't make a difference in my setup.
I'm using NVRTC and the driver API, so maybe I can embed some data directly in the kernel source string? Would that data then come from the instruction cache? I mean, replicating what the compiler does for immediate constants, as an array of operations. For example, streaming 800 MB through the kernel without any parameters at all. (But I'm staying away from this for now.)
Correct. For devices of compute capability less than 3.5, your options for read-only traffic optimization are texture and constant memory.
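Since constant memory is capped at 64 kB, the texture path is the realistic option for streaming hundreds of MB on cc 3.0. A hedged sketch, binding a texture object to an existing linear device buffer (the names `copy_via_tex` and `make_linear_tex` are illustrative, not a prescribed API):

```cuda
#include <cuda_runtime.h>

// Sketch for cc 3.0: route read-once loads through the texture cache
// by wrapping plain linear device memory in a texture object.
__global__ void copy_via_tex(cudaTextureObject_t tex, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);  // load via texture cache
}

// Wrap an existing device buffer in a texture object (no data copy).
cudaTextureObject_t make_linear_tex(float *d_ptr, int n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_ptr;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

Texture objects work with the driver API too (cuTexObjectCreate), so this fits an NVRTC-based setup.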
On devices of cc 3.5 and higher, decorating a global pointer with const __restrict__ is a strong hint to the compiler to use the "read-only" cache for that load traffic.
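A minimal sketch of that decoration, assuming cc 3.5+ (the kernel name and operation are illustrative). With both qualifiers present, the compiler can emit LDG (read-only cache) loads for `x` on its own, with no `__ldg()` call needed:

```cuda
// const + __restrict__ on the read-only pointer lets the compiler
// prove no aliasing and route loads of x through the read-only
// cache on cc 3.5+ devices.
__global__ void axpy(float a,
                     const float * __restrict__ x,
                     float * __restrict__ y,
                     int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```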