The underlying PTX instruction set has cache-operator modifiers for the load instruction (and for some other instructions as well, such as stores) that change caching behaviour and can bypass a cache level entirely.
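To make that concrete, here is a minimal sketch of a load that uses the PTX .cg cache operator (cache in L2, bypass L1) through inline assembly. The names load_cg, copy_cg and run_copy_cg are made up for illustration, and the sketch assumes a 64-bit build and a pointer into global memory; treat it as a starting point, not as the one right way to do it.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: load one float with the .cg cache operator
// (cached in L2 only, bypassing L1). Assumes a 64-bit build (the "l"
// constraint) and that p points to global memory.
__device__ __forceinline__ float load_cg(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Copies in[] to out[], reading every element through load_cg.
__global__ void copy_cg(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = load_cg(in + i);
}

// Hypothetical host driver: returns true if the copy round-trips correctly.
bool run_copy_cg(int n) {
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copy_cg<<<blocks, threads>>>(d_in, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    // Verify on the host that every element arrived unchanged.
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_in[i]) ok = false;
    }
    return ok;
}
```

Call run_copy_cg(1 << 20) from your own main to try it. Swapping .cg for .ca (cache at all levels) or .cv (do not cache, fetch again) changes the behaviour. As far as I know, recent CUDA versions also offer intrinsics such as __ldcg and __ldca for the same purpose, but check the programming guide for your toolkit.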
I think this feature is also exposed in the high-level language (CUDA C), perhaps via the “volatile” keyword and perhaps others, so consult the manuals. There is also “restrict” (spelled __restrict__ in CUDA C), which as far as I remember has to do with how loads are generated and might be of some use.
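A rough sketch of how these two qualifiers look in a kernel, assuming a device of compute capability 3.5 or higher for __ldg. The kernel and function names are invented for the example. My understanding is that const plus __restrict__ allows the compiler to use the read-only data cache, while volatile forces every access to really go to memory instead of being kept in a register; how that maps onto the hardware caches is best checked in the manuals.

```cuda
#include <vector>
#include <cuda_runtime.h>

// const + __restrict__ promises the compiler that `in` is read-only and not
// aliased by `out`, so it may route the loads through the read-only cache.
__global__ void scale_readonly(const float* __restrict__ in,
                               float* __restrict__ out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
#if __CUDA_ARCH__ >= 350
        out[i] = k * __ldg(&in[i]);  // explicit read-only-cache load
#else
        out[i] = k * in[i];          // plain load on older devices
#endif
    }
}

// volatile: the compiler must re-read memory on every access and may not
// keep the value in a register across accesses.
__global__ void scale_volatile(const volatile float* in, float* out,
                               float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * in[i];
}

// Hypothetical host driver: runs both kernels and checks both results.
bool run_scale(int n) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_a); return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_readonly<<<blocks, threads>>>(d_in, d_a, 2.0f, n);
    scale_volatile<<<blocks, threads>>>(d_in, d_b, 2.0f, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);

    for (int i = 0; i < n; ++i) {
        if (h_a[i] != 2.0f * h_in[i] || h_b[i] != 2.0f * h_in[i]) ok = false;
    }
    return ok;
}
```

Both kernels compute the same result; only the kind of load the compiler is allowed to generate differs. If I remember correctly there is also a ptxas option (-Xptxas -dlcm=cg) that changes the default caching of global loads for a whole compilation unit.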
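The CUDA API also has functionality to specify a cache preference. On devices where L1 cache and shared memory share the same on-chip storage, this might help as well by shrinking or growing the L1 cache at the expense of shared memory.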
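A small sketch of the cache preference calls in the runtime API, namely cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig. The kernel touch and the function prefer_l1_for_touch are placeholders for this example. Keep in mind that these are only preferences: the runtime may ignore them, and on devices with a fixed L1/shared split they have no effect.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel so there is something to attach the preference to.
__global__ void touch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns true if all API calls succeeded.
bool prefer_l1_for_touch() {
    // Device-wide default: prefer more shared memory, less L1.
    if (cudaDeviceSetCacheConfig(cudaFuncCachePreferShared) != cudaSuccess)
        return false;

    // Per-kernel override: this kernel would rather have a larger L1.
    if (cudaFuncSetCacheConfig(touch, cudaFuncCachePreferL1) != cudaSuccess)
        return false;

    // Read back the device-wide preference currently in effect.
    cudaFuncCache cfg;
    if (cudaDeviceGetCacheConfig(&cfg) != cudaSuccess) return false;
    printf("device cache config enum value: %d\n", static_cast<int>(cfg));

    // Launch the kernel; the per-kernel preference applies to this launch.
    const int n = 1024;
    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d, 0, n * sizeof(float));
    touch<<<(n + 255) / 256, 256>>>(d, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    cudaFree(d);
    return ok;
}
```

The other enum values are cudaFuncCachePreferNone and cudaFuncCachePreferEqual. Whether a larger L1 actually helps depends on the kernel, so it is worth timing both settings.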
As for the L2 cache, that remains a mystery to me ;) Perhaps there is something new for it in the API or documentation? ;)
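One thing that can be done safely is to ask the runtime how large the L2 cache is, via the l2CacheSize field of cudaDeviceProp. The function name query_l2_bytes is made up for this sketch. Beyond that, I believe newer toolkits (CUDA 11 and later) document an L2 persistence limit (cudaLimitPersistingL2CacheSize) and per-stream access policy windows for recent GPUs, but I have not used them here, so check the current documentation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the L2 cache size in bytes for the given
// device, or -1 if the properties could not be queried.
int query_l2_bytes(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    printf("%s: L2 cache size = %d bytes\n", prop.name, prop.l2CacheSize);
    return prop.l2CacheSize;
}
```

Calling query_l2_bytes(0) from main prints the value for the first device.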