If I have a kernel that consumes more than the available L1 cache, is there an advantage to switching off the L1 cache and using only the L2 cache? If so, how do I switch the L1 cache off (and back on) on the fly?
The underlying PTX instruction set has modifiers for the load instruction (and perhaps other instructions as well) to change caching behaviour and potentially bypass the cache.
I think this feature is also exposed in the high-level language (CUDA C), perhaps via the keyword “volatile” and perhaps others, such as __restrict__ on const pointers… it has something to do with how loads are generated, so consult the manuals… it might be of some use.
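A minimal sketch of the kind of hint I mean, assuming a device/compiler combination that is allowed to route const __restrict__ loads through a read-only path (the kernel name and shapes below are made up for illustration):

```
// Marking the input as const and __restrict__ tells the compiler the data is
// read-only and not aliased, which lets it choose a more favourable load path.
// This is only a hint, not a guaranteed cache bypass.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;   // eligible for read-only caching
}
```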
The CUDA API also has some functionality to specify a cache preference… that might help as well to reduce or increase the L1 cache size…
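For example, a rough sketch using the runtime calls cudaFuncSetCacheConfig / cudaDeviceSetCacheConfig, which trade L1 size against shared memory on devices where the two share on-chip storage (the kernel name here is hypothetical):

```
#include <cuda_runtime.h>

// Hypothetical kernel; body omitted.
__global__ void myKernel(float *data) { }

int main()
{
    // Ask for more shared memory (and therefore a smaller L1) for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or set a device-wide default that favours a larger L1 instead.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    myKernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```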
Concerning the L2 cache… that remains a mystery to me ;) Perhaps there is something new for it in the API or documentation? ;)
If an instruction's memory accesses are highly divergent and the address ranges are accessed only once, there can be bandwidth savings from performing uncached global loads. Caching can be controlled on a per-instruction basis using inline PTX. L1 caching of global loads can also be disabled for the whole compilation unit with the compiler option -Xptxas -dlcm=cg.
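A minimal sketch of a per-instruction L1 bypass with inline PTX, assuming a 64-bit build (so the pointer uses the "l" constraint); the helper and kernel names are my own:

```
// ld.global.cg caches the load in L2 only; the default ld.global.ca also
// caches it in L1.
__device__ __forceinline__ float load_cg(const float* ptr)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];"
                 : "=f"(value)
                 : "l"(ptr));
    return value;
}

__global__ void copy_uncached(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = load_cg(in + i);   // bypasses L1, still goes through L2
}
```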
For more information see the Global Memory section in the CUDA Programming Guide for the compute capability of your GPU.
If you are developing for a compute capability 3.5 device, you may also want to investigate the LDG instruction, which performs read-only global loads through the texture cache. The texture cache can give better performance for highly divergent memory accesses, and when the application is heavily using shared, local, or global memory.
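A minimal sketch of that read-only path via the __ldg() intrinsic, which compiles to LDG when built for compute capability 3.5 or higher (the gather kernel below is just an illustration):

```
// Requires compilation with -arch=sm_35 or newer for __ldg() to be available.
__global__ void gather(const float* __restrict__ table,
                       const int* __restrict__ idx,
                       float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&table[idx[i]]);   // routed through the read-only (texture) cache
}
```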
All device memory accesses always go through L2.
System memory accesses are currently not cached in L2.