If an instruction's memory accesses are highly divergent and each address range is accessed only once, there can be bandwidth savings from performing uncached global loads. Caching can be controlled on a per-instruction basis using inline PTX, and L1 caching of global loads can also be disabled program-wide with the compiler option -Xptxas -dlcm=cg.
For more information see the Global Memory section in the CUDA Programming Guide for the compute capability of your GPU.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-2-x
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x
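As a sketch of the per-instruction approach, the kernel below (hypothetical names) uses inline PTX with the .cg cache operator, which caches the load in L2 only and bypasses L1:

```cuda
// Minimal sketch: a copy kernel whose load bypasses L1 via inline PTX.
// "uncachedCopy" and its parameters are illustrative names, not a real API.
__global__ void uncachedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v;
        // ld.global.cg = load with "cache global" operator: L2 only, no L1.
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(in + i));
        out[i] = v;
    }
}
```

Compiling the whole translation unit with nvcc -Xptxas -dlcm=cg has the same effect for every global load, without touching the source.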
If you are developing for a compute capability 3.5 device, you may also want to investigate the LDG instruction, which performs read-only global loads through the texture cache. The texture cache can perform better for highly divergent memory accesses, and because it is a separate path it helps when the application is already heavily accessing shared, local, or global memory.
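A minimal sketch of using LDG through the __ldg() intrinsic (the kernel name is illustrative; __ldg() itself is the documented CUDA intrinsic for compute capability 3.5+):

```cuda
// Minimal sketch: read-only global loads routed through the texture
// (read-only data) cache on compute capability 3.5+ devices.
__global__ void ldgCopy(const float* __restrict__ in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // __ldg() compiles to the LDG instruction
}
```

Marking the pointer const __restrict__ also lets the compiler generate LDG on its own when it can prove the data is read-only for the lifetime of the kernel.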
All device memory accesses go through L2. System memory accesses are currently not cached in L2. There are no additional cache controls for L2.