Pascal L1 cache

GM10x, GM20x, and GP10x have very similar TEX/L1 designs. Starting with GM20x, TEX/L1 caching of global memory loads can be enabled for non-constant data (data that is not read-only for the lifetime of the kernel). See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-5-x.

On all Maxwell and Pascal GPUs, the L1/TEX cache line size is 128B, consisting of four 32B sectors. On a load miss, only the 32B sectors of the cache line that were actually accessed are loaded from L2.

The TEX/L1 cache issues only as many 32B sector requests as are needed to satisfy all threads in the load. Other sectors in the same cache line that are not accessed are not prefetched. The CUDA profilers expose enough TEX and L2 counters that a quick test can demonstrate this behavior.