If you are developing for a compute capability 3.5 device, you may also want to investigate the LDG instruction, which performs read-only global access through the texture cache. The texture cache can perform better for highly divergent memory accesses, and when the application is already heavily accessing shared, local, or global memory.
Compute Capability 3.5 (Kepler) GPUs had separate L1/SHM and texture caches. The LDG instruction enabled loading constant (read-only) data through the texture cache instead of the L1 cache. This increased the effective “L1” cache space and throughput over using only the L1/SHM cache. The LD (load generic) instruction was used on Kepler to read through the L1 cache.
Kepler LDG information can be found in the Kepler Tuning Guide under section L1 Cache and Read-only Data Cache. There were also numerous GTC presentations on how to use the read-only cache (LDG instruction).
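As a minimal sketch of how the read-only path is usually requested from CUDA C++: the `__ldg()` intrinsic (compute capability 3.5 or newer) asks for a read-only load explicitly, and marking the pointer `const ... __restrict__` lets the compiler choose that path on its own. The kernel name `scale_ldg` and the host helper `run_scale_ldg` are hypothetical names made up for this illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: multiplies a read-only input array by a factor.
// `const ... __restrict__` tells the compiler the input is not aliased or
// written during the kernel; `__ldg` requests the read-only load explicitly.
__global__ void scale_ldg(const float* __restrict__ in, float* __restrict__ out,
                          float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = factor * __ldg(&in[i]);  // needs sm_35 or newer
    }
}

// Hypothetical host helper: runs the kernel and checks every output element.
bool run_scale_ldg(int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_ldg<<<blocks, threads>>>(d_in, d_out, 2.0f, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * h_in[i]) return false;
    }
    return true;
}
```

Calling `run_scale_ldg(1000)` from `main` should return `true` on any device that supports `__ldg`. Whether the compiler actually emits LDG for a given load can be checked by disassembling the binary with `cuobjdump -sass`.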
Kepler L1 and Texture Cache
On an L1 cached load, a miss to a sector is promoted to a cache line miss (4 x 32B sectors = 128B), potentially resulting in over-fetching from L2.
On an L1 un-cached load, a miss is not promoted, so only the missed sectors will be fetched from the L2.
The texture cache will only fetch missed sectors (like un-cached L1).
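The difference between line-granular and sector-granular fetching can be made concrete with a little arithmetic. The sketch below is host-only code and a simplified model, not a description of the hardware: it assumes a cold cache, one warp of 32 threads, each thread reading one 4-byte word at `base + lane * stride_bytes` (with `base` and `stride_bytes` multiples of 4), and it simply counts unique 128B lines versus unique 32B sectors. The names `L2Traffic` and `warp_l2_traffic` are hypothetical.

```cuda
#include <cstddef>
#include <cstdint>
#include <set>

constexpr std::size_t kSectorBytes = 32;               // one sector
constexpr std::size_t kLineBytes = 4 * kSectorBytes;   // 4 sectors = 128B line

// Bytes requested from L2 by one warp under the two miss policies.
struct L2Traffic {
    std::size_t line_promoted_bytes;  // L1 cached load: miss promoted to a line
    std::size_t sector_only_bytes;    // un-cached L1 or texture: missed sectors only
};

// Simplified model: every access misses, and each thread reads a 4-byte
// word that does not straddle a sector boundary.
L2Traffic warp_l2_traffic(std::uint64_t base, std::size_t stride_bytes)
{
    std::set<std::uint64_t> lines;
    std::set<std::uint64_t> sectors;
    for (int lane = 0; lane < 32; ++lane) {
        const std::uint64_t addr = base + lane * stride_bytes;
        lines.insert(addr / kLineBytes);
        sectors.insert(addr / kSectorBytes);
    }
    return { lines.size() * kLineBytes, sectors.size() * kSectorBytes };
}
```

Under this model a fully coalesced warp (`stride_bytes = 4`) requests 128 bytes either way, while a warp striding by 128 bytes touches 32 lines: 4096 bytes when misses are promoted to lines, but only 1024 bytes when only the missed sectors are fetched.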
A Kepler SM has 2 (gk208, gk20a) or 4 (gk10x, gk110, *) texture caches. There is a fixed relationship between an SM sub-partition (warp scheduler) and a texture cache. Using the LDG instruction to read the same data from all warps (multiple SM sub-partitions) may result in the same data being resident in all texture caches. The texture caches in the SM are not coherent, which is why access through them can only be read-only. To gain the benefit of the full cache footprint of the N texture caches, it is useful to access different addresses per warp.
On a load accessing divergent addresses (different cache lines), the SM warp scheduler has to replay the instruction. This is also true on misses. In contrast, address divergence is handled inside the texture cache, avoiding the loss in math throughput due to instruction replays.
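One way to give each warp its own addresses is to partition the read-only data into per-warp slices. The following is only a sketch of that access pattern: the kernel `sum_warp_slices` and the helper `run_sum_warp_slices` are hypothetical, the block size is assumed to be a multiple of 32, and no claim is made here about how warps map onto sub-partitions on any particular chip.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical kernel: every warp reads a disjoint slice of a read-only
// table through __ldg and writes the slice sum to warp_sums.
__global__ void sum_warp_slices(const float* __restrict__ table, int slice_len,
                                float* __restrict__ warp_sums)
{
    const int lane = threadIdx.x % warpSize;
    const int warps_per_block = blockDim.x / warpSize;
    const int warp = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    const float* slice = table + static_cast<std::size_t>(warp) * slice_len;

    float acc = 0.0f;
    for (int i = lane; i < slice_len; i += warpSize) {
        acc += __ldg(&slice[i]);  // addresses differ from every other warp
    }
    // Warp-level tree reduction; lane 0 ends up with the slice total.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        acc += __shfl_down_sync(0xffffffffu, acc, offset);
    }
    if (lane == 0) warp_sums[warp] = acc;
}

// Hypothetical host helper: fills the table with 1.0f so each warp's sum
// must equal slice_len, then verifies that.
bool run_sum_warp_slices(int blocks, int warps_per_block, int slice_len)
{
    const int num_warps = blocks * warps_per_block;
    const std::size_t table_len = static_cast<std::size_t>(num_warps) * slice_len;
    std::vector<float> h_table(table_len, 1.0f);
    std::vector<float> h_sums(num_warps, 0.0f);

    float* d_table = nullptr;
    float* d_sums = nullptr;
    if (cudaMalloc(&d_table, table_len * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_sums, num_warps * sizeof(float)) != cudaSuccess) {
        cudaFree(d_table);
        return false;
    }
    cudaMemcpy(d_table, h_table.data(), table_len * sizeof(float),
               cudaMemcpyHostToDevice);

    sum_warp_slices<<<blocks, warps_per_block * 32>>>(d_table, slice_len, d_sums);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(h_sums.data(), d_sums, num_warps * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_table);
    cudaFree(d_sums);
    if (err != cudaSuccess) return false;

    for (int w = 0; w < num_warps; ++w) {
        if (h_sums[w] != static_cast<float>(slice_len)) return false;
    }
    return true;
}
```

The alternative, where every warp walks the whole table, would still be correct, but per the reasoning above it may leave the same lines duplicated in each texture cache.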
The texture and L1 data cache share the request path to L2 and the return path from L2.
Compute Capability 5.x (Maxwell) - 6.x (Pascal) unified the L1 and Texture caches but moved SHM to a separate unit. On these architectures LDG is used to differentiate between a generic load (LD) and a global load. Generic loads have a slight penalty compared to global loads because the LSU has to determine whether the address is shared, local, or global. Additional serialization is required if threads executing the same instruction access multiple address windows (shared, local, and global). Where possible, it is preferable to use LDS (load shared), LDL (load local), and LDG (load global).
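In source code, the choice between a generic and a space-specific load is made by the compiler, based on whether it can prove which address space a pointer refers to. The sketch below is an assumption-laden illustration: `load_any`, `mixed_spaces`, and `run_mixed_spaces` are hypothetical names, the block size is assumed to be at most 256, and the instruction the compiler actually picks for each load depends on the toolkit version and target, so the comments describe what one would typically expect rather than a guarantee. Disassembling with `cuobjdump -sass` shows what was really emitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// The callee cannot know whether p points to shared or global memory, so
// the compiler will likely fall back to a generic load here.
__device__ __noinline__ float load_any(const float* p)
{
    return *p;
}

// Hypothetical kernel mixing address spaces; assumes blockDim.x <= 256.
__global__ void mixed_spaces(const float* __restrict__ g, float* out, int n)
{
    __shared__ float tile[256];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;

    const float v = (i < n) ? g[i] : 0.0f;  // known global pointer
    tile[threadIdx.x] = v;
    __syncthreads();

    const float a = tile[threadIdx.x];               // known shared address
    const float b = load_any(&tile[threadIdx.x]);    // address space hidden
    const float c = (i < n) ? load_any(&g[i]) : 0.0f;
    if (i < n) out[i] = a + b + c;                   // equals 3 * g[i]
}

// Hypothetical host helper: checks that out[i] == 3 * g[i] for all i.
bool run_mixed_spaces(int n)
{
    std::vector<float> h_g(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_g[i] = static_cast<float>(i);

    float* d_g = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_g, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_g);
        return false;
    }
    cudaMemcpy(d_g, h_g.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    mixed_spaces<<<blocks, threads>>>(d_g, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_g);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != 3.0f * h_g[i]) return false;
    }
    return true;
}
```

Keeping loads in code where the address space is visible to the compiler (direct accesses to `__shared__` arrays, or kernel parameters that point to global memory) is the usual way to let it avoid the generic path.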
Compute Capability 7.x+ (Volta, Turing, Ampere, Ada, Hopper) unified the L1 Data Cache, Texture Cache, and Shared Memory into a single unit. The ISA matches Compute Capability 5.x above.