CUDA 8.0 RC has new global load intrinsics with explicit cache modifiers

CUDA 8.0 RC adds three new load intrinsics that save you from writing inline PTX. Together with the existing __ldg(), the full set is:

  • __ldg() : ld.global.nc : load via the non-coherent (read-only data) cache; first seen in sm_35
  • __ldca() : ld.global.ca : load and cache at all levels (L1 and L2)
  • __ldcg() : ld.global.cg : load and cache at global level (in L2 and below, but not in L1)
  • __ldcs() : ld.global.cs : streaming load; the data is expected to be accessed once, so the line is marked evict-first
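
As a rough illustration of how the list above maps onto code, here is a minimal sketch. The kernel name `scale_by_table`, the helper `count_mismatches`, and the table size are all made up for this example, and it assumes a toolkit and GPU that support these intrinsics. The input is read exactly once per thread, so it goes through __ldcs(); the small coefficient table is read-only and reused, so it goes through __ldg(). Swapping in __ldca() or __ldcg() should be a drop-in change since they take the same pointer argument.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = in[i] * table[i % table_len].
__global__ void scale_by_table(const float* __restrict__ in,
                               const float* __restrict__ table,
                               float* __restrict__ out,
                               int table_len, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each input element is touched once: streaming load (ld.global.cs).
        float x = __ldcs(in + i);
        // The table is read-only and reused by many threads (ld.global.nc).
        float c = __ldg(table + (i % table_len));
        out[i] = x * c;
    }
}

// Hypothetical host helper: runs the kernel on n elements and returns the
// number of results that differ from a CPU reference (-1 on a CUDA error).
int count_mismatches(int n)
{
    const int table_len = 16;  // arbitrary size chosen for the example
    std::vector<float> h_in(n), h_out(n, 0.0f), h_table(table_len);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);
    for (int j = 0; j < table_len; ++j) h_table[j] = static_cast<float>(j + 1);

    float *d_in = nullptr, *d_table = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_table, table_len * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1;

    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_table, h_table.data(), table_len * sizeof(float),
               cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_by_table<<<blocks, threads>>>(d_in, d_table, d_out, table_len, n);

    // The blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_table);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i] * h_table[i % table_len]) ++bad;
    return bad;
}
```

Calling count_mismatches(1024) from main() should return 0 on a working setup. The cache modifiers are only hints, so the results are the same whichever intrinsic you pick; any difference shows up in a profiler, not in the output.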

My guess is that these intrinsics and the new ATOM/RED scope modifiers matter most when interacting with “distant” GPUs over NVLink or another fabric.