sm_35 introduced a new way to load data through the texture / read-only cache that is much more convenient than using classical bound textures in conjunction with tex1Dfetch(). A new instruction, LDG, serves this purpose; it does not require binding of textures and does not have the size restrictions of tex1Dfetch(). It is accessible via the intrinsic __ldg(). The documentation of that new intrinsic was inadvertently left out of the CUDA 5.0 documentation; we are in the process of fixing this for the next CUDA release.
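To illustrate the explicit route, here is a minimal sketch of a kernel that reads its input through __ldg(). The kernel name scale_ldg and the host helper run_scale_ldg are made up for this example, and the code assumes it is compiled for sm_35 or later (for instance with nvcc -arch=sm_35).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = factor * in[i], with the input read via __ldg().
// No texture binding is needed; the pointer is passed like any other argument.
__global__ void scale_ldg(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Explicit load through the read-only data cache.
        out[i] = factor * __ldg(&in[i]);
    }
}

// Hypothetical host helper (assumes n > 0): fills in[i] = i, runs the kernel,
// and returns the number of mismatching results, or -1 on a CUDA error.
int run_scale_ldg(int n, float factor)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, &h_in[0], bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale_ldg<<<blocks, threads>>>(d_in, d_out, factor, n);

    // The blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(&h_out[0], d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != factor * h_in[i]) ++mismatches;
    return mismatches;
}
```

The input buffer is written only by the host before the launch and never by the kernel, which is the situation in which reading through the non-coherent cache is safe.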
Use of the intrinsic __ldg() is the only way to ensure that the LDG instruction is used for a particular access. The use of the existing C/C++ qualifiers ‘const’ and ‘restrict’ facilitates the automatic generation of LDG, but it does not (and cannot) guarantee it. The Kepler Tuning Guide says as much. We are considering how we could make the relevant sections clearer, but as far as I can see the current description is factually correct.
‘const’ and ‘restrict’ are simply ways of conveying additional information about data objects to the compiler. In particular, use of both qualifiers with a pointer says “the object pointed to by this qualified pointer is read-only when accessed through this pointer, and there are no other pointers through which it is accessed (and in particular written to), within the scope of this pointer”. That last part is very important.
Since the texture cache is not coherent, safe use of LDG requires that an object is read-only for the entire lifetime of a kernel. Clearly this is not the same as what is asserted by the programmer through use of ‘const restrict’, except for very simple kernels, in particular those that do not call any device functions. The compiler does analysis to determine whether it is safe to use LDG, and ‘const restrict’ helps it prove that, but there are various reasons why it may not be able to prove that the use is safe (think about separately compiled device functions, for example).
Appropriate use of ‘const’ and ‘restrict’ is a best practice that facilitates various optimizations. One of those optimizations is the generation of the LDG instruction. Please note that by using ‘restrict’, a programmer makes an assertion / promise to the compiler about the absence of aliasing, so the programmer is responsible for making sure that assertion matches reality.
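As a sketch of the qualifier-based route, the kernel below marks both inputs as const and restrict. In CUDA C++ the restrict qualifier is spelled __restrict__. The names add_restrict and run_add_restrict are invented for this example. Whether the compiler actually emits LDG for these loads is its own decision; when building for sm_35, inspecting the machine code with cuobjdump -sass is one way to check.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: c[i] = a[i] + b[i].
// 'const' + '__restrict__' on a and b promises that the data is not written
// through any other pointer in this scope, which may let the compiler use LDG.
__global__ void add_restrict(const float * __restrict__ a,
                             const float * __restrict__ b,
                             float * __restrict__ c,
                             int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];  // plain loads; LDG is possible but not guaranteed
    }
}

// Hypothetical host helper (assumes n > 0): returns the number of mismatching
// results, or -1 on a CUDA error.
int run_add_restrict(int n)
{
    std::vector<float> h_a(n), h_b(n), h_c(n, 0.0f);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 1.0f; }

    float *d_a = 0, *d_b = 0, *d_c = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc(&d_c, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1; }
    cudaMemcpy(d_a, &h_a[0], bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b[0], bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // The three buffers are distinct allocations, so the no-aliasing promise holds.
    add_restrict<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaError_t err = cudaMemcpy(&h_c[0], d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_c[i] != h_a[i] + h_b[i]) ++mismatches;
    return mismatches;
}
```

Passing the same buffer as both an input and the output would break the promise made by __restrict__, and the result would then be undefined regardless of whether LDG is generated.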
[Later:]
The __ldg() intrinsic is an overloaded function __ldg(const T*) where T is one of the following built-in types:
char, short, int, long long
unsigned char, unsigned short, unsigned int, unsigned long long
int2, int4, uint2, uint4
float, float2, float4
double, double2
__ldg() is exported by the header file sm_35_intrinsics.h (which like other such files is auto-included by nvcc when compiling CUDA source).
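The vector overloads are used the same way as the scalar ones. A minimal sketch with the float4 overload follows; the names sum_float4_ldg and run_sum_float4_ldg are made up for illustration, and the code again assumes compilation for sm_35 or later. No explicit include of sm_35_intrinsics.h is needed because nvcc pulls it in automatically.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = sum of the four components of in[i],
// with each float4 fetched in a single 16-byte load via __ldg().
__global__ void sum_float4_ldg(const float4 *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = __ldg(&in[i]);  // float4 overload of __ldg()
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Hypothetical host helper (assumes n > 0): returns the number of mismatching
// results, or -1 on a CUDA error.
int run_sum_float4_ldg(int n)
{
    std::vector<float4> h_in(n);
    std::vector<float> h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = make_float4((float)i, 1.0f, 2.0f, 3.0f);

    float4 *d_in = 0;
    float *d_out = 0;
    // cudaMalloc returns memory aligned suitably for float4 accesses.
    if (cudaMalloc(&d_in, n * sizeof(float4)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, &h_in[0], n * sizeof(float4), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sum_float4_ldg<<<blocks, threads>>>(d_in, d_out, n);

    cudaError_t err = cudaMemcpy(&h_out[0], d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float expected = h_in[i].x + h_in[i].y + h_in[i].z + h_in[i].w;
        if (h_out[i] != expected) ++mismatches;
    }
    return mismatches;
}
```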