I was inspecting a kernel and expected to see an LDG.CI instruction.
Any hint as to how LDG.E differs from LDG and LDG.CI?
LDG.E.64 R22, [R12+-0x300];
( This is on CUDA 7.5 RC)
The E is for extended 64-bit pointers (both R12 and R13 contain the address above). Compile your code for 32-bit and the E’s should go away.
Thanks!
But I was also expecting to see some sort of constant cache indicator like .CI since the pointer is const-restrict’d and loading native types.
Later…
An explicit __ldg() does emit the .CI:
LDG.E.CI.64 R8, [R16+-0x300];
The read-only cache load instruction remains shy and elusive.
It should probably be renamed from LDG.CI to LDG.SASQUATCH.
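For reference, a minimal sketch of the kind of kernel that produces the SASS above (the names and the 64-bit element type are made up for illustration):

```cuda
// Hypothetical example: an explicit __ldg() forces the read-only
// (texture) cache path; a 64-bit load then shows up as LDG.E.CI.64.
__global__ void copy_ro(const double* __restrict__ in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // explicit read-only cache load
}
```

Compiling to a cubin and disassembling with cuobjdump -sass is one way to check which LDG flavor was emitted.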
Maybe I am confused.
The GPU’s cosntant cache is a small cache with a broadcast feature, generally used for data declared constant. I am not entirely sure whether this constant cache still exists in Maxwell, but I think it does. I am not aware that __ldg() can be used to read through the constant cache. Is there documentation that says LDG.CI performs a read through the constant cache?
LDG in Kepler was designed to read through the texture/read-only cache. “const restrict” pointers make it easier for the compiler to determine that it is safe to generate LDGs automatically without the explicit use of the __ldg() intrinsic. In Maxwell, the texture and L1 caches have been merged, so presumably LDG now reads through that combined cache.
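As a sketch, the decoration being discussed looks like this (names are illustrative; whether the compiler actually emits LDG through the read-only cache still depends on its analysis):

```cuda
// With both qualifiers the compiler *may* generate LDG automatically,
// without an explicit __ldg(): 'in' is read-only for the whole kernel
// and __restrict__ asserts that 'in' and 'out' do not alias.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];  // eligible for the read-only load path
}
```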
Maxwell unifies L1 and tex.
But general global loads are now only cached in L2 (off-SMM).
Read-only loads are cached in L1 (on-SMM).
The open question is what heuristics or conditions are squelching LDG.CI.
What information is there that describes the differences between LDG and LDG.CI?
The PTX manual describes ld.global.nc and this appears to always map to LDG.CI.
In some cases, you can opt in to having general global loads cached in “L1” on Maxwell, which may affect code generation:
http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#l1-cache
I’ve noticed the same inconsistencies and I typically use explicit __ldg() loads when I want cache incoherent access (CI). LDG.CI is particularly helpful if you have any kind of strided access pattern. For example, I’ve seen transpose written the naive way (but with __ldg) run just as fast as an efficient shared memory implementation. Though this only works for smaller matrix sizes. At some point you saturate the texture cache and the shared memory implementation starts running much faster.
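A naive transpose of the kind described might look like this (dimensions and names are illustrative):

```cuda
// Naive transpose with __ldg(): the strided reads go through the
// texture/read-only cache, which can hide the poor access pattern
// for matrices small enough that the working set fits in that cache.
__global__ void transpose_naive(float* __restrict__ out,
                                const float* __restrict__ in,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = __ldg(&in[y * width + x]);  // strided read
}
```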
I guess you are referring to this:
“In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.”
I do not see how this connects to LDG vs LDG.CI. Seems to me that the constantly changing nature of the GPU cache hierarchies plus NVIDIA’s tight-lipped approach to documenting them is becoming really confusing.
In any event, are we in agreement that none of these LDG flavors has anything to do with the constant cache, which comes into play when loading data via LDC?
I think the confusion comes from the fact that the documentation says memory needs to be declared as “const restrict” for this instruction to be used.
But I agree on the poorly documented cache hierarchy. I was talking to an nvidia hardware engineer at GTC this year and he claims writable L1 is now back in Maxwell. So maybe that means register spills are faster now? I wouldn’t know as I almost never write code that spills registers. One of these days I’ll get around to writing a probe of caching behavior but I think I understand it well enough for the limited ways in which I currently rely on it.
Yes, we’re in agreement and only talking about the unified L1/tex cache on Maxwell and not the read-only constant cache. The docs are a little loose and also refer to the L1/tex cache as a “read-only data cache”.
I’ve come full circle in my large CUDA codebase and am returning to explicit __ldg() calls.
In order to use LDG, which is a load through a non-coherent cache, two conditions have to be met to preserve C/C++ semantics:
(1) The data object accessed must be read only, for the entire duration of the kernel.
(2) No writable data object may be aliased, in whole or partially, to the read-only object, anywhere in the kernel.
For very simple kernels, the compiler may be able to prove that these conditions hold without any help from the programmer. For slightly more complicated kernels, the use of ‘const’ helps the compiler establish (1) and the use of ‘restrict’ helps it establish (2); this does not guarantee that LDG will be generated.
For complicated kernels, even this help may not suffice, mostly because ‘const’ and ‘restrict’ assert local properties but use of LDG requires global properties valid across the totality of code in the kernel. The CUDA compiler’s aggressive inlining may also help in establishing the desired global property. Since it may be impossible for the compiler to prove that it is safe to use LDG there is an __ldg() intrinsic with which the programmer can force the use of LDG.
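As an illustration of why the compiler may be unable to prove these properties on its own, consider a sketch like the following (names hypothetical):

```cuda
// 'in' is const here, but without __restrict__ on both pointers the
// compiler must assume the writes to out[] may alias in[], so it
// cannot safely use the incoherent LDG path. The explicit __ldg()
// is the programmer's assertion that the read-only load is safe.
__global__ void blur1d(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * __ldg(&in[i - 1])
               + 0.50f * __ldg(&in[i])
               + 0.25f * __ldg(&in[i + 1]);
}
```

If out[] really did overlap in[], forcing LDG this way would read stale data, which is exactly the hazard the compiler is being conservative about.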
As a historic side note, the use of ‘const’ and ‘restrict’ also helps the compiler establish that it is safe to use the LDU (load uniform) instruction on sm_2x platforms. If I recall correctly, LDUs go through the constant cache, making use of the broadcast feature. In order to establish uniformity, the compiler must also prove that the addresses are not dependent on thread index.
Hmm… the info I got is that, unlike Kepler, Maxwell LMEM/spill loads are cached and stores are not.
You wind up with an LMEM load miss after every store.
Spills, locals and doubles are bugs! :)
-Xptxas=-v,-warn-spills,-warn-lmem-usage,-warn-double-usage
(OK, doubles aren’t bugs. Just trolling…)
I will not go looking for chapter & verse now, but as I recall, in sm_30 and sm_35, LMEM loads go through the L1 cache, while global loads never do. For sm_37, global loads may also go through the L1. The sm_30 and sm_35 behavior makes sense given how tiny the L1 cache is, and how critical spill performance can be. One could say that in Kepler the L1 cache acts as victim cache for the register file.
Other than the L1 cache being merged with the texture cache into one physical structure in Maxwell, I am not aware of behavioral changes, which does not mean they don’t exist.
I really think NVIDIA should rethink their strategy of handing out the barest minimum of information on the cache hierarchies of currently supported GPUs. It is close to impossible to form a consistent mental model of the hardware’s behavior in support of optimizations efforts.
@njuffa, thanks for the summary.
One problem with using const restrict as a hint to use the L1/tex load path is that these qualifiers are already part of general code performance and hygiene.
One nagging concern is that proper decoration of pointers with const restrict might result in excessive traffic through the L1/tex cache thus reducing its hit rate.
What about when I want to use the qualifiers but don’t want the read-only data cache path to be used?
There are global ptxas caching switches (-dlcm/-flcm) but using them is admitting defeat. :)
The good news is that if you’re concerned about LDG.CI then you’ve probably already heavily optimized your kernel.
Use of ‘const’ and ‘restrict’ is a best practice independent of any optimization aspects and CPU or GPU architecture. Their use should not be construed as a particular “hint” to generate LDG. It is not.
These qualifiers simply provide additional information to the compiler. That is all. The compiler may or may not be able to put that information to good use. In the balance, providing more information to the compiler is always a good thing. Even if it cannot take advantage of that information today it may be able to do so in future versions.
The use of ‘restrict’ allows the compiler to perform various optimizations that are not possible in the context of aliasing. One such optimization is much more aggressive load re-ordering, another such optimization is the generation of LDG.
Historically, ‘restrict’ was added by the C99 folks to make C performance competitive with Fortran, which has had a general no-aliasing requirement since Fortran 66 (other than for explicit aliasing with EQUIVALENCE) and therefore could optimize number-crunching code better. Why the C++ folks did not adopt it, I do not know. As far as I know it is not even part of C++11. But most C++ compilers offer it as an extension, thus ‘__restrict__’, meaning the symbol is in the compiler’s rather than the global namespace.
C/C++ HLL features are not designed for low-level control of a processor’s cache hierarchy. I have not checked recently, but how do compilers for x86 provide that control outside the use of inline assembly and intrinsics, the mechanisms available to CUDA programmers?
As an alternative to plastering __ldg() calls all over your code, have you tried the --restrict switch of nvcc? I have not tried it, but my expectation would be that it is similar to the “assume no aliasing” switches available on other compilers. Of course, that global “no aliasing” assertion better be true, or code will break.
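For what it’s worth, my understanding is that --restrict makes nvcc treat every kernel pointer parameter as if it were declared __restrict__, so a kernel written without explicit qualifiers, like the sketch below, would be compiled under the no-aliasing assumption (which had better actually hold):

```cuda
// Sketch: compiled with 'nvcc --restrict ...', the compiler assumes
// 'x' and 'y' do not alias even though neither parameter is
// explicitly qualified with __restrict__. If they do alias in some
// call, the generated code may be wrong.
__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```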
Based on experiments I performed on various Kepler platforms, I would say that it is possible, but not likely, that the use of ‘const’ plus ‘__restrict__’ will lead to lower performance. In the few cases where I saw the performance drop slightly the reason was typically register pressure increase due to increased load re-ordering and load batching, not increased cache utilization.
I haven’t tried the “--restrict” switch. I’ll give it a shot soon.
This is actually mentioned in the documentation:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#restrict
@allanmc: That’s a good idea to only selectively use CI where needed. I think that might speed up my elementwise ops some. And maybe the gemm variants like NN where only one of the inputs has strided access. I remember trying this a while back and didn’t see any performance benefit. But I don’t think I was applying it cleverly enough at the time.