I was inspecting a kernel and expected to see an LDG.CI instruction.
Any hint as to how LDG.E differs from LDG and LDG.CI?
LDG.E.64 R22, [R12+-0x300];
( This is on CUDA 7.5 RC)
The E is for extended 64-bit pointers (both R12 and R13 contain the address above). Compile your code for 32-bit and the E’s should go away.
Thanks!
But I was also expecting to see some sort of constant cache indicator like .CI since the pointer is const-restrict’d and loading native types.
Later…
An explicit __ldg() does emit the .CI:
LDG.E.CI.64 R8, [R16+-0x300];
The read-only cache load instruction remains shy and elusive.
It should probably be renamed from LDG.CI to LDG.SASQUATCH.
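For reference, a minimal sketch of the kind of kernel that produces the SASS above (the names and the 64-bit element type are made up for illustration):

```cuda
// Hypothetical example: an explicit __ldg() forces the read-only
// (texture) cache path; a 64-bit load then shows up as LDG.E.CI.64.
__global__ void copy_ro(const double* __restrict__ in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]);  // explicit read-only cache load
}
```

Compiling to a cubin and disassembling with cuobjdump -sass is one way to check which LDG flavor was emitted.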
Maybe I am confused.
The GPU’s cosntant cache is a small cache with a broadcast feature, generally used for data declared constant. I am not entirely sure whether this constant cache still exists in Maxwell, but I think it does. I am not aware that __ldg() can be used to read through the constant cache. Is there documentation that says LDG.CI performs a read through the constant cache?
LDG in Kepler was designed to read through the texture/read-only cache. “const restrict” pointers make it easier for the compiler to determine that it is safe to generate LDGs automatically without the explicit use of the __ldg() intrinsic. In Maxwell, the texture and L1 caches have been merged, so presumably LDG now reads through that combined cache.
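As a sketch, the decoration being discussed looks like this (names are illustrative; whether the compiler actually emits LDG through the read-only cache still depends on its analysis):

```cuda
// With both qualifiers the compiler *may* generate LDG automatically,
// without an explicit __ldg(): 'in' is read-only for the whole kernel
// and __restrict__ asserts that 'in' and 'out' do not alias.
__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];  // eligible for the read-only load path
}
```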
Maxwell unifies L1 and tex.
But general global loads are now only cached in L2 (off-SMM).
Read-only loads are cached in L1 (on-SMM).
The open question is what heuristics or conditions are squelching LDG.CI.
What information is there that describes the differences between LDG and LDG.CI?
The PTX manual describes ld.global.nc and this appears to always map to LDG.CI.
In some cases, you can opt in to having general global loads cached in “L1” on Maxwell, which may affect code generation:
http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#l1-cache
I’ve noticed the same inconsistencies and I typically use explicit __ldg() loads when I want cache incoherent access (CI). LDG.CI is particularly helpful if you have any kind of strided access pattern. For example, I’ve seen transpose written the naive way (but with __ldg) run just as fast as an efficient shared memory implementation. Though this only works for smaller matrix sizes. At some point you saturate the texture cache and the shared memory implementation starts running much faster.
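A naive transpose of the kind described might look like this (dimensions and names are illustrative):

```cuda
// Naive transpose with __ldg(): the strided reads go through the
// texture/read-only cache, which can hide the poor access pattern
// for matrices small enough that the working set fits in that cache.
__global__ void transpose_naive(float* __restrict__ out,
                                const float* __restrict__ in,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = __ldg(&in[y * width + x]);  // strided read
}
```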
I guess you are referring to this:
“In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.”
I do not see how this connects to LDG vs LDG.CI. Seems to me that the constantly changing nature of the GPU cache hierarchies plus NVIDIA’s tight-lipped approach to documenting them is becoming really confusing.
In any event, are we in agreement that none of these LDG flavors has anything to do with the constant cache, which comes into play when loading data via LDC?
I think the confusion comes from the fact that the documentation says memory needs to be declared as “const restrict” for this instruction to be used.
But I agree on the poorly documented cache hierarchy. I was talking to an nvidia hardware engineer at GTC this year and he claims writable L1 is now back in Maxwell. So maybe that means register spills are faster now? I wouldn’t know as I almost never write code that spills registers. One of these days I’ll get around to writing a probe of caching behavior but I think I understand it well enough for the limited ways in which I currently rely on it.
Yes, we’re in agreement and only talking about the unified L1/tex cache on Maxwell and not the read-only constant cache. The docs are a little loose and also refer to the L1/tex cache as a “read-only data cache”.
I’ve come full circle in my large CUDA codebase and am returning to explicit __ldg() calls.
In order to use LDG, which is a load through a non-coherent cache, two conditions have to be met to preserve C/C++ semantics:
(1) The data object accessed must be read only, for the entire duration of the kernel.
(2) No writable data object may be aliased, in whole or partially, to the read-only object, anywhere in the kernel.
For very simple kernels, the compiler may be able to prove that these conditions hold without any help from the programmer. For slightly more complicated kernels, the use of ‘const’ helps the compiler establish (1) and the use of ‘restrict’ helps it establish (2); this does not guarantee that LDG will be generated.
For complicated kernels, even this help may not suffice, mostly because ‘const’ and ‘restrict’ assert local properties but use of LDG requires global properties valid across the totality of code in the kernel. The CUDA compiler’s aggressive inlining may also help in establishing the desired global property. Since it may be impossible for the compiler to prove that it is safe to use LDG there is an __ldg() intrinsic with which the programmer can force the use of LDG.
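As an illustration of why the compiler may be unable to prove these properties on its own, consider a sketch like the following (names hypothetical):

```cuda
// 'in' is const here, but without __restrict__ on both pointers the
// compiler must assume the writes to out[] may alias in[], so it
// cannot safely use the incoherent LDG path. The explicit __ldg()
// is the programmer's assertion that the read-only load is safe.
__global__ void blur1d(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = 0.25f * __ldg(&in[i - 1])
               + 0.50f * __ldg(&in[i])
               + 0.25f * __ldg(&in[i + 1]);
}
```

If out[] really did overlap in[], forcing LDG this way would read stale data, which is exactly the hazard the compiler is being conservative about.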
As a historic side note, the use of ‘const’ and ‘restrict’ also helps the compiler establish that it is safe to use the LDU (load uniform) instruction on sm_2x platforms. If I recall correctly, LDUs go through the constant cache, making use of the broadcast feature. In order to establish uniformity, the compiler must also prove that the addresses are not dependent on thread index.
Hmm… the info I got is that, unlike Kepler, Maxwell LMEM/spill loads are cached and stores are not.
You wind up with an LMEM load miss after every store.
Spills, locals and doubles are bugs! :)
-Xptxas=-v,-warn-spills,-warn-lmem-usage,-warn-double-usage
(OK, doubles aren’t bugs. Just trolling…)
I will not go looking for chapter & verse now, but as I recall, in sm_30 and sm_35, LMEM loads go through the L1 cache, while global loads never do. For sm_37, global loads may also go through the L1. The sm_30 and sm_35 behavior makes sense given how tiny the L1 cache is, and how critical spill performance can be. One could say that in Kepler the L1 cache acts as victim cache for the register file.
Other than the L1 cache being merged with the texture cache into one physical structure in Maxwell, I am not aware of behavioral changes, which does not mean they don’t exist.
I really think NVIDIA should rethink their strategy of handing out the barest minimum of information on the cache hierarchies of currently supported GPUs. It is close to impossible to form a consistent mental model of the hardware’s behavior in support of optimizations efforts.
@njuffa, thanks for the summary.
One problem with using const restrict as a hint to use the L1/tex load path is that these qualifiers are already part of general code performance and hygiene.
One nagging concern is that proper decoration of pointers with const restrict might result in excessive traffic through the L1/tex cache thus reducing its hit rate.
What about when I want to use the qualifiers but don’t want the read-only data cache path to be used?
There are global ptxas caching switches (-dlcm/-flcm) but using them is admitting defeat. :)
The good news is that if you’re concerned about LDG.CI then you’ve probably already heavily optimized your kernel.
Use of ‘const’ and ‘restrict’ is a best practice independent of any optimization aspects and CPU or GPU architecture. Their use should not be construed as a particular “hint” to generate LDG. It is not.
These qualifiers simply provide additional information to the compiler. That is all. The compiler may or may not be able to put that information to good use. In the balance, providing more information to the compiler is always a good thing. Even if it cannot take advantage of that information today it may be able to do so in future versions.
The use of ‘restrict’ allows the compiler to perform various optimizations that are not possible in the context of aliasing. One such optimization is much more aggressive load re-ordering, another such optimization is the generation of LDG.
Historically, ‘restrict’ was added by the C99 folks to make C performance competitive with Fortran, which has had a general no-aliasing requirement since Fortran 66 (other than for explicit aliasing with EQUIVALENCE) and therefore could optimize number-crunching code better. Why the C++ folks did not adopt it, I do not know. As far as I know it is not even part of C++11. But most C++ compilers offer it as an extension, thus ‘__restrict__’, meaning the symbol is in the compiler’s rather than the global namespace.
C/C++ HLL features are not designed for low-level control of a processor’s cache hierarchy. I have not checked recently, but how do compilers for x86 provide that control outside the use of inline assembly and intrinsics, the mechanisms available to CUDA programmers?
As an alternative to plastering __ldg() calls all over your code, have you tried the --restrict switch of nvcc? I have not tried it, but my expectation would be that it is similar to the “assume no aliasing” switches available on other compilers. Of course, that global “no aliasing” assertion better be true, or code will break.
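For what it’s worth, my understanding is that --restrict makes nvcc treat every kernel pointer parameter as if it were declared __restrict__, so a kernel written without explicit qualifiers, like the sketch below, would be compiled under the no-aliasing assumption (which had better actually hold):

```cuda
// Sketch: compiled with 'nvcc --restrict ...', the compiler assumes
// 'x' and 'y' do not alias even though neither parameter is
// explicitly qualified with __restrict__. If they do alias in some
// call, the generated code may be wrong.
__global__ void saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```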
Based on experiments I performed on various Kepler platforms, I would say that it is possible, but not likely, that the use of ‘const’ plus ‘__restrict__’ will lead to lower performance. In the few cases where I saw the performance drop slightly the reason was typically register pressure increase due to increased load re-ordering and load batching, not increased cache utilization.
I haven’t tried the “--restrict” switch. I’ll give it a shot soon.
This is actually mentioned in the documentation:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#restrict
@allanmc: That’s a good idea to only selectively use CI where needed. I think that might speed up my elementwise ops some. And maybe the gemm variants like NN where only one of the inputs has strided access. I remember trying this a while back and didn’t see any performance benefit. But I don’t think I was applying it cleverly enough at the time.