What's the difference between LD and LDG (load from generic memory vs. load from global memory)?

I was reading the CUDA Binary Utilities application note (DA-06762-001_v7.5 | September 2015).

I found two instructions that look similar: LD and LDG.
LD is for loading from generic memory and LDG is for loading from global memory.

I understand what global memory means,
but I do not know what generic memory refers to here.
Does it refer to all kinds of memory: constant, shared, local, and global?

1 Like

The difference between generic and non-generic addressing is explained in the PTX manual; take a look at the isspacep and cvta instructions there.

LD is a generic load. That means the “space” must be deduced from the supplied address.

LDS is load from shared space
LDC is load from constant space
LDG is load from global space

In addition, the various load types above may have implications for the actual path they take to memory. A generic load (LD) may go through the L1, whereas an LDG may go through a different cache, depending on the specific architecture and other details.
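For example, a small kernel that touches several state spaces (a sketch; the variable names are made up, and the exact SASS depends on the architecture and CUDA version) can be compiled with nvcc and inspected with cuobjdump --dump-sass to see which load instructions the compiler picks:

__constant__ int c_val;                      // constant space -> typically LDC

__global__ void loads(const int *g_in, int *g_out){

  __shared__ int s_buf[32];                  // shared space -> typically LDS/STS
                                             // (assumes a launch with 32 threads per block)

  s_buf[threadIdx.x] = g_in[threadIdx.x];    // load from global -> typically LDG (or LD.E)
  __syncthreads();

  // a pointer that may refer to either shared or global memory at run time;
  // the compiler generally has to fall back to a generic LD here
  const int *p = (threadIdx.x & 1) ? &g_in[threadIdx.x] : &s_buf[threadIdx.x];

  g_out[threadIdx.x] = *p + c_val;
}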

The documentation for the corresponding PTX instructions may be informative:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld

Note that specific interpretations of instructions may vary from architecture to architecture. For example, compare the description of LDG between Kepler:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#kepler

and Maxwell:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#maxwell

I believe the difference is significant. In other words, the compiler may emit LDG under different circumstances if compiling for Kepler vs. Maxwell.

1 Like

Regarding the load/store instructions and the path they may take, see also the answer to “CUDA __constant__ deference to global memory. Which cache?” on Stack Overflow.
Note that the tex path is only for loads.

1 Like

Thanks, all. Your posts are exactly what I needed.
I understand the generic load and the global load now.

1 Like

I believe you’ve already been given the clue to create your own example:

“If no state space is given, perform the load using generic addressing.”

So if we have a pointer like this:

__global__ void mykernel(int *data){

  int temp = *data;
}

Then the compiler is able to deduce with certainty that data must point to the global space.

Likewise:

__global__ void mykernel(){

  __shared__ int data[32];   // a fixed-size shared array (the size is arbitrary for this example)

  int temp = *data;

}

Then the compiler is able to deduce with certainty that data must point to the shared space.

But if you do this:

__global__ void mykernel(int *data1, bool use){

  __shared__ int data2[32];

  int *data3 = (use) ? data1 : data2;

  int temp = *data3;
}

Then the compiler is not able to deduce (at compile time) what space data3 will point to, and should emit (I think) a generic load for this case. I haven’t actually tried this or inspected the SASS yet, but I believe it would be instructive.

It’s also possible that, depending on the CUDA version and GPU architecture, the compiler may make different choices about whether a particular load/store will be done generically. What this means is that in some cases where you think the compiler should emit a specific operation (e.g. global), it may emit a generic operation instead. At least that is something I have witnessed.
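If you want to steer the compiler toward a non-generic global load, one option (on cc 3.5 and newer) is the __ldg() intrinsic, which requests a read-only load through the texture/read-only path and generally shows up as LDG in the SASS; decorating pointer parameters with const __restrict__ often has the same effect. A sketch (the names are made up), which you can verify with cuobjdump --dump-sass:

__global__ void copy_ro(const int * __restrict__ in, int * __restrict__ out, int n){

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n){
    // __ldg() asks for a read-only (texture-path) load; on cc 3.5+ this
    // generally compiles to an LDG instruction in the SASS
    out[i] = __ldg(&in[i]);
  }
}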

1 Like

A generic pointer is simply the CUDA-level representation of a C/C++ pointer, whose semantics are: a pointer is a pointer is a pointer.

Under the CUDA ABI, when a function is passed a pointer argument, it is always a generic pointer. This has to be so because with separate compilation, no assumptions are possible as to what memory space an incoming pointer belongs to. Clearly when the pointers are generic, the corresponding load instructions using these pointers (addresses) must be of the generic kind as well.

As a performance optimization, if program analysis proves that it is safe to do so, the compiler may convert a generic pointer into a memory-space-specific pointer. The most common case occurring in practice is pointers into the shared memory space. The second most common case is probably pointers into local memory.

In addition to possible differences in the performance characteristics of the different types of loads at the hardware level (mentioned earlier in this thread), the use of shared-memory-specific pointers also allows pointer arithmetic to be reduced to 32 bits, which reduces the number of instructions and registers required for pointer manipulation, given that GPUs are fundamentally 32-bit platforms.

These kinds of optimizations are machine specific and may differ between GPU architectures; they may also differ between CUDA versions.
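As an illustration of the ABI point (a sketch; the function and variable names are made up), a non-inlined __device__ function only ever sees a generic pointer, while direct use of a __shared__ array inside the kernel lets the compiler use shared-space loads with compact 32-bit addressing:

// Called through the ABI (e.g. __noinline__, or separate compilation with -rdc=true):
// the parameter is a generic pointer, so the loads inside are generic LDs.
__device__ __noinline__ int sum_two(const int *p){
  return p[0] + p[1];
}

__global__ void demo(const int *g_in, int *g_out){

  __shared__ int s_buf[64];                  // assumes a launch with 64 threads per block

  s_buf[threadIdx.x] = g_in[threadIdx.x];
  __syncthreads();

  // direct access to the __shared__ array: the compiler knows the state space
  // and can emit LDS with 32-bit address arithmetic
  int a = s_buf[threadIdx.x] + s_buf[(threadIdx.x + 1) % 64];

  // the same data accessed through a pointer parameter: the state space
  // information is lost at the call boundary, so generic loads are used
  int b = sum_two(s_buf);

  g_out[threadIdx.x] = a + b;
}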

3 Likes

Sorry to bring this old thread up. With the last example, can a CUDA generic pointer point to different address spaces in different threads within the same warp? For example, if the ‘use’ bool value depends on the thread id, then in some threads ‘data3’ points to global memory while in others it points to shared memory. How does this situation affect the actual data path mentioned above?

1 Like

Yes. If you modified that last example to first set use as something like:

use = (threadIdx.x&1);

then that would be the situation. The load would be dispatched to the LSU, but the handling in the LSU would diverge.

At a high level, the LSU is going to create transactions to service all needed activity from that instruction, issued warp-wide, including the threads that need shared access and the threads that need global access. If you’re asking for a detailed description of what happens inside the LSU, I won’t be able to provide that. However, the shared activity would be directed to shared memory, and the global activity would be directed through the L1/tex path already mentioned.
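Putting that modification into a complete kernel (a sketch; ‘out’ is a made-up output array, and the launch is assumed to use 32 threads per block):

__global__ void mixed_spaces(int *data1, int *out){

  __shared__ int data2[32];

  data2[threadIdx.x] = threadIdx.x;
  __syncthreads();

  bool use = (threadIdx.x & 1);          // odd lanes pick the global pointer,
  int *data3 = (use) ? data1 : data2;    // even lanes pick the shared pointer

  // one generic load is issued warp-wide; the LSU resolves each lane's address
  // to its state space (shared or global) when servicing the instruction
  out[threadIdx.x] = data3[threadIdx.x];
}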

2 Likes

I wonder about the performance difference between LD and LDS when it is guaranteed that the pointer points to shared memory. Identifying the performance difference is important because in some situations I write a __device__ function with a pointer parameter that is passed a shared memory pointer. However, the “shared” information is lost, since I cannot specify the shared property in the function parameters. If LD is much slower than LDS, are there better ways to pass shared memory pointers? (Here we assume that the __device__ function is not inlined.)

1 Like

I do not know what (if any) performance differences exist with modern hardware when switching between the use of a generic pointer and a logical address space specific pointer.

If this would lead to significant performance gains, one could imagine the compiler creating address-space-specific function clones for __device__ functions that are not inlined and are subject to separate compilation, and invoking the appropriate clone at run time. I am not aware of such functionality.

You could perform your own experiments regarding potential performance gains by using the instructions for address-space conversion provided at the PTX level:

The isspacep instruction is provided to query whether a generic address falls within a particular state space window. The cvta instruction converts addresses between generic and const, global, local, or shared state spaces.
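For example, isspacep can be wrapped in a small inline-PTX helper (a sketch assuming 64-bit device code; the wrapper name is made up — more recent CUDA versions also provide __isShared()/__isGlobal() address-space predicate functions that do essentially this):

__device__ bool points_to_shared(const void *ptr){

  unsigned int res;

  // isspacep.shared sets a predicate if the generic address falls inside the
  // shared state space window; selp turns the predicate into 0/1
  asm("{\n\t"
      ".reg .pred p;\n\t"
      "isspacep.shared p, %1;\n\t"
      "selp.u32 %0, 1, 0, p;\n\t"
      "}"
      : "=r"(res) : "l"(ptr));

  return res != 0;
}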

1 Like