What's different between LD and LDG (load from generic memory vs. load from global memory)

I was reading the CUDA Binary Utilities application note,
version DA-06762-001_v7.5 | September 2015.

I found two instructions that are similar: LD and LDG.
LD is for loading from generic memory and LDG is for loading from global memory.

I understand what global memory means,
but I do not know what generic memory refers to here.
Does it refer to all kinds of memory: const, shared, local, global?

The difference is between generic and non-generic addressing.

Take a look at the isspacep and cvta instructions in the PTX manual.

LD is a generic load. That means the “space” must be deduced from the supplied address.

LDS is load from shared space
LDC is load from constant space
LDG is load from global space
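
To see the generic-vs-specific distinction at the source level, here is a minimal sketch. It assumes the address space predicate intrinsics __isGlobal() and __isShared() (which map to the PTX isspacep instruction); check the programming guide for your CUDA version:

#include <cstdio>

__global__ void where_is_it(int *p){

  __shared__ int s[32];

  // p arrives as a generic pointer; isspacep lets the hardware test which space it falls in
  if (__isGlobal(p)) printf("p points into the global space\n");
  if (__isShared(s)) printf("s points into the shared space\n");
}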

In addition, the various load types above may have implications about the actual path they take to memory. A generic load (LD) may go through the L1, whereas an LDG may go through a different cache, depending on the specific architecture and other details.
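
As a concrete example of steering the path: on cc 3.5+ devices the __ldg() intrinsic requests a load through the read-only (texture) data cache, which typically shows up as LDG in the SASS. A minimal sketch:

__global__ void copy_kernel(const int * __restrict__ in, int * __restrict__ out, int n){

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n)
    out[i] = __ldg(&in[i]);   // read-only data cache load, typically emitted as LDG
}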

The documentation for the corresponding PTX instructions may be informative:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld

Note that specific interpretations of instructions may vary from architecture to architecture. For example, compare the description of LDG between Kepler:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#kepler

and Maxwell:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#maxwell

I believe the difference is significant. In other words, the compiler may emit LDG under different circumstances if compiling for Kepler vs. Maxwell.

Regarding the load/store instructions and the path they may take, see also the answer at http://stackoverflow.com/questions/34170310/cuda-constant-deference-to-global-memory-which-cache
Note that the tex path is only for loads.

Thanks, all. Your posts are exactly what I needed.
I understand the generic load and the global load now.

I believe you’ve already been given the clue to create your own example:

“If no state space is given, perform the load using generic addressing.”

So if we have a pointer like this:

__global__ void mykernel(int *data){

  int temp = *data;
}

Then the compiler is able to deduce with certainty that data must point to the global space.

Likewise:

__global__ void mykernel(){

  __shared__ int data[32];

  int temp = *data;

}

Then the compiler is able to deduce with certainty that data must point to the shared space.

But if you do this:

__global__ void mykernel(int *data1, bool use){

  __shared__ int data2[32];

  int *data3 = (use)?data1:data2;

  int temp = *data3;
}

Then the compiler is not able to deduce (at compile time) what space data3 will point to, and should emit (I think) a generic load for this case. I haven’t actually tried this and inspected the SASS yet, but I believe it would be instructive.
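
If you want to check, a variant like the one below should keep the loads from being optimized away so they are visible in the SASS (a sketch; the flags shown are just one possible invocation and may need adjusting for your toolkit):

// t.cu
__global__ void mykernel(int *data1, int *out, bool use){

  __shared__ int data2[32];

  data2[threadIdx.x] = threadIdx.x;
  __syncthreads();

  int *data3 = (use) ? data1 : data2;

  out[threadIdx.x] = *data3;   // expect a generic LD here, since the space is unknown at compile time
}

// build and inspect, e.g.:
//   nvcc -arch=sm_35 -cubin t.cu -o t.cubin
//   cuobjdump -sass t.cubin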

It’s also possible that depending on CUDA version and GPU architecture, the compiler may make different choices about whether a particular load/store will be done generically. What this means is that in some cases where you think the compiler should emit a specific operation (e.g. global) it may emit a generic operation instead. At least that is something I have witnessed.

A generic pointer is simply the CUDA-level representation of a C/C++ pointer, whose semantics are: a pointer is a pointer is a pointer.

Under the CUDA ABI, when a function is passed a pointer argument, it is always a generic pointer. This has to be so because with separate compilation, no assumptions are possible as to what memory space an incoming pointer belongs to. Clearly when the pointers are generic, the corresponding load instructions using these pointers (addresses) must be of the generic kind as well.
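
For instance, with separate compilation (nvcc -rdc=true) the body below must use a generic load, because the caller, and hence the space of p, is unknown when this translation unit is compiled; a sketch:

// compiled with -rdc=true; callers live in other translation units
__device__ int read_one(const int *p){

  return *p;   // p is a generic pointer under the ABI, so this is a generic LD
}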

As a performance optimization, if program analysis proves that it is safe to do so, the compiler may convert a generic pointer into a memory-space-specific pointer. The most common case occurring in practice is pointers to the shared memory space; the second most common is probably pointers to local memory.

In addition to possible differences in the performance characteristics of different types of loads at the hardware level (mentioned earlier in this thread), use of shared-memory-specific pointers also allows pointer arithmetic to be reduced to 32 bits, which reduces the number of instructions and registers required for pointer manipulations, given that GPUs are fundamentally 32-bit platforms.
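
As a sketch of the shared-memory case: the compiler can prove that the address below lies in the shared window, so it can emit LDS and keep the index arithmetic in 32 bits:

__global__ void mykernel(int *out, int offset){

  __shared__ int data[256];

  data[threadIdx.x] = threadIdx.x;
  __syncthreads();

  out[threadIdx.x] = data[(threadIdx.x + offset) & 255];   // expect LDS with a 32-bit offset
}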

These kinds of optimizations are machine-specific and may differ between GPU architectures; they may also differ between CUDA versions.