How to correctly understand CUDA global memory vs. off-chip physical location?

Hi Forum,

I am studying the profiling guide (Kernel Profiling Guide :: Nsight Compute Documentation) and have some questions about the relationship between global memory and off-chip memory location.

In the hardware model chapter, it says:

Global memory is a 49-bit virtual address space that is mapped to physical memory on the device, pinned system memory, or peer memory.

and in the memory chart chapter, it says:

System Memory: Off-chip system (CPU) memory
Device Memory: On-chip device (GPU) memory of the CUDA device that executes the kernel
Peer Memory: On-chip device (GPU) memory of other CUDA devices

Do those two paragraphs mean that global memory maps to both off-chip and on-chip memory? (In one of my CUDA courses, the lecturer said that global and local memory are off-chip and therefore slow, which is why I am confused.) If so, does that mean that not ALL global memory accesses are slow (since device and peer memory are on-chip), and only the operations that read or write CPU system memory are slow (cudaMemset, cudaMemcpy, cudaMemcpyAsync, etc.)? Is there an approximate quantitative latency comparison, taking a register or shared memory access as 1 time unit?

Thank you so much for your time!

Best,
Chengzhe

Global memory is a “logical” space. That means its primary definition is from a programming perspective. In this respect, it is distinguished by asking questions like “what data is in the global space” or “what allocators create allocations in global memory”. The logical definition is part of the GPU programming model that CUDA defines.

A logical space can have data that is physically resident in, or physically “backed” by, more than one possible physical resource. System memory, device memory, and peer memory are mainly physical resources. They are actual hardware in which data may be stored.

Yes, global space can map to multiple different places.

When you allocate with cudaMalloc, cudaMallocManaged, or cudaHostAlloc, all those allocations belong to the logical global space. But their physical “backing” may be different, and in the case of cudaMallocManaged, variable in time.
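To make that concrete, here is a minimal sketch (not taken from the documentation) in which one kernel dereferences pointers from all three allocators, and cudaPointerGetAttributes reports how the runtime classifies each one. The helper names `increment` and `describe` are made up for illustration, error checking is mostly omitted, and it assumes a device that supports mapped pinned memory and managed memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The same kernel works on any pointer in the logical global space.
__global__ void increment(int *p) { *p += 1; }

// Hypothetical helper: print how the runtime classifies a pointer.
static void describe(const char *name, const void *p) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
        printf("%s: query failed\n", name);
        return;
    }
    const char *type = attr.type == cudaMemoryTypeDevice  ? "device"
                     : attr.type == cudaMemoryTypeHost    ? "host (pinned)"
                     : attr.type == cudaMemoryTypeManaged ? "managed"
                                                          : "unregistered";
    printf("%s: %s\n", name, type);
}

int main() {
    int *dev = nullptr, *managed = nullptr, *pinned = nullptr, *pinnedDev = nullptr;

    cudaMalloc((void **)&dev, sizeof(int));                 // backed by device memory
    cudaMallocManaged((void **)&managed, sizeof(int));      // backing can migrate over time
    cudaHostAlloc((void **)&pinned, sizeof(int), cudaHostAllocMapped);  // backed by system memory
    cudaHostGetDevicePointer((void **)&pinnedDev, pinned, 0);           // device-side alias

    cudaMemset(dev, 0, sizeof(int));
    *managed = 0;
    *pinned = 0;

    describe("cudaMalloc", dev);
    describe("cudaMallocManaged", managed);
    describe("cudaHostAlloc", pinned);

    // Identical device code, three different physical backings.
    increment<<<1, 1>>>(dev);
    increment<<<1, 1>>>(managed);
    increment<<<1, 1>>>(pinnedDev);
    cudaDeviceSynchronize();

    int fromDev = 0;
    cudaMemcpy(&fromDev, dev, sizeof(int), cudaMemcpyDeviceToHost);
    printf("values after kernel: %d %d %d\n", fromDev, *managed, *pinned);

    cudaFree(dev);
    cudaFree(managed);
    cudaFreeHost(pinned);
    return 0;
}
```

The kernel does not know or care where the data physically lives; only the latency and bandwidth of each access differ.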

Looking at it from the device memory side instead, both global and local data can be physically backed by device memory.

I would not say that all global memory access is slow. For example, the caches play a role in certain global memory access. If data is in the cache, it is generally not as slow to access as if it were only in device memory.

If a register access requires 1 unit, the latency to access device memory is typically in the range of hundreds of units (100 or more). Detailed measurements are available in various benchmarking reports.
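If you want a rough number for your own GPU, a common approach is a pointer-chasing microbenchmark, where each load depends on the previous one so the latencies cannot overlap. The following is only a sketch under assumptions I picked myself (a 256 MB buffer, a 4 KB stride, 10000 steps, and the kernel name `chase`); the result mixes in cache and TLB effects, so treat it as an order-of-magnitude estimate rather than a precise measurement.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread follows a chain of indices; every load depends on the previous one.
__global__ void chase(const unsigned *next, int steps,
                      long long *cycles, unsigned *sink) {
    unsigned idx = 0;
    long long start = clock64();
    for (int i = 0; i < steps; ++i) {
        idx = next[idx];              // dependent load from global memory
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;                      // keeps the loop from being optimized away
}

int main() {
    // Assumed sizes: chosen so the touched lines are spread over far more
    // memory than a typical L2 cache holds.
    const size_t n = 64u * 1024u * 1024u;   // 64M elements = 256 MB
    const unsigned stride = 1024;           // 1024 * 4 bytes = 4 KB between loads
    const int steps = 10000;

    // Build a simple strided cycle: element i points to (i + stride) mod n.
    std::vector<unsigned> host(n);
    for (size_t i = 0; i < n; ++i) {
        host[i] = static_cast<unsigned>((i + stride) % n);
    }

    unsigned *next = nullptr, *sink = nullptr;
    long long *cycles = nullptr;
    cudaMalloc((void **)&next, n * sizeof(unsigned));
    cudaMalloc((void **)&sink, sizeof(unsigned));
    cudaMalloc((void **)&cycles, sizeof(long long));
    cudaMemcpy(next, host.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(next, steps, cycles, sink);
    cudaDeviceSynchronize();

    long long elapsed = 0;
    cudaMemcpy(&elapsed, cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("approx. cycles per dependent load: %.1f\n",
           static_cast<double>(elapsed) / steps);

    cudaFree(next);
    cudaFree(sink);
    cudaFree(cycles);
    return 0;
}
```

Shrinking the buffer until it fits in L2 or L1 and rerunning should show the per-load cost dropping, which is the cache effect described above.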

From a device code perspective (that is, if the accesses originate in device code), the latency to access system memory is also quite long. I don’t know the exact latency, but the bandwidth is also fairly low because of the intervening PCIe bus.

In addition to Robert’s coverage, a useful outline of where the device-related memory hierarchy sits is here.

Thank you Robert and rs277! Your explanations are super clear!

I have a follow-up question. In the link provided by rs277, the table says that the logical global memory space is “off-chip”. How should I understand “off-chip” for a logical space? For physical memory such as device and peer memory, on-chip means it is on the GPU chip, and for pinned system memory, off-chip means it is on the CPU side. But how can a logical space be defined as off-chip, given that it can be mapped to both on-chip and off-chip physical memory?

Thank you!

Best,
Chengzhe

If we ignore the caches, I suppose the physical backing for global memory is off chip. That could be one possible interpretation.

I see… Can we say that logical global memory is called “off-chip” because, compared to real on-chip memory (shared memory, L1, registers), the data fetch latency is quite high (for example, when both L1 and L2 miss, as when an address is accessed for the first time)? So although global and local memory can map to both physically on-chip and off-chip memory, we still think of them as “logically” off-chip?