How to correctly understand CUDA global memory vs. off-chip physical location?

czxu · December 11, 2023, 6:04pm

Hi Forum,

I am working through the Kernel Profiling Guide in the Nsight Compute documentation and have some questions about the relationship between global memory and off-chip memory.

In the hardware model chapter, it says:

Global memory is a 49-bit virtual address space that is mapped to physical memory on the device, pinned system memory, or peer memory.

and in the memory chart chapter, it says:

System Memory: Off-chip system (CPU) memory
Device Memory: On-chip device (GPU) memory of the CUDA device that executes the kernel
Peer Memory: On-chip device (GPU) memory of other CUDA devices

Do those two paragraphs mean that global memory maps to both off-chip and on-chip memory? (One of my CUDA courses says that global and local memory are off-chip and therefore slow, which is why I am confused.) If so, does that mean not ALL global memory access is slow, since device and peer memory are on-chip, and only the operations that read or write CPU system memory (cudaMemset, cudaMemcpy, cudaMemcpyAsync, etc.) are slow? Is there an approximate quantitative latency comparison, taking a register or shared memory access as 1 time unit?

Thank you so much for your time!

Best,
Chengzhe

Robert_Crovella · December 11, 2023, 6:59pm

Global memory is a “logical” space. That means its primary definition is from a programming perspective. In this respect, it is distinguished by asking questions like “what data is in the global space” or “what allocators create allocations in global memory”. The logical definition is part of the GPU programming model that CUDA defines.

A logical space can have data that is physically resident in, or physically “backed” by, more than one possible physical resource. System memory, device memory, and peer memory are mainly physical resources. They are actual hardware that may hold data, stored in hardware memory.

Yes, global space can map to multiple different places.

When you allocate with cudaMalloc, cudaMallocManaged, or cudaHostAlloc, all those allocations belong to the logical global space. But their physical “backing” may be different, and in the case of cudaMallocManaged, variable in time.
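As a rough illustration of the point above, the sketch below allocates with each of the three allocators and asks the runtime where each pointer is currently backed, using cudaPointerGetAttributes. The helper names memoryTypeName and report are made up for this example, and the `type` field assumes CUDA 10 or newer. All three pointers are usable as global-space addresses in device code, yet the reported backing differs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the runtime's memory-type enum into a label.
static const char* memoryTypeName(cudaMemoryType type) {
    switch (type) {
        case cudaMemoryTypeHost:    return "host (pinned system memory)";
        case cudaMemoryTypeDevice:  return "device";
        case cudaMemoryTypeManaged: return "managed (backing may migrate)";
        default:                    return "unregistered";
    }
}

// Hypothetical helper: print what the runtime reports for one pointer.
static void report(const char* label, const void* ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        printf("%-18s -> error: %s\n", label, cudaGetErrorString(err));
        return;
    }
    printf("%-18s -> %s\n", label, memoryTypeName(attr.type));
}

int main() {
    const size_t bytes = 1 << 20;
    void* dev = nullptr;
    void* managed = nullptr;
    void* pinned = nullptr;

    // Three allocators, one logical (global) space, different physical backing.
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return 1;
    if (cudaMallocManaged(&managed, bytes) != cudaSuccess) return 1;
    if (cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault) != cudaSuccess) return 1;

    report("cudaMalloc", dev);
    report("cudaMallocManaged", managed);
    report("cudaHostAlloc", pinned);

    cudaFree(dev);
    cudaFree(managed);
    cudaFreeHost(pinned);
    return 0;
}
```

For the managed allocation the runtime reports only that it is managed; where the pages physically sit at a given moment depends on which processor touched them last.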

If we think of device memory, alternatively, we can say that it is possible for both global and local data to be physically backed in device memory.

I would not say that all global memory access is slow. For example, the caches play a role in certain global memory access. If data is in the cache, it is generally not as slow to access as if it were only in device memory.

If a register access requires 1 unit, the latency to access device memory is typically in the range of hundreds of units (100 or more). Detailed measurements are available in various benchmarking reports.
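For anyone who wants to get a feel for that number on their own GPU, one common approach in such benchmarking reports is pointer chasing: a single thread follows a chain of dependent loads and times it with clock64(). The sketch below is only a rough version of that idea, not a calibrated benchmark. The names chase and buildChain, the array size, and the stride are all assumptions chosen for illustration, and the result includes loop overhead and possibly TLB effects, so treat it as an order-of-magnitude estimate.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread follows a chain of dependent loads: each load's address comes
// from the previous load, so the loads cannot overlap.
__global__ void chase(const unsigned* next, int steps,
                      long long* cycles, unsigned* sink) {
    unsigned idx = 0;
    long long start = clock64();   // per-SM cycle counter
    for (int i = 0; i < steps; ++i) {
        idx = next[idx];
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = idx;                   // keep the loads from being optimized away
}

// Hypothetical helper: element i points to element (i + stride) mod n.
static std::vector<unsigned> buildChain(size_t n, size_t stride) {
    std::vector<unsigned> next(n);
    for (size_t i = 0; i < n; ++i) {
        next[i] = static_cast<unsigned>((i + stride) % n);
    }
    return next;
}

int main() {
    // Assumed sizes: a 256 MiB array (larger than many L2 caches) and a
    // 4 KiB stride so that every load touches a line not visited before.
    const size_t n = size_t(1) << 26;
    const size_t stride = 1024;
    const int steps = 4096;

    std::vector<unsigned> host = buildChain(n, stride);

    unsigned* next = nullptr;
    long long* cycles = nullptr;
    unsigned* sink = nullptr;
    if (cudaMalloc(&next, n * sizeof(unsigned)) != cudaSuccess) return 1;
    if (cudaMalloc(&cycles, sizeof(long long)) != cudaSuccess) return 1;
    if (cudaMalloc(&sink, sizeof(unsigned)) != cudaSuccess) return 1;
    cudaMemcpy(next, host.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);

    chase<<<1, 1>>>(next, steps, cycles, sink);

    long long elapsed = 0;
    if (cudaMemcpy(&elapsed, cycles, sizeof(elapsed),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    printf("about %.1f cycles per dependent load\n",
           static_cast<double>(elapsed) / steps);

    cudaFree(next);
    cudaFree(cycles);
    cudaFree(sink);
    return 0;
}
```

Shrinking the array and stride so the whole chain fits in cache, then running the kernel twice, should show the cached case for comparison.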

From a device code perspective (that is, if the accesses originate in device code), the latency to access system memory is also high. I don’t know the exact figure, and the bandwidth is also fairly low because the accesses have to cross the intervening PCIe bus.
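To make "accesses originating in device code" concrete, here is a minimal sketch in which a kernel reads and writes pinned system memory directly, with no cudaMemcpy involved. It assumes a device that supports mapped pinned memory; the kernel name addOne and the helper allEqual are made up for the example. Every load and store in the kernel is an ordinary global-space access, but it is physically serviced from CPU memory over the bus.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ordinary global-memory loads and stores; the pointer happens to be
// backed by pinned system memory rather than device memory.
__global__ void addOne(float* data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Hypothetical helper: check that every element has the expected value.
static bool allEqual(const float* data, size_t n, float expected) {
    for (size_t i = 0; i < n; ++i) {
        if (data[i] != expected) return false;
    }
    return true;
}

int main() {
    const size_t n = 1 << 20;
    float* hostPtr = nullptr;   // CPU-side view of the pinned allocation
    float* devView = nullptr;   // device-side address of the same memory

    // Pinned system memory, mapped into the device's address space.
    if (cudaHostAlloc(reinterpret_cast<void**>(&hostPtr), n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return 1;
    for (size_t i = 0; i < n; ++i) hostPtr[i] = 1.0f;

    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&devView),
                                 hostPtr, 0) != cudaSuccess) return 1;

    const unsigned threads = 256;
    const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);
    addOne<<<blocks, threads>>>(devView, n);
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    // The CPU sees the kernel's writes in the same allocation.
    printf("%s\n", allEqual(hostPtr, n, 2.0f) ? "host sees updated values"
                                              : "mismatch");
    cudaFreeHost(hostPtr);
    return 0;
}
```

Timing this kernel against the same kernel run on a cudaMalloc buffer is a simple way to see the bandwidth gap on a particular system.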

rs277 · December 12, 2023, 12:18am

In addition to Robert’s coverage, a useful outline of where the device-related memory hierarchy sits is here.

czxu · December 12, 2023, 5:58am

Thank you Robert and rs277! Your explanations are super clear!

I have a follow-up question. In the table at the link provided by rs277, the logical global memory space is listed as “off-chip”. How should “off-chip” be understood for a logical space? For physical memory, on-chip means the memory is on the GPU chip (device and peer memory), and off-chip means it is on the CPU side (pinned system memory). But how can a logical space be called off-chip when it can be mapped to both on-chip and off-chip physical memory?

Thank you!

Best,
Chengzhe

Robert_Crovella · December 12, 2023, 3:09pm

If we ignore the caches, I suppose the physical backing for global memory is off chip. That could be one possible interpretation.

czxu · December 12, 2023, 5:28pm

I see… Can we say that logical global memory is called “off-chip” in comparison with truly on-chip memory (shared memory, L1, registers), because the data fetch latency is quite high when both L1 and L2 miss (for example, when an address is accessed for the first time)? So although global and local memory can map to both physically on-chip and off-chip memory, we still think of them as “logically” off-chip?
