Hi Forum,
I am reading the Kernel Profiling Guide in the Nsight Compute documentation and have some questions about the relationship between global memory and off-chip memory.
In the hardware model chapter, it says:
Global memory is a 49-bit virtual address space that is mapped to physical memory on the device, pinned system memory, or peer memory.
and in the memory chart chapter, it says:
System Memory: Off-chip system (CPU) memory
Device Memory: On-chip device (GPU) memory of the CUDA device that executes the kernel
Peer Memory: On-chip device (GPU) memory of other CUDA devices
Do these two passages mean that global memory maps to both off-chip and on-chip memory? (One of my CUDA courses says that global and local memory are off-chip and therefore slow, which is why I am confused.)
If so, does that mean that not ALL global memory accesses are slow (since device and peer memory are on-chip), and that only the operations that read or write CPU system memory (cudaMemset, cudaMemcpy, cudaMemcpyAsync, etc.) are slow?
Is there an approximate quantitative latency comparison, taking a register or shared memory access as 1 time unit?
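To make the question concrete, here is a minimal sketch of what I understand the hardware model sentence to mean: the same kernel dereferences one pointer backed by device memory (cudaMalloc) and one backed by pinned, mapped system memory (cudaHostAlloc with cudaHostAllocMapped). The kernel name addOne, the helper timeAddOne, and the buffer size are my own hypothetical choices, not something from the guide. Note that the events time a whole kernel launch, so this would only hint at throughput differences, not at per-access latency.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Hypothetical kernel: each thread does one global-memory load and one store.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Time a single launch of addOne on the given pointer, in milliseconds.
static float timeAddOne(int *ptr, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start));
    addOne<<<blocks, threads>>>(ptr, n);
    CHECK(cudaEventRecord(stop));
    CHECK(cudaGetLastError());
    CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for illustration
    const size_t bytes = n * sizeof(int);

    // Allow pinned host allocations to be mapped into the device address space.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Global memory backed by device memory.
    int *dDev = nullptr;
    CHECK(cudaMalloc((void **)&dDev, bytes));
    CHECK(cudaMemset(dDev, 0, bytes));

    // Global memory backed by pinned system memory.
    int *hPinned = nullptr;
    int *dPinned = nullptr;
    CHECK(cudaHostAlloc((void **)&hPinned, bytes, cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) hPinned[i] = 0;
    CHECK(cudaHostGetDevicePointer((void **)&dPinned, hPinned, 0));

    // Warm-up launches so the timed runs exclude one-time startup cost.
    timeAddOne(dDev, n);
    timeAddOne(dPinned, n);

    printf("device memory:        %.3f ms\n", timeAddOne(dDev, n));
    printf("pinned system memory: %.3f ms\n", timeAddOne(dPinned, n));

    CHECK(cudaFree(dDev));
    CHECK(cudaFreeHost(hPinned));
    return 0;
}
```

I would expect this to build with nvcc as a single .cu file. Is comparing the two timings a reasonable way to see the difference, or should I look at the memory chart in Nsight Compute instead?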
Thank you so much for your time!
Best,
Chengzhe