Hi everyone, I’ve recently started using Nsight Compute to profile my CUDA programs, but I have some questions about its “Memory Workload Analysis” section:
1. What’s the difference between “global memory” and “device memory”? I’ve checked the Nsight Compute Kernel Profiling Guide and learnt that global memory is a logical concept, so it must have a corresponding physical unit, which I assumed was “device memory”. BUT as stated in the Kernel Profiling Guide, “Device Memory is On-chip device (GPU) memory of the CUDA device that executes the kernel”, whereas I think global memory is off-chip. I experimented with a kernel in Nsight Compute and got the following result. As you can see, the data traffic between L1/TEX and L2 is quite different from the traffic between L2 and Device Memory. Moreover, the former is flagged as having severely uncoalesced memory accesses, while the latter is not.
It’s frustrating, as I could not find any useful references. Could anyone help me clarify this?
I would have said there’s no difference, if the definition of “On-chip” means resident on the GPU die. The definitions in the Nsight Compute documentation seem ambiguous. In addition to the Device Memory quote you posted, “Peer Memory” is also described as “On-chip” memory of other devices.
Maybe “Device DRAM” would be a better description.
The “Best Practices Guide” depicts the memory hierarchy here, which gives a clear indication of “On-chip” and “DRAM”.
Device memory means the DRAM attached to a GPU, i.e. the memory that is accessed over the GPU’s external memory bus. It can be thought of as a “physical” space.
“global memory” is a logical space. It is the memory you get when you do a host cudaMalloc operation, a device malloc operation, or a cudaHostAlloc operation. All 3 of these types of allocations live in the logical global space. “global” memory is distinguished from the other common logical spaces, “local”, “shared”, and “constant”.
Note that host memory, if allocated via a pinned allocator such as cudaHostAlloc, lives in the logical global space and is “global memory”.
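To make the distinction concrete, here is a minimal sketch (not taken from the posts above) in which the same kernel is launched on two pointers that both live in the logical global space but have different physical backings. The kernel name `increment` and the buffer names are hypothetical, error checking is omitted, and it assumes a system where mapped pinned memory is available (unified virtual addressing).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The kernel only sees a pointer into the logical global space;
// it cannot tell which physical memory backs that pointer.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 256;
    const size_t bytes = n * sizeof(int);

    // Global memory backed by device memory (the DRAM attached to the GPU).
    int *dev_buf = nullptr;
    cudaMalloc((void **)&dev_buf, bytes);
    cudaMemset(dev_buf, 0, bytes);

    // Global memory backed by pinned host memory, mapped into the GPU address space.
    int *host_buf = nullptr;
    cudaHostAlloc((void **)&host_buf, bytes, cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) host_buf[i] = 0;

    // Device-side alias of the pinned host allocation.
    int *host_buf_dev = nullptr;
    cudaHostGetDevicePointer((void **)&host_buf_dev, host_buf, 0);

    increment<<<1, n>>>(dev_buf, n);      // loads/stores end up in device memory
    increment<<<1, n>>>(host_buf_dev, n); // loads/stores go to host (system) memory
    cudaDeviceSynchronize();

    int first = 0;
    cudaMemcpy(&first, dev_buf, sizeof(int), cudaMemcpyDeviceToHost);
    printf("device-backed: %d, host-backed: %d\n", first, host_buf[0]);

    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```

If you profile both launches, I would expect the second one to show its traffic going to system memory rather than device memory in the memory chart, even though both kernels are issuing ordinary global loads and stores.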
Device memory is also one of the possible backings for the logical “local” space. Device memory can be off-chip in the case of discrete GPUs, and it can be “on-chip” in the case of Jetson devices. I personally don’t think on-chip vs. off-chip is a defining distinction for device memory. It can reside on-chip, and it can reside off-chip.
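As a rough illustration of device memory backing the logical “local” space, the sketch below uses a per-thread array that is indexed with a value only known at run time. The compiler will typically place such an array in local memory, although that is its decision and not guaranteed. The kernel name `local_demo` and all buffer names are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void local_demo(const int *idx, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Per-thread array with a run-time index: usually placed in the
    // logical local space, which is physically backed by device memory.
    int scratch[64];
    for (int k = 0; k < 64; ++k) scratch[k] = k + i;

    out[i] = scratch[idx[i] & 63];
}

int main()
{
    const int n = 128;
    int h_idx[n], h_out[n];
    for (int i = 0; i < n; ++i) h_idx[i] = i * 7;

    int *d_idx = nullptr, *d_out = nullptr;
    cudaMalloc((void **)&d_idx, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));
    cudaMemcpy(d_idx, h_idx, n * sizeof(int), cudaMemcpyHostToDevice);

    local_demo<<<1, n>>>(d_idx, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[1] = %d\n", h_out[1]);

    cudaFree(d_idx);
    cudaFree(d_out);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` reports the per-thread stack frame size, which is one way to check whether the array actually ended up in local memory on your toolchain.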
The 3 entities on the right hand side of that diagram could be thought of as physical backings. The entities all the way on the left can be thought of as distinct logical spaces.
Thank you for clarifying. May I summarize my understanding of your reply?
Global memory is a logical space. It may correspond to physical memory backings on the device or on the host.
Device memory means the DRAM attached to the GPU. It can be the physical backing for “global memory”, as well as for other logical memory spaces such as “local memory”, “constant memory”, “texture memory”, and “surface memory”.
Technically, global memory and device memory are not equivalent.
It would be great if I have understood this correctly. Thank you again.