I have been confused by the terms used in the GPU memory hierarchy, because I read different documents and see different architectures.
Aside from the unified (L1/tex) and L2 caches, what is the relation between DRAM, shared, device, global, and host memories? Device and host memories seem clear, but I want to be sure about them. Can someone briefly explain this for dummies?
I forgot to mention local and system memory!
There are two ways to view memory:
- as physical resources
- as logical resources
A physical resource refers to an actual memory resource, i.e. a chip or a portion of a chip that stores things.
A logical resource refers to the programming model - what the programmer can use when they need to store things.
-
DRAM is a physical resource only. It is not on the GPU die itself; instead, it is housed in DRAM chips sitting near the GPU.
-
global is a logical resource (a logical “space”). The global space can target data living in a variety of places, including caches, DRAM, and even system memory. One example: when you allocate data using cudaMalloc(), the pointer to that data, if used in device code, refers to the logical global space. The data itself lives in DRAM (and, for example, it may also be temporarily resident in a cache).
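A minimal sketch of that (the kernel and identifiers are mine, for illustration):

```
#include <cuda_runtime.h>

// "d_data", when dereferenced in device code, refers to the logical global space.
__global__ void increment(int *d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d_data[idx] += 1;  // global-space access, physically backed by DRAM (and caches)
}

int main() {
    const int n = 1024;
    int *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));   // the allocation physically lives in DRAM
    cudaMemset(d_data, 0, n * sizeof(int));
    increment<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```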
-
shared is both a logical resource and a physical resource. Logically, it is the space used/referred to when the __shared__ decorator is used; it is limited to a maximum of 48KB per threadblock(*). Physically, it is a resource on the GPU die itself. A minimal sketch of it in use follows.
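(The kernel here is mine, not from the original discussion.)

```
// Reverse 64 ints within a single threadblock, staging through shared memory.
__global__ void reverse64(int *d_data) {
    __shared__ int tile[64];   // logical shared space; physically on-die
    int t = threadIdx.x;
    tile[t] = d_data[t];
    __syncthreads();           // every thread in the block now sees the same tile
    d_data[t] = tile[63 - t];
}
// launched as: reverse64<<<1, 64>>>(d_data);
```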
-
system refers to memory on the host (CPU). Such memory is accessible in the global space when, for example, pinned memory is used and the pinned pointer is passed to GPU code for direct access.
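A sketch of one way to do that, assuming a mapped pinned allocation (the identifiers are mine):

```
#include <cuda_runtime.h>

__global__ void touch(int *p) { p[threadIdx.x] = threadIdx.x; }

int main() {
    int *h_pinned = nullptr;
    // Pinned (page-locked) host allocation; cudaHostAllocMapped makes it
    // addressable from device code.
    cudaHostAlloc(&h_pinned, 64 * sizeof(int), cudaHostAllocMapped);

    int *d_alias = nullptr;
    cudaHostGetDevicePointer(&d_alias, h_pinned, 0);  // device-usable pointer

    touch<<<1, 64>>>(d_alias);  // kernel writes land directly in host (system) memory
    cudaDeviceSynchronize();

    cudaFreeHost(h_pinned);
    return 0;
}
```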
-
local refers to a logical space that is per-thread. In thread code, if I create a new variable with, for example, int a; that variable lives in the logical local space. Each thread has its own copy of the a variable. The physical backing for this space is roughly the same as the physical backing for the logical global space (starting with the L1 cache and ultimately ending in DRAM).
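As a tiny sketch (illustrative only):

```
__global__ void local_demo(int *d_out) {
    int a = threadIdx.x;  // "a" lives in the logical local space; each thread
                          // has its own copy (typically held in a register,
                          // spilling toward L1/L2/DRAM if needed)
    a *= 2;
    d_out[threadIdx.x] = a;
}
```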
A picture is worth a thousand words. This answer has a picture lifted from the Nsight VSE user guide:
cuda - nvprof option for bandwidth - Stack Overflow
Although it refers to a somewhat different programming model, the PTX guide contains a summary of various logical spaces which may aid understanding:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#state-spaces-types-and-variables
(*) some newer GPUs can actually access more than 48KB per threadblock
Thank you very much. The picture was really helpful.
One more question. Is it meaningful to have a large dram_read_bytes (49,000,000) and dram_write_bytes (14,000,000) while dram_utilization is low (1)?
Or does that imply that dram_*_bytes are not so large? For example, the M2000 has 4GB of memory, and ~49MB is nothing. Do you agree with that?
49MB and 14MB, 63MB total. You’re saying those are large; I claim they may not be. But to understand this, it’s not appropriate to think about these numbers just by themselves, but about what they may imply in terms of bandwidth.
The utilization metric is a measure of the percentage of available bandwidth that has been consumed.
Bandwidth is bytes divided by time. So what we don’t know in your question is the time period over which these read and write bytes were measured.
Roughly speaking, Quadro M2000 has about 100GB/s of available DRAM bandwidth. (just google quadro m2000 memory bandwidth). If the profiler is reporting a 1 for the utilization of it, it means that over the course of the measurement of these metrics, the average bandwidth was 10GB/s or lower.
So what minimum kernel duration would a report of 1 require, if the metric measures 63MB? That duration is just 63MB/10GB/s = 6.3ms (approximately). So if your kernel duration was 6.3ms or longer when this metric was reported, then it’s completely predictable that the profiler would report 1 for DRAM utilization.
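Spelled out, with the same numbers:

$$t_{\min} = \frac{49\,\mathrm{MB} + 14\,\mathrm{MB}}{10\,\mathrm{GB/s}} = \frac{63\times10^{6}\,\mathrm{B}}{10\times10^{9}\,\mathrm{B/s}} \approx 6.3\,\mathrm{ms}$$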
This is how I would view it. If you come back and say “but my kernel duration was…” I probably won’t be able to respond to that. Certainly a kernel duration longer than 6.3ms should also result in a 1 measurement. The profiler might also report a 1 for somewhat smaller durations, in the 1-6ms range, for many reasons that I wouldn’t be able to explain. If your kernel was actually running in 50us, on the other hand, then a 1 reported by the profiler would be quite surprising, assuming your read bytes and write bytes are correct and measured in the same setting. Even then, I probably can’t explain something like that without a complete test case to work with.
Thanks for the explanation.