From reading the CUDA programming guide and the CUDA SC07 “tutorial” I see references to various types/classes of memory. I have a few questions about understanding the use of what I think are the four acknowledged memory types: ‘shared’, ‘constant’, ‘texture’, and ‘global’:
– Does cudaMemcpy() copy memory to/from “shared” device memory, i.e. can it not be used to reference ‘global’, ‘constant’, or ‘texture’ memory?
– Does cudaMemcpyToSymbol() copy memory to ‘global’ and ‘constant’ memory locations, i.e. is it mutually exclusive with cudaMemcpy() above?
– Is ‘device memory’ referenced in the text the same as the ‘global memory’ in Fig 3-1 of the CUDA Programming Guide 2.0 ?
– What is “local” memory? It is referenced in the text but does not appear in the figures, and it seems to share the huge latency problems of ‘global’ memory.
– Which of these memory types is “on chip” and which is “off chip”?
– What is the latency / BW hit taken in loading/storing each of the four memory types from the device?
cudaMemcpy() and cudaMemcpyToSymbol() copy memory to/from global and constant memory addresses only. (As best I can tell, constant memory is actually global memory, but with a special cache/processor assigned to it - but that’s another story for another time.) (Note: you can’t actually take an address into constant memory, which is why you use cudaMemcpyToSymbol() - but cudaMemcpyToSymbol() can also copy to global memory, if I’m not mistaken.)
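Roughly what I mean, as a quick sketch (variable names are mine, not from the docs - double-check against the reference manual):

#include <cuda_runtime.h>

__constant__ float c_coeffs[16];   // symbol in constant memory
__device__   float d_scale;        // symbol in global memory

int main()
{
    float h_data[256] = {0}, h_coeffs[16] = {0}, h_scale = 2.0f;

    // ordinary global memory: allocate with cudaMalloc, fill with cudaMemcpy
    float *d_data;
    cudaMalloc((void **)&d_data, sizeof(h_data));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    // symbols (whether __constant__ or __device__): fill with cudaMemcpyToSymbol,
    // since you can't take their address from host code
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs));
    cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(h_scale));

    cudaFree(d_data);
    return 0;
}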
When ‘device’ is referenced it means the CUDA device (eg: the video card); when ‘host’ is referenced it means the host OS/CPU side (eg: Windows/CPU) - so yes, device memory = global memory and/or constant memory.
Local memory is global memory, but ‘local memory’ tends to refer to the case where you run out of registers, so CUDA starts storing ‘registers’ in global memory (in which case it’s called local memory). (Correct me if I’m wrong about this.)
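A rough illustration of what I mean (a toy kernel of my own, not from the guide): a per-thread array that gets indexed dynamically can’t be kept in registers, so the compiler places it in ‘local’ memory, which physically lives in the same off-chip DRAM as global memory. (Compiling with nvcc --ptxas-options=-v should report the lmem bytes used.)

__global__ void local_spill(int *g_out, int idx)
{
    float scratch[64];                            // likely placed in local memory
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 1.5f;
    g_out[threadIdx.x] = (int)scratch[idx & 63];  // dynamic index forces the spill
}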
I don’t think ‘texture memory’ as such exists (but I could be wrong). As best I can tell the TUs (texture units) reference global memory, but if you use a texture it has an internal cache and smarter scheduling that helps reduce latency for texture reads under some circumstances…
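Something like this is what I mean, using the texture-reference API from that era (names are mine): the data itself sits in ordinary global memory from cudaMalloc, and binding a texture reference to it just makes reads go through the texture cache.

texture<float, 1, cudaReadModeElementType> texRef;   // bound to plain gmem on the host

__global__ void read_via_texture(float *g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        g_out[i] = tex1Dfetch(texRef, i);            // read goes through the texture cache
}

// host side, with d_in/d_out allocated via cudaMalloc:
//   cudaBindTexture(0, texRef, d_in, n * sizeof(float));
//   read_via_texture<<<blocks, threads>>>(d_out, n);
//   cudaUnbindTexture(texRef);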
Shared memory is the only memory type which is ‘on chip’, in the sense that it resides on the same IC as the SPs/TUs/ROPs/etc (eg: inside the GPU), whereas global/constant memory sits in separate DRAM modules on the same PCB as the GPU. Shared memory is more or less equivalent to an L2 cache on a CPU, except you have manual control over reads/writes to it.
Latency is discussed in the programming guide under each memory type’s topic (but all you really need to know is that shared memory has the lowest latency - if addressed correctly it can be as fast as accessing a register - while global memory is the slowest, but the most plentiful).
Bandwidth is only consumed (I think) when loading/storing global, constant, or texture memory.
Note: some of what I said above might be incorrect/misleading (it was just a quick/rough write-up for you), so please correct me if I’m wrong.
After reading your note and going back to the programming guide I think much of what you wrote is dead on. “Constants” and “Texture” memory are apparently regions of “Device” memory that have associated on-chip caches. “Global” memory is probably the largest chunk of what Figure 3-1 in the programming guide calls “Device” memory, the remaining bits being “Constants” and “Texture” memory. Note that the term “Device” memory here is not in contrast to “Host” memory; it is a documentation label for the GPU memory that is not the on-chip “Shared” memory or registers, but is addressable memory physically on the GPU PCB. That would put all “Global” memory accesses, and cache misses to “Constants” and “Texture” memory, about 400-600 clock cycles from registers, from what I’m reading.
I have not figured out if you are correct in your interpretation of “local” memory. It seems to be more than simply register spill space.
The CUDA reference manual says that cudaMemcpyToSymbol() can move data to either “Global” or “Constant” memory.
So you suggest cudaMemcpy() moves to/from “Global” memory and not “Shared” memory? In that case, how do I get host data into shared memory? Is cudaMalloc() also allocating “Global” memory? “__shared__ float myArray[itsSize];” would seem to be the equivalent of a CPU’s stack-space allocation, but that fails to be useful when the size of a “Shared” memory array is only known at run time.
Shared memory is populated purely from within CUDA kernels themselves - and not by the host.
There are two ways of declaring/allocating it:
1. outside of a kernel function, using either the Runtime or Driver API, via the cudaConfigureCall / cuFuncSetSharedSize functions
2. inside the kernel itself, using static smem allocation via the __shared__ keyword (eg: “__shared__ float shared_memory[1024];”)
Note: Method #1 allocates shared memory ‘dynamically’, and is in addition to method #2 (so allocating 100 bytes via method #1, and having 4096 bytes allocated via method #2 will result in 4196 bytes of smem allocated per block in total).
Shared memory is strictly ‘populated’ inside CUDA kernels, though; you cannot copy memory from gmem to smem outside of a kernel - most probably because smem isn’t allocated until the kernel starts running - and there are strict access requirements for high-performance smem operations, which rely on the ordering of thread loads/stores (similar in concept, but with different rules, to global memory coalescing).
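To make that concrete, here’s a minimal sketch of my own (the kernel name and sizes are just examples) that uses both declaration methods and stages data from gmem into smem inside the kernel:

__global__ void staged_sum(const float *g_in, float *g_out, int n)
{
    extern __shared__ float s_dyn[];       // method #1: size supplied at launch time
    __shared__ float s_static[256];        // method #2: static size, known at compile time

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        s_dyn[threadIdx.x]    = g_in[i];   // gmem -> smem copy, done by the threads themselves
        s_static[threadIdx.x] = g_in[i] * 2.0f;
    }
    __syncthreads();                       // make the staged data visible to the whole block

    if (i < n)
        g_out[i] = s_dyn[threadIdx.x] + s_static[threadIdx.x];
}

// host side: the third launch parameter is the dynamic smem size per block
// (the Runtime API counterpart of cudaConfigureCall / cuFuncSetSharedSize):
//   staged_sum<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);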