Local Memory - What is that? Memory Hierarchies

At least, it was not very clear to me from the Programming Guide, and I guess most CUDA beginners would have the same problem as me and as the original poster.

I hope it is explained in a clearer fashion in a future manual. :)

How can you say “local memory” is per-thread? It is not documented in the manual, right? I just want to know what you base your answers on, because I don’t find this information in the manual. Sorry for being paranoid.

And true – the point about indexing is clear. If I have a local array and then say array[i], the compiler cannot generate code to do it via registers. Note that array[5] could still be handled using registers; only if you index via a non-constant will the compiler move the array to local memory. This part is quite clear.
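Something like this little sketch shows what I mean (the kernel name and sizes are made up, and what the compiler actually does depends on the compiler version):

__global__ void indexKernel(float *out, int i)
{
    float arr[8];                   // small per-thread array

    for (int k = 0; k < 8; ++k)     // constant trip count, can be unrolled
        arr[k] = k * 2.0f;

    float a = arr[5];               // constant index: can stay in registers
    float b = arr[i & 7];           // runtime index: typically forces the
                                    // whole array into local memory
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b;
}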

OK, can someone tell me where the instructions for the multiprocessor are stored? Is there a special memory inside the multiprocessor for this?

Sec 2.3 (page 10) of CUDA programming guide 1.1:
“Read-write per-thread local memory”

Sec 3.1 (page 13):
“The local and global memory spaces are implemented as read-write regions of device memory and are not cached.”

Is the 1.1 programming guide available with the 1.1 beta download? Currently only the 1.0 manual is posted on the CUDA website.

Yes.

Anderson,

The same answer was posted by “fatgarfield” in this same thread. Can you kindly look a few posts above yours? But what we decided was that “device memory” need NOT really be “global memory”.

If it were, the manual would explicitly caution about the “slowness” of this memory and would probably discourage people from using it.

What part of “not cached” fails to indicate “slowness”? It is true that the performance guidelines section 5.2.1 lacks a description of how to get the best performance out of local memory, but since you really have no control over how the compiler uses it I don’t see it as a big loss. You just have to hope that the compiler generates memory reads that are coalesced.

The truth of the matter is, if your kernel is using local memory for ANY reason, it is going to be slow. It is better to move that data into global memory and manage it explicitly to ensure that your access pattern is coalesced. Device memory bandwidth is very precious and should not be wasted on random, uncoalesced reads. Better yet, depending on your access pattern, constant memory, shared memory, or even textures may be better options.
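As a rough sketch of what “manage it explicitly” means (the buffer name, sizes and layout are just assumptions for illustration), a per-thread scratch area can be laid out in global memory so that the threads of a warp touch consecutive addresses on every iteration:

#define SCRATCH 16

// scratch is assumed to be allocated with cudaMalloc as
// numThreads * SCRATCH floats.
__global__ void scratchInGlobal(const float *in, float *out,
                                float *scratch, int numThreads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Strided layout: element k of thread tid lives at k * numThreads + tid,
    // so each iteration reads/writes consecutive addresses across the warp
    // (coalesced); a naive tid * SCRATCH + k layout would not coalesce.
    for (int k = 0; k < SCRATCH; ++k)
        scratch[k * numThreads + tid] = in[tid] + k;

    float sum = 0.0f;
    for (int k = 0; k < SCRATCH; ++k)
        sum += scratch[k * numThreads + tid];

    out[tid] = sum;
}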

The programming guide uses “device” and “global” almost as synonyms. If you want to be absolutely picky about it, “global memory” refers to a pointer to “device memory” allocated by cudaMalloc or declared __device__ (see what I mean about synonyms: you declare “global memory” with __device__! If you don’t believe me, look at 5.1.2.1 in the CUDA 1.1 guide).
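In other words, both of these end up in the same place (a minimal sketch, names made up):

__device__ float d_table[256];          // "global memory", declared with __device__

__global__ void useGlobal(float *buf)   // buf points to memory from cudaMalloc
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[tid] = d_table[tid & 255];
}

// Host side (error checking omitted):
//   float *buf;
//   cudaMalloc((void **)&buf, n * sizeof(float));
//   useGlobal<<<blocks, threads>>>(buf);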

There are fast on-chip memories in many setups that are NOT cached. But that does NOT mean that they are slow. It depends on how close the memory is to the processor. Perhaps local memory is so close to the ALUs that it doesn’t need caching. It could be that way too, right?

I agree with the part about moving to global memory and with the coalescing part. But I would still like to see an explicit mention in the CUDA manual of where exactly local memory is. It says the memory is accessed using “ld.local” and “st.local” instructions; I need to see a PTX manual to know what these mean.

Yeah, __device__ is a broad qualifier. You then refine it further by saying whether it is __shared__ or __constant__ and so on. If you don’t further qualify it, it is global memory, as you say. This is clearly stated in the manual.

Figure 3 in the following PDF may be of help in understanding “Local Memory”:

http://courses.ece.uiuc.edu/ece498/al1/mps/PTX_ISA_1.0.pdf

But I don’t know whether this PDF is authentic.

See PTX_ISA_1.1.pdf concerning the local state space.

That should answer your question.

cheers,

Sven

From experience in my own code and from reading these forums for months, the vast majority of kernels never need to use local memory. The few that do usually show up in posts asking “why is this kernel so slow?”. So please, let us stop bickering over semantics or what “could have been” in the hardware design. Just believe us when we tell you that it is slow. When you come across a kernel in your work that uses local memory (and it will no doubt be slow), you are welcome to post it here for pointers on improving it.

Okay, I am gonna bicker again… :-)

Sven’s post indicates that local memory is a typical memory with a “cache” on it. So why would it be slow? Perhaps because it prevents the multiprocessor from executing warps simultaneously – like an “if” statement based on thread ID that causes warps to diverge and re-join later…

Could be so… just a thought… It would be great if an NVIDIA person answered this local-memory question.

Sven, Thanks a lot for pointing out the “local memory” thing from the PDF.

Looks like this PDF is installed as part of the CUDA Toolkit; C:\CUDA\DOC has it on my machine. I installed the CUDA kit on C:, so I think the PDF is authentic.

Yeah, the PDF is very much authentic… I would suggest you go through the PTX PDF; it might give you the missing information… Good luck…

Local memory is just like global memory, but local to a thread, and it has the same latency and other properties. It can thus be considered very slow compared to constant memory, shared memory and registers. The fast memories don’t need a cache because they are as fast as registers; in that case a cache would only be harmful.
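So if the per-thread data fits, shared memory is the place to put it instead. A minimal sketch (block size, scratch size and layout are just assumptions; the kernel must be launched with THREADS threads per block):

#define THREADS 128
#define SCRATCH 8

__global__ void scratchInShared(float *out)
{
    // Per-thread scratch in on-chip shared memory:
    // 128 * 8 * 4 bytes = 4 KB per block, well within the 16 KB per
    // multiprocessor on G80, and no off-chip (local memory) traffic.
    __shared__ float scratch[SCRATCH * THREADS];
    int t = threadIdx.x;

    // The k * THREADS + t layout keeps the threads of a half-warp in
    // different banks.
    for (int k = 0; k < SCRATCH; ++k)
        scratch[k * THREADS + t] = t + k * 0.5f;

    float sum = 0.0f;
    for (int k = 0; k < SCRATCH; ++k)
        sum += scratch[k * THREADS + t];

    out[blockIdx.x * blockDim.x + t] = sum;
}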

About local memory being cached, I doubt it; there is no mention anywhere of a “local memory cache”, whereas there is of the “constant cache”…

Please check the PTX_ISA PDF and Sven’s post (contains an excerpt) somewhere above.

Yes, I know about the PTX_ISA PDF… It is not meant as a hardware description; on one of the first pages it already talks about a “virtual machine”. It is meant as a generic description of current and future NVIDIA computing devices. Did you notice it contains more things that aren’t actually implemented? One example is the .surface memory space; AFAIK, it does not exist on G80.

None of the real hardware descriptions (like the one in the CUDA developer guide) mentions a local memory cache, so you cannot assume local memory is actually cached. Experiments and timings have also shown that local memory is slow. Also, explicitly making things local was deprecated in 1.0. Try to steer clear of it as much as possible.

Yes, along the same lines: one cannot say that it is NOT cached, or that it resides in global memory.

Sure. It could be that since this memory is per-thread in nature, it causes the warp to go into lock-step execution when it is accessed. Thus a memory access that usually completes or stalls in one clock cycle (depending on the kind of memory) for the entire warp now happens in lock-step fashion (completing or stalling per thread), which can drastically slow down performance.

If someone from NVIDIA talked about it, it would be great!

It does reside in global memory; I’m sure of that much. Then again, constant data and shader code also reside in global memory, so that fact alone doesn’t tell you anything about the caching scheme, that’s true.

What do you base your claim on?

By dumping the GPU memory. You can find the code and constants for all the kernels by reading the right (global memory) offsets from within a kernel.

Aah, that’s pretty interesting. So I assume you did that and found it out! Hmm… that sounds cool!