Local memory?

The CUDA Programming Guide elaborates on shared, global and register memory, but is very thin on the properties of local memory. So I have some questions about it:

  • What are the access costs of local memory (.local instructions)?

  • How much local memory is available (per thread or per multiprocessor)?

  • Can I force variables to be in local memory to save registers?

The compiler seems very reluctant to put things in local memory, so I have the idea that it must be very slow (like global memory) or limited in some other way.

You are right. The section about local memory is gone from the latest 0.8.1 programming guide. I don’t know why. It is still in the toolkit and you can use it. Local memory is now only mentioned in the introductory section 2.3. So here is what I did to use it.

Local memory is device memory, so all the access rules for device memory apply with respect to coalescing, (non)caching, etc. However, local memory must have static storage, like shared memory.

It is device memory. I never filled it up but I guess it shares the resources with device memory.

Use the __local__ attribute.


Oops, prkipfer beat me to it.

From what I understand (and please correct me if I am wrong), local memory is basically the same as global memory but it is only accessible from the thread in which the variable was declared. As such, it is very slow to access and is not cached, but is only limited in size by the amount of DRAM on the GPU. A local variable can be used just like any other declared variable.

A variable can be put into local memory by using __local__ before the type of the variable. For example:

__local__ float foo;

A question for others on the board: can the address of a local variable be taken for purposes such as passing to a function or building a data structure such as a linked list?

Yes. It is device memory.
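For what it is worth, here is a minimal sketch of taking the address of a per-thread variable and passing it to a function. It is written against a current CUDA toolkit rather than 0.8, and rests on a few assumptions: current toolkits have no __local__ qualifier, so the sketch uses a plain per-thread array, which the compiler may place in local memory because it is indexed with a run-time value. The names pick, pickKernel and runPick are hypothetical, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: receives a pointer to a per-thread array.
__device__ float pick(const float *values, int index)
{
    return values[index];
}

__global__ void pickKernel(const int *indices, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Private to each thread. Because it is indexed with a value known only
    // at run time, the compiler may place it in local memory (assumption:
    // the final placement is the compiler's decision, not the programmer's).
    float table[8];
    for (int i = 0; i < 8; ++i)
        table[i] = 10.0f * i;

    // The array's address is passed to another device function.
    out[tid] = pick(table, indices[tid] & 7);
}

// Hypothetical host helper: runs the kernel, returns one value per index.
std::vector<float> runPick(const std::vector<int> &indices)
{
    int n = static_cast<int>(indices.size());
    std::vector<float> host(n);
    if (n == 0) return host;

    int *dIndices = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIndices, sizeof(int) * n);
    cudaMalloc(&dOut, sizeof(float) * n);
    cudaMemcpy(dIndices, indices.data(), sizeof(int) * n, cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    pickKernel<<<blocks, threads>>>(dIndices, dOut, n);

    cudaMemcpy(host.data(), dOut, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(dIndices);
    cudaFree(dOut);
    return host;
}
```

Whether the array really ends up in local memory can be checked in the generated PTX, where such accesses show up as .local loads and stores.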


Please correct me, but it seems that for “local” memory to be implemented efficiently the compiler needs to add an extra dimension (indexed by tid) at the bottom level of all structures (for coalescing reasons, since buses are only going to get wider), and that breaks C: what do you get from sizeof()? Basically a crock. So my guess is that it has been deprecated and all references to it should be deleted from section 2.3 as well. It just complicates the model. Local memory really just == registers.

My understanding is that local memory is a deprecated feature. My very unofficial understanding is that it could disappear, so I wouldn’t rely on it.

That is probably right. Let’s see whether it is still in the 0.9 release.

Using blockIdx and threadIdx, addressing device memory yourself is not hard anyway, so it is no big loss.
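A minimal sketch of that manual addressing, written against a current CUDA toolkit and under some assumptions: each thread gets SCRATCH_WORDS private floats in an ordinary device buffer, and the buffer is interleaved by thread index so that neighbouring threads touch neighbouring addresses (the "extra dimension indexed by tid" mentioned earlier in the thread). SCRATCH_WORDS, scratchAt, sumScratch and runSumScratch are hypothetical names, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-thread scratch size, chosen only for illustration.
constexpr int SCRATCH_WORDS = 4;

// Element i of thread `tid` lives at scratch[i * totalThreads + tid], so
// consecutive threads access consecutive addresses for the same i.
__device__ inline float &scratchAt(float *scratch, int i, int tid, int totalThreads)
{
    return scratch[i * totalThreads + tid];
}

__global__ void sumScratch(float *scratch, float *out, int totalThreads)
{
    // Global thread index built from blockIdx and threadIdx.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= totalThreads) return;

    // Each thread fills its own private slots...
    for (int i = 0; i < SCRATCH_WORDS; ++i)
        scratchAt(scratch, i, tid, totalThreads) = float(tid + i);

    // ...and reads them back.
    float sum = 0.0f;
    for (int i = 0; i < SCRATCH_WORDS; ++i)
        sum += scratchAt(scratch, i, tid, totalThreads);
    out[tid] = sum;
}

// Hypothetical host helper: runs the kernel, returns one result per thread.
std::vector<float> runSumScratch(int blocks, int threadsPerBlock)
{
    int n = blocks * threadsPerBlock;
    float *scratch = nullptr;
    float *out = nullptr;
    cudaMalloc(&scratch, sizeof(float) * n * SCRATCH_WORDS);
    cudaMalloc(&out, sizeof(float) * n);

    sumScratch<<<blocks, threadsPerBlock>>>(scratch, out, n);

    std::vector<float> host(n);
    cudaMemcpy(host.data(), out, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(scratch);
    cudaFree(out);
    return host;
}
```

The interleaved layout is a design choice rather than a requirement. A per-thread contiguous layout (scratch[tid * SCRATCH_WORDS + i]) is simpler to reason about, but makes neighbouring threads stride through memory.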