Help me to understand Global vs Local Memory performance.


I would like to understand some things I couldn’t find in the manuals.
For example, if 256 threads (a block) are reading a 256 elements array each one (a “private” array per thread):

a) The array is in Local Memory: I understand that each thread reads its element in parallel with the other 16 threads of the same half-warp (is this true for Local Memory accesses?).
b ) The arrays (one per thread) are in Global Memory (perfect coalescing): You have to do the same number of reads than Local Memory. You read 16 elements in parallel (half-warp).

So, if the latency is the same (manuals say that), is the performance the same too?
Then, why people prefer to use global memory?

Thank you.

latency of global memory is about 500~530 cycle in Tesla C1060,

you can read volkov’s paper,…_Volkov_GPU.pdf

besides local memory is local for “scope”, it is part of global memory physically.

Hence latency of local memory is not shorter than latency of global memory.

Local memory is used for thread-local temporary storage during kernel execution. It is not persistent, nor can it be copied to or from the host.

Correction: I should say that local memory is for thread-local storage when registers cannot be used for one reason or another. The enormous register file on CUDA devices means that most thread-local variables can be stored directly in registers. Only in cases where the required amount of thread storage is very large, or local arrays with non-constant indexing, is local memory used.

Really interesting paper… Thank you.

Well, Local Memory is a part of Global Memory, but then, if accesses are always coalesced in LM, how data per thread is distributed in GM?
Do you have one “separate” (contiguous) region for each thread and then you have to do one serial access per thread? (Not coalesced accesses!), or do you have a “mix” of all threads data to get coalesced accesses? Do you understand what I mean?

These questions are due to I have a BIG OpenMP threadprivate array and I have to decide if I’m going to use LM o GM. I have only one kernel.

Thank you.

How do I make sure my data is stored in registers instead of Local Memory?

If I declare “double array[16][16]” inside my kernel or device function, is there any way to put it into registers (assuming I have enough of those per MP).

What is the size of a single register? 4 or 8 byte?

I believe the register keyword with make the compiler place a variable in registers rather than local memory, but it won’t work for your array case, because there is a limit of either 127 or 128 registers per thread, IIRC. Registers are currently 32 bit on all generations of hardware (I don’t know whether Fermi changes this or not).

I couldn’t find any info on that limit in CUDA Programming Guide. I thought we can use as many registers as available (i.e. 16k registers available per block, therefore for block of 256 threads we have 64 registers per threads; 64 threads gives us 256 registers).

I don’t believe it is discussed in the Programming Guide (although there might be something in the PTX guide), but I am pretty certain it has certainly been experimentally by someone here (maybe it was Sylvian) by disassembly of compiler output.

I’ve been able to get to 124 even though maxrregcount should be 128.

One interesting thing though is that if you are unrolling a loop mixed with your “register array” and shared memory variables you will only be able to unroll up to 63… After that you get a fatal nvcc error.

You can see the assembler code and you will find “.local” in your array if it has been placed there.

Maybe you can test what you want playing with -maxrregcount and the size of the array. Thus, you will see if “you” can place your array in registers when there are enough available or arrays are always placed in local memory.

Any comments about my previous questions? :confused:

Thank you

In terms of your OpenMP code, it really depends on how you want to decompose the code to run it on the GPU.

While it might be logical to map an OpenMP thread to a CUDA thread that isn’t necessarily the only nor the best way to do it. Depending on how much work and divergence there is in the parallel section, it might actually be better to treat a single OpenMP thread as a whole CUDA block, parallelize many of the tight inner loops or serial code sections as threads, and use shared memory to imitate your thread private array. The first and last actions of some worker threads could be to read and write to/from shared memory in a coalesced fashion, given that the final results have to go back to global memory anyway.

The only way that the compiler could turn those into registers is if you always accessed the array with constant indices. Relative indexing of registers is not possible in current NVIDIA GPUs (and very hard to get right in general so the architecture experts in the forum tell us).

The following is just my understanding. NO GUARANTEEs of correctness.

In older versions of CUDA, local mem accesses were not coalesced.

FOr example, if u used a local array of size 100 with 1000 threads in a kernel launch,

the local memory code compiled as if the local array was laid out in memory like:

Thus when thread 0 and thread 1 are accessing the first element of the array, they will be accessing addresses displaced by 100*sizeof(element). So utterly un-colaesced accesses…

So, people feared local memory…Smarter ones laid their local memory themselves in global memory in a smart way like this:

The arrangement above guarantees coalescedness… Isnt it?

So, finally, in one of the CUDA versions, NVIDIA guys themselves incorporated thsi change in their compiler , driver and whatever…



for (int i =0; i < 10; i++)




be considered as accessing with constant indecies? :geek: Or loop unrolling should do the trick? Bad for me I have some more inner loops where indeces are computed using i.

Yes, just do:

#pragma unroll

for (int i =0; i < 10; i++)




One idea might be to use template variables, this can sometimes simplify things.


Just that was what I needed to confirm!!

Thank you very much!

I’m currently playing around with placing everything into the registers, and probably found the source of the 128-registers limit.

This is the content of the $(CUDA_BIN_PATH)/


  nvcc -g -G --host-compilation=c -c -D__DEVICE_LAUCH_PARAMETERS_H__ -Xptxas -fext


typedef struct dim3_16 {

	unsigned short x, y, z;

} dim3_16;

typedef struct dim2_16 {

	unsigned short x, y;

} dim2_16;

__shared__ dim3_16  threadIdx;

__shared__ dim2_16  blockIdx;

__shared__ dim3_16  blockDim;

__shared__ dim2_16  gridDim;

int $gpu_registers[128];

I wonder what effect would changing $gpu_registers[128] have? :rolleyes:

At least you now acknowledge that the limit exists. I doubt changing the limit will do anything (or at least anything good) - you have to keep in mind that ptx is only an intermediate representation of the actual code that runs on the device. It is quite possible that downstream hardware limitations are the reason for the limit, and extra registers in the PTX beyond the limit will just either get spilled to local memory or cause something fatal.


I think the better source of information would be decuded code - at least it is finished one that is executed on the device.

You are right, there is “GPU specific maximum of 128 registers” according to the nvcc_2.3.pdf from your CUDA Toolkit doc folder.

Edit: I was abe to get 124 register used, not a single register more.