Help me to understand Global vs Local Memory performance.

Hello.

I would like to understand some things I couldn’t find in the manuals.
For example, suppose each of the 256 threads in a block reads its own 256-element array (a “private” array per thread):

a) The array is in Local Memory: I understand that each thread reads its element in parallel with the other 15 threads of the same half-warp (is this true for Local Memory accesses?).
b) The arrays (one per thread) are in Global Memory (perfect coalescing): you have to do the same number of reads as with Local Memory. You read 16 elements in parallel (one half-warp).

So, if the latency is the same (the manuals say it is), is the performance the same too?
Then why do people prefer to use global memory?

Thank you.

The latency of global memory is about 500–530 cycles on a Tesla C1060.

You can read Volkov’s paper: http://mc.stanford.edu/cgi-bin/images/6/65…_Volkov_GPU.pdf

Besides, local memory is “local” only in scope; physically it is part of global memory.

Hence the latency of local memory is no shorter than the latency of global memory.

Local memory is used for thread-local temporary storage during kernel execution. It is not persistent, nor can it be copied to or from the host.

Correction: I should say that local memory is for thread-local storage when registers cannot be used for one reason or another. The enormous register file on CUDA devices means that most thread-local variables can be stored directly in registers. Local memory is used only when the required amount of per-thread storage is very large, or for local arrays accessed with non-constant indexing.
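A minimal sketch of the two cases described above (the kernel and variable names here are made up for illustration): a small array touched only with compile-time-constant indices can stay in registers, while a runtime index forces the array into local memory.

```cuda
// Hypothetical example: 'a' vs 'b' spilling behaviour.
__global__ void example(float *out, int j)
{
    float a[4];
    #pragma unroll
    for (int i = 0; i < 4; i++)      // fully unrolled -> constant indices,
        a[i] = i * 2.0f;             // so 'a' can live in registers

    float b[4] = {0.f, 1.f, 2.f, 3.f};
    out[threadIdx.x] = a[3] + b[j];  // b[j]: runtime index -> 'b' goes
                                     // to local memory
}
```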

Really interesting paper… Thank you.

Well, Local Memory is a part of Global Memory, but then, if accesses to LM are always coalesced, how is the per-thread data laid out in GM?
Is there one “separate” (contiguous) region for each thread, which would force one serialized access per thread (not coalesced!), or is the data of all threads interleaved to get coalesced accesses? Do you see what I mean?

These questions arise because I have a BIG OpenMP threadprivate array and I have to decide whether to use LM or GM. I have only one kernel.

Thank you.

How do I make sure my data is stored in registers instead of Local Memory?

If I declare “double array[16][16]” inside my kernel or device function, is there any way to put it into registers (assuming I have enough of them per MP)?

What is the size of a single register? 4 or 8 byte?

I believe the register keyword will make the compiler place a variable in registers rather than local memory, but it won’t work for your array case, because there is a limit of either 127 or 128 registers per thread, IIRC. Registers are currently 32-bit on all generations of hardware (I don’t know whether Fermi changes this or not).

I couldn’t find any info on that limit in the CUDA Programming Guide. I thought we could use as many registers as are available (i.e. 16K registers per block, so a block of 256 threads gives 64 registers per thread; 64 threads gives 256 registers per thread).

I don’t believe it is discussed in the Programming Guide (although there might be something in the PTX guide), but I am pretty certain it has been verified experimentally by someone here (maybe it was Sylvain) by disassembly of compiler output.

I’ve been able to get to 124 registers, even though maxrregcount should allow 128.

One interesting thing, though: if you are unrolling a loop that mixes your “register array” with shared memory variables, you will only be able to unroll up to 63 iterations… after that you get a fatal nvcc error.

You can look at the assembler code: if the array has been placed in local memory, you will find “.local” next to it.

Maybe you can test what you want by playing with -maxrregcount and the size of the array. That way you will see whether the array goes into registers when enough are available, or whether arrays are always placed in local memory.
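For reference, a possible compile line for that experiment (the file name kernel.cu is a placeholder; -keep preserves the intermediate PTX so you can search it):

```shell
# Report register/lmem usage per kernel and cap registers at 64:
nvcc -O3 --ptxas-options=-v -maxrregcount=64 -keep kernel.cu
# Any ".local" hits mean the array was spilled to local memory:
grep -n "\.local" kernel.ptx
```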

Any comments about my previous questions? :confused:

Thank you

In terms of your OpenMP code, it really depends on how you want to decompose the code to run it on the GPU.

While it might be logical to map an OpenMP thread to a CUDA thread, that isn’t necessarily the only, nor the best, way to do it. Depending on how much work and divergence there is in the parallel section, it might actually be better to treat a single OpenMP thread as a whole CUDA block, parallelize many of the tight inner loops or serial code sections across threads, and use shared memory to imitate your threadprivate array. The first and last actions of some worker threads could be to read from and write to shared memory in a coalesced fashion, given that the final results have to go back to global memory anyway.
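That mapping could look roughly like the sketch below (all names are placeholders; the real inner-loop work depends on your OpenMP code). Each block stands in for one OpenMP thread, and the threadprivate array lives in shared memory:

```cuda
// Sketch: one CUDA block per former OpenMP thread.
__global__ void worker(const float *in, float *out)
{
    __shared__ float priv[256];   // stands in for the threadprivate array

    // First action: coalesced load of "this OpenMP thread's" array,
    // one element per CUDA thread.
    priv[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // ... parallelize the former inner loops across threadIdx.x,
    //     operating on priv[] in shared memory ...

    __syncthreads();
    // Last action: coalesced store back to global memory.
    out[blockIdx.x * blockDim.x + threadIdx.x] = priv[threadIdx.x];
}
```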

The only way the compiler could turn those into registers is if you always accessed the array with constant indices. Relative indexing of registers is not possible on current NVIDIA GPUs (and is very hard to get right in general, so the architecture experts on the forum tell us).

The following is just my understanding; no guarantees of correctness.

In older versions of CUDA, local memory accesses were not coalesced.

For example, if you used a local array of size 100 with 1000 threads in a kernel launch, the code was compiled as if each thread’s 100 elements occupied one contiguous chunk of memory.

Thus, when thread 0 and thread 1 accessed the first element of the array, they accessed addresses displaced by 100*sizeof(element). Utterly uncoalesced accesses…

So people feared local memory… The smarter ones laid their per-thread data out in global memory themselves in a smart way: interleaved by thread index, so that element i of thread t sits at offset i*numThreads + t.

That arrangement guarantees coalescedness… doesn’t it?

So, finally, in one of the CUDA versions, the NVIDIA guys themselves incorporated this change into their compiler, driver, and so on…

HTH

Can

for (int i = 0; i < 10; i++)
{
    array[i];
}

be considered as accessing with constant indices? :geek: Or should loop unrolling do the trick? Bad for me, I have some more inner loops where indices are computed using i.

Yes, just do:

#pragma unroll
for (int i = 0; i < 10; i++)
{
    array[i];
}

One idea might be to use template variables; this can sometimes simplify things.
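For instance, something along these lines (a hypothetical sketch; sum_array and N are made-up names): making the loop bound a template parameter means it is a compile-time constant, so the loop can be fully unrolled and the array indexed with constants only.

```cuda
// Sketch: compile-time loop bound via a template parameter.
template <int N>
__device__ float sum_array(const float (&array)[N])
{
    float s = 0.0f;
    #pragma unroll
    for (int i = 0; i < N; i++)   // N is known at compile time,
        s += array[i];            // so each array[i] is a constant index
    return s;
}
```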

THANKS!!

Just that was what I needed to confirm!!

Thank you very much!

I’m currently playing around with placing everything into registers, and I have probably found the source of the 128-register limit.

This is the content of $(CUDA_BIN_PATH)/ptxvars.cu:

/*
  nvcc ptxvars.cu -g -G --host-compilation=c -c -D__DEVICE_LAUCH_PARAMETERS_H__ -Xptxas -fext
 */

typedef struct dim3_16 {
    unsigned short x, y, z;
} dim3_16;

typedef struct dim2_16 {
    unsigned short x, y;
} dim2_16;

__shared__ dim3_16  threadIdx;
__shared__ dim2_16  blockIdx;
__shared__ dim3_16  blockDim;
__shared__ dim2_16  gridDim;

int $gpu_registers[128];

I wonder what effect would changing $gpu_registers[128] have? :rolleyes:

At least you now acknowledge that the limit exists. I doubt changing the limit will do anything (or at least anything good): you have to keep in mind that PTX is only an intermediate representation of the actual code that runs on the device. It is quite possible that downstream hardware limitations are the reason for the limit, and extra registers in the PTX beyond it will either get spilled to local memory or cause something fatal.

:whistling:

I think the better source of information would be the decuda output; at least that is the final code that is executed on the device.

You are right, there is a “GPU specific maximum of 128 registers” according to nvcc_2.3.pdf in your CUDA Toolkit doc folder.

Edit: I was able to get 124 registers used, not a single register more.