Where best to allocate memory: on the local stack or in shared memory?

Suppose I have a kernel which requires a little local memory to perform its workings. Let’s say I need a static amount of around 500 bytes.

I guess I could either allocate it on the local stack like so:

__global__ void Compute(…)
{
    float temp[128];
    // Do some work…
}

Or in shared memory, like so:

__global__ void Compute(…)
{
    __shared__ float temp[BLOCK_SIZE_X * BLOCK_SIZE_Y][128];
    // Do some work…
}

Which would be best? How scarce are the different memory spaces?

No such thing as a stack on the GPU.

Your first code segment will place that array in local (and therefore global) memory.

If the shared memory isn't doing anything else useful, why not use it!

As for the sizes of the memory spaces (I'm assuming that's what you're asking at the end), check the programming guide, appendix A.1.1; shared memory is 16 KB.

Is that ~500 bytes per thread or per block? If it's per thread, it won't fit in shared memory unless you use at most 32 threads per block: 128 floats × 4 B is 512 B per thread, and 32 × 512 B already fills the whole 16 KB. You can use 16 KB of shared memory per block (minus kernel parameters; they are implicitly copied to smem). You might be forced to go with local memory, but it's a lot slower than shared. Think of smem as a managed cache, while local memory is just a piece of global memory, i.e. off-chip RAM.

If that’s ~500B per block, go with shared.
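To make the per-thread case concrete, here is a minimal sketch of per-thread scratch carved out of shared memory, with a block size chosen to fit the budget (the parameter and array names are hypothetical, and it assumes the kernel is launched with BLOCK_THREADS threads per block):

#define BLOCK_THREADS 16   // 16 threads * 128 floats * 4 B = 8 KB, comfortably under 16 KB

__global__ void Compute(float* out)
{
    // One private 128-float slice per thread, all carved out of shared memory.
    __shared__ float temp[BLOCK_THREADS][128];
    float* myTemp = temp[threadIdx.x];

    // Do some work using myTemp[] as fast scratch space…
    myTemp[0] = 2.0f * threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = myTemp[0];
}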

Fascinating.

Since I am not using shared memory currently, I will move some of my objects there, starting with the most frequently accessed ones. If my understanding is correct, accessing them in shared memory will be quicker. I’ll make sure I don’t go over 16 KB.
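A minimal sketch of that staging pattern, assuming 64 threads per block and a hypothetical gData array holding the frequently accessed values:

#define THREADS_PER_BLOCK 64

__global__ void Compute(const float* gData, float* out)
{
    // 64 threads * 4 floats * 4 B = 1 KB of shared memory per block.
    __shared__ float cache[THREADS_PER_BLOCK * 4];

    // Each thread copies its four values from slow global memory once…
    int gBase = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    int sBase = threadIdx.x * 4;
    for (int i = 0; i < 4; ++i)
        cache[sBase + i] = gData[gBase + i];
    __syncthreads();  // only needed if threads read each other's slices

    // …and all further accesses hit the fast on-chip copy.
    out[blockIdx.x * blockDim.x + threadIdx.x] = cache[sBase] + cache[sBase + 3];
}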

Thanks.

Wait, I don’t get it. This means that now instead of

__global__ void Compute(float a, float b)
{
    float c = a * b;
    // etc.
}

I now have to write

#define BLOCK_SIZE_X 8
#define BLOCK_SIZE_Y 8

__global__ void Compute(float a, float b)
{
    __shared__ float c[BLOCK_SIZE_Y][BLOCK_SIZE_X];
    c[threadIdx.y][threadIdx.x] = a * b;
    // etc.
}

which is obviously a huge pain, and doesn’t generalize well for objects with constructors.

Can this be right? It seems like the compiler could do this…

Simple variables (int, float, char, etc., and built-in vector types like float2 and int4) go to registers by default (unless you run out of hardware registers, in which case they ‘spill’ to local memory). Registers are separate from shared memory and are per-thread (smem is as fast as registers as long as there are no access conflicts that need to be serialized).

I believe allocating a small array, like float c[4], will also get translated into four consecutive registers. But for an array as big as 128 elements, the compiler will spill it to local memory.

So, it’s like this:

float c = a*b;            // compiles to register operations, unless you run out of resources
float c[4];               // compiles to registers (I think)
float c[128];             // too big for registers, compiled to local memory
__shared__ float c[128];  // compiled to shared memory, on a per-block basis: every thread in the block sees the same array!
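One way to see where the compiler actually put things is to pass -v through to ptxas: compiling with nvcc --ptxas-options=-v prints a per-kernel resource report of registers, smem, and lmem, so a spill like the float c[128] case shows up as local memory usage. The numbers below are made up for illustration, and the exact report format varies between toolkit versions:

nvcc --ptxas-options=-v mykernel.cu
# ptxas info : Used 10 registers, 512 bytes lmem, 32+16 bytes smem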

Oh, and about constructors: C++ is not supported in device code, only C (with some extensions like templates, but it’s not really object-oriented programming). Read through the programming guide; all of your questions so far are answered there.

Only if you index them in a way that can be calculated at compile time. Dynamically indexed arrays will go into lmem. I’m not sure if there is a size limit or not.
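A minimal illustration of the difference (the kernel and variable names are hypothetical):

__global__ void IndexingExample(const int* idx, float* out)
{
    // Every index is a compile-time constant, so a[] can be mapped onto registers.
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float s = a[0] + a[3];

    // The index is only known at run time, so b[] must be addressable
    // memory and gets placed in lmem.
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    s += b[idx[threadIdx.x] & 3];

    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}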

Yep, the programming guide states:

You can do dynamic shared memory allocation. It’s in the Programming Guide (should have said that before…).
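For reference, the dynamic variant declares an unsized extern __shared__ array in the kernel, and the byte count is supplied per launch as the third parameter of the execution configuration; a sketch, reusing the earlier hypothetical Compute kernel:

__global__ void Compute(float a, float b)
{
    extern __shared__ float c[];  // size fixed at launch time, not compile time
    c[threadIdx.x] = a * b;
    // etc.
}

// Host side: reserve blockDim.x floats of shared memory for each block.
// Compute<<<gridDim, blockDim, blockDim.x * sizeof(float)>>>(a, b);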

Are function parameters stored in local memory or in registers? What about simple variables, or structs?

Shared mem, actually.

I would suggest using shared memory when the data is actually shared by several threads on the multiprocessor. That is the situation where you get the most benefit from it.
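A classic case where the data really is shared is a block-wide reduction: at every step of the tree, threads read values that other threads wrote, which only works because partial[] lives in shared memory. A minimal sketch (the kernel and buffer names are hypothetical), assuming the kernel is launched with BLOCK_SIZE threads per block and the input length is a multiple of BLOCK_SIZE:

#define BLOCK_SIZE 128

__global__ void SumBlocks(const float* in, float* blockSums)
{
    __shared__ float partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}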