Where best to allocate memory: on the local stack or in shared memory?

Suppose I have a kernel which requires a little local memory to perform its workings. Let’s say I need a static amount of around 500 bytes.

I guess I could either allocate it on the local stack like so:

__global__ void Compute(…)
{
    float temp[128];
    // Do some work…
}

Or in shared memory, like so:

__global__ void Compute(…)
{
    __shared__ float temp[BLOCK_SIZE_X * BLOCK_SIZE_Y][128];
    // Do some work…
}

Which would be best? How scarce are the different memory spaces?

No such thing as a stack on the GPU.

Your first code segment will place that array in local (and therefore global) memory.

If the shared memory isn't doing anything else useful, why not use it!

As for the sizes of the memory spaces (I'm assuming that's what you're asking at the end), check the programming guide, appendix A.1.1; shared memory is 16 KB.

Is that ~500 bytes per thread or per block? If it's per thread, it won't fit in shared memory unless you use at most 32 threads per block: 128 floats × 4 B is 512 B per thread, and 32 × 512 B already fills the whole 16 KB. You can use 16 KB of shared memory per block (minus kernel parameters; they are implicitly copied to smem). You might be forced to go with local memory, but it's a lot slower than shared. Think of smem as a managed cache, while local memory is just a piece of global memory, i.e. off-chip RAM.

If that’s ~500B per block, go with shared.
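To make the per-thread case concrete, here is a minimal sketch of per-thread scratch carved out of shared memory, with a block size chosen to fit the budget (the parameter and array names are hypothetical, and it assumes the kernel is launched with BLOCK_THREADS threads per block):

#define BLOCK_THREADS 16   // 16 threads * 128 floats * 4 B = 8 KB, comfortably under 16 KB

__global__ void Compute(float* out)
{
    // One private 128-float slice per thread, all carved out of shared memory.
    __shared__ float temp[BLOCK_THREADS][128];
    float* myTemp = temp[threadIdx.x];

    // Do some work using myTemp[] as fast scratch space…
    myTemp[0] = 2.0f * threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = myTemp[0];
}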

Fascinating.

Since I am not using shared memory currently, I will move some of my objects there, starting with the most frequently accessed ones. If my understanding is correct, accessing them in shared memory will be quicker. I’ll make sure I don’t go over 16 KB.
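A minimal sketch of that staging pattern, assuming 64 threads per block and a hypothetical gData array holding the frequently accessed values:

#define THREADS_PER_BLOCK 64

__global__ void Compute(const float* gData, float* out)
{
    // 64 threads * 4 floats * 4 B = 1 KB of shared memory per block.
    __shared__ float cache[THREADS_PER_BLOCK * 4];

    // Each thread copies its four values from slow global memory once…
    int gBase = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    int sBase = threadIdx.x * 4;
    for (int i = 0; i < 4; ++i)
        cache[sBase + i] = gData[gBase + i];
    __syncthreads();  // only needed if threads read each other's slices

    // …and all further accesses hit the fast on-chip copy.
    out[blockIdx.x * blockDim.x + threadIdx.x] = cache[sBase] + cache[sBase + 3];
}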

Thanks.

Wait, I don’t get it. This means that now instead of

__global__ void Compute(float a, float b)
{
    float c = a * b;
    // etc.
}

I now have to write

#define BLOCK_SIZE_X 8
#define BLOCK_SIZE_Y 8

__global__ void Compute(float a, float b)
{
    __shared__ float c[BLOCK_SIZE_Y][BLOCK_SIZE_X];
    c[threadIdx.y][threadIdx.x] = a * b;
    // etc.
}

which is obviously a huge pain, and doesn’t generalize well for objects with constructors.

Can this be right? It seems like the compiler could do this…

Simple variables (int, float, char, etc., and built-in vector types like float2 and int4) go to registers by default (unless you run out of hardware registers, in which case they ‘spill’ to local memory). Registers are separate from shared memory and are per-thread (smem is as fast as registers as long as there are no access conflicts that need to be serialized).

I believe allocating a small array, like float c[4], will also get translated into four consecutive registers. But for an array as big as 128 elements, the compiler will spill it to local memory.

So, it’s like this:

float c = a*b;            // compiles to register operations, unless you run out of resources
float c[4];               // compiles to registers (I think)
float c[128];             // too big for registers, compiled to local memory
__shared__ float c[128];  // compiled to shared memory, on a per-block basis: every thread in the block sees the same array!
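One way to see where the compiler actually put things is to pass -v through to ptxas: compiling with nvcc --ptxas-options=-v prints a per-kernel resource report of registers, smem, and lmem, so a spill like the float c[128] case shows up as local memory usage. The numbers below are made up for illustration, and the exact report format varies between toolkit versions:

nvcc --ptxas-options=-v mykernel.cu
# ptxas info : Used 10 registers, 512 bytes lmem, 32+16 bytes smem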

Oh, and about constructors: C++ is not supported in device code, only C (with some extensions like templates, but it’s not really object-oriented programming). Read through the programming guide; all of your questions so far are answered there.

Only if you index them in a way that can be calculated at compile time. Dynamically indexed arrays will go into lmem. I’m not sure if there is a size limit or not.
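A minimal illustration of the difference (the kernel and variable names are hypothetical):

__global__ void IndexingExample(const int* idx, float* out)
{
    // Every index is a compile-time constant, so a[] can be mapped onto registers.
    float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float s = a[0] + a[3];

    // The index is only known at run time, so b[] must be addressable
    // memory and gets placed in lmem.
    float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };
    s += b[idx[threadIdx.x] & 3];

    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}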

Yep, the programming guide states:

You can do dynamic shared memory allocation. It’s in the Programming Guide (should have said that before…).
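For reference, the dynamic variant declares an unsized extern __shared__ array in the kernel, and the byte count is supplied per launch as the third parameter of the execution configuration; a sketch, reusing the earlier hypothetical Compute kernel:

__global__ void Compute(float a, float b)
{
    extern __shared__ float c[];  // size fixed at launch time, not compile time
    c[threadIdx.x] = a * b;
    // etc.
}

// Host side: reserve blockDim.x floats of shared memory for each block.
// Compute<<<gridDim, blockDim, blockDim.x * sizeof(float)>>>(a, b);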

Are function parameters stored in local memory or in registers? What about simple variables, or structs?

Shared mem, actually.

I would suggest using shared memory when the data is actually shared by several threads on the multiprocessor. That is the situation where you get the most benefit from it.
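A classic case where the data really is shared is a block-wide reduction: at every step of the tree, threads read values that other threads wrote, which only works because partial[] lives in shared memory. A minimal sketch (the kernel and buffer names are hypothetical), assuming the kernel is launched with BLOCK_SIZE threads per block and the input length is a multiple of BLOCK_SIZE:

#define BLOCK_SIZE 128

__global__ void SumBlocks(const float* in, float* blockSums)
{
    __shared__ float partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = partial[0];
}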