How to dynamically allocate shared memory in __global__ or __device__ functions?

Hello,

Could someone please tell me how to allocate shared memory in __global__ or __device__ functions?

I noticed that we can allocate a block of shared memory of a certain size when we launch the kernel, but I did not find a way to use this block of memory in a more controlled manner. For example, in a __global__ function, I want to allocate an array in shared memory whose size is only known at run time.

Furthermore, in this __global__ function I want to call a __device__ function that needs to allocate its own shared memory (again, the size can only be determined at run time). Is this possible with the current CUDA structure?

Thank you,

I’ve run into this problem. It is impossible, IMO, for a number of reasons.
The amount of shared memory used has to be known when the kernel is launched.

Thank you for the reply.

Suppose I do know the sizes of the arrays before I launch the kernel, and can specify the total size when I launch the kernel, e.g.:
testKernel<<<grid, threads, mem_size>>>(...)

Then, inside testKernel, how do I tell the program that one part of the memory (mem_size) belongs to the first array and the rest belongs to the second array?

Dynamic allocation of shared mem inside a kernel will never be possible, as shared mem is a fixed-size on-chip resource of the multiprocessor. AFAIK there is no processor architecture out there that can dynamically grow its on-chip memory while running. So the amount has to be fixed when the multiprocessor starts working on the program, i.e. when the block is launched, i.e. when a grid is executed.

Yes, you can allocate a chunk big enough and just carve it up as required. Note that the shared mem amount you specify is per block (and each block gets its own separate mem area). That is why you cannot share the mem across blocks (see the programming guide). Within a block the threads share the mem, and you can easily dispatch on threadIdx, as in the sketch below.
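For example, here is a minimal sketch of that carving-up (testKernel, n1 and n2 are hypothetical names, not from the guide):

// One extern __shared__ declaration gives access to the whole chunk
// whose size is passed as the third launch parameter.
extern __shared__ float smem[];

__global__ void testKernel(int n1, int n2)
{
    // Partition the chunk manually: the first n1 floats are array A,
    // the next n2 floats are array B.
    float* arrayA = smem;
    float* arrayB = smem + n1;

    if (threadIdx.x < n1) arrayA[threadIdx.x] = 0.0f;
    if (threadIdx.x < n2) arrayB[threadIdx.x] = 1.0f;
}

// Host side: request enough shared memory for both arrays.
// testKernel<<<grid, threads, (n1 + n2) * sizeof(float)>>>(n1, n2);

If the arrays have different element types, carve them out of a single buffer in order of decreasing alignment (or pad manually), since all extern __shared__ declarations alias the same starting address.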

If you need to share mem across blocks, you have to use the global mem space (on the card). Note that you then need to synchronize concurrent accesses, as in the sketch below.
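A minimal sketch of that global-mem route (countMatches and counter are hypothetical names), with an atomic guarding the concurrent updates:

// Blocks cooperate through a counter in global memory.
__global__ void countMatches(const int* data, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // *counter is visible to every block, but concurrent
    // read-modify-write must go through an atomic.
    if (i < n && data[i] == 42)
        atomicAdd(counter, 1);
}

(Integer atomicAdd on global memory needs compute capability 1.1 or later.)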

Peter

Thank you for the information.

I guess I found the answer to my question: it is in Section 4.2.2.3 of the programming guide. I will give it a try. Thanks, all.

But then how do you write code that automatically adapts to different graphics cards? For example, what if I want to use 16 KB of shared memory on Tesla but 48 KB on Fermi?

Thanks for the tip.

  • Jeff

Section B.2.3 of the CUDA C Programming Guide.
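That section covers the extern __shared__ declaration; to adapt the amount to the card, you can size the launch from the device properties on the host. A sketch (testKernel is a hypothetical name, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

extern __shared__ float buffer[];

__global__ void testKernel(int n)
{
    // buffer holds n floats, however much the card provided.
    if (threadIdx.x < n)
        buffer[threadIdx.x] = 0.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // sharedMemPerBlock reports 16 KB on Tesla (compute 1.x)
    // and up to 48 KB on Fermi (compute 2.x).
    size_t memSize = prop.sharedMemPerBlock;
    int n = (int)(memSize / sizeof(float));

    testKernel<<<1, 256, memSize>>>(n);
    cudaDeviceSynchronize();
    printf("used %zu bytes of shared memory per block\n", memSize);
    return 0;
}

Note that on compute 1.x devices a small part of shared memory is reserved for kernel arguments, so in practice you may need to request slightly less than the full amount; on Fermi you can also trade L1 cache for shared memory with cudaFuncSetCacheConfig if the default split is not what you want.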
