Newbie - Memory Model

I’ve been going through some of the examples and Programming Guide and I am still having a little trouble grasping what I can and cannot do with the memory model.

Can I declare an array inside a kernel? For example:

uint numbers[32];

Can you declare an array dynamically inside a kernel? For example:

uint *numbers = new uint[32];

I haven’t found any examples like this. However, some of my device functions need a bit of scratch space in their calculations. I am thinking the first example would work, but could possibly increase register usage greatly. Will this work? About how much register usage per thread should one aim for? Is there a better method?


Arrays inside a kernel are stored in local memory, which won’t increase your register count but is usually very slow. Check the manual.
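To make that concrete, here is a minimal sketch of the first approach from the question (names and sizes are illustrative, not from the original post). Because the indexing is dynamic, the compiler will typically place the array in local memory:

```cuda
// Sketch: a fixed-size per-thread scratch array. With dynamic
// indexing like this, the compiler typically places "scratch" in
// local memory (off-chip), which is slow compared to registers.
__global__ void scratchKernel(const unsigned int *in, unsigned int *out, int n)
{
    unsigned int scratch[32];          // per-thread scratch space
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    for (int i = 0; i < 32; ++i)       // dynamic index -> local memory
        scratch[i] = in[tid] + i;

    unsigned int sum = 0;
    for (int i = 0; i < 32; ++i)
        sum += scratch[i];
    out[tid] = sum;
}
```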

And remember that CUDA only has C/C++-like syntax; device code is neither of those in practice. So there is no heap for dynamic mallocs inside a kernel.
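The usual workaround is to allocate scratch space from the host and pass the pointer into the kernel. A minimal sketch (the kernel name, sizes, and signature here are hypothetical):

```cuda
// Since you cannot new/malloc inside a kernel, allocate device
// scratch from the host side and pass the pointer in.
unsigned int *d_scratch = 0;
size_t bytes = numThreads * 32 * sizeof(unsigned int);  // 32 uints per thread
cudaMalloc((void **)&d_scratch, bytes);

myKernel<<<grid, block>>>(d_in, d_out, d_scratch);

cudaFree(d_scratch);
```

Each thread then indexes its own 32-element slice of `d_scratch`, e.g. starting at `d_scratch + tid * 32`.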

Whenever in doubt, go through the examples in the SDK. They’re very good. You’ll usually find something similar to what you’re after.

[quote]Arrays inside a kernel are stored in local memory[/quote]


That’s not always true. If you use constant indexing and the array is not very big, then it will remain in registers.
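For example, a small array indexed only by compile-time constants (such as a fully unrolled loop) can stay in registers. A minimal sketch, with illustrative names:

```cuda
// With compile-time-constant indices (the #pragma unroll makes every
// acc[i] a constant index), a small array can stay in registers
// instead of spilling to local memory.
__global__ void regArrayKernel(float *out)
{
    float acc[4] = {1.0f, 2.0f, 3.0f, 4.0f};  // small, fixed size
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 4; ++i)   // fully unrolled -> constant indices
        sum += acc[i];
    out[threadIdx.x] = sum;
}
```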

Thanks for the correction. Although constant indexing is not really fun. :P

To allocate dynamic shared memory, you need to use the extra shared-memory-size kernel configuration parameter when you launch the kernel.

For example: You usually invoke kernel like:

kernel_name <<< grid, block >>> (...)

“grid” and “block” are the standard required kernel configuration parameters.

Apart from this, you can specify dynamic shared memory size as the third argument…


kernel_name <<< grid, block, 100*(sizeof(float) + sizeof(int)) >>> (...)

In the example above, enough dynamic shared memory for 100 integers and 100 floats is reserved for that particular invocation of the kernel.

Thus from invocation to invocation, this dynamic shared memory size could vary depending on your inputs.

Of course, this dynamic shared memory is PER block.

And, to access this inside your kernel, you have to declare it like this:


extern __shared__ int dynamicMemory[];

__shared__ int *intarray;

__shared__ float *floatarray;



At the kernel’s run time, “dynamicMemory” will have its address as the “start” address of the dynamic memory in that block. Note that if you declare multiple such external declarations then all such symbols would correspond to the “start” address of the dynamic memory in that block.

So, you should do something like this at the start of your code:

if (threadIdx.x == 0) {

    intarray = (int *)dynamicMemory;

    floatarray = (float *)&intarray[100];

}


and so on.
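Putting the pieces above together, a complete sketch might look like this (the 100-element sizes are illustrative and must match the launch configuration; the `__syncthreads()` makes the pointers set by thread 0 visible to the rest of the block):

```cuda
// Partitioning one dynamic shared memory allocation into two arrays.
__global__ void partitionKernel(float *out)
{
    extern __shared__ int dynamicMemory[];
    __shared__ int   *intarray;
    __shared__ float *floatarray;

    if (threadIdx.x == 0) {
        intarray   = (int *)dynamicMemory;
        floatarray = (float *)&intarray[100];
    }
    __syncthreads();   // pointers are now visible to all threads

    // ... use intarray[0..99] and floatarray[0..99] here ...
}

// Host side: reserve 100 ints + 100 floats per block.
// partitionKernel<<<grid, block, 100 * (sizeof(int) + sizeof(float))>>>(d_out);
```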

The CUDA programming guide has info on this. Just search for “extern” and you can reach that section…

Hope this helps.

From reading everything, it seems like you need to break programs up into pieces that access very small chunks of data from global memory each kernel invocation. There doesn’t seem to be any efficient way to process chunks of even a hundred bytes or so with each kernel.

That’s definitely not the case. Many applications process tens of megabytes or more per kernel call.

In the CUDA mindset, you process large arrays with large numbers of threads in a single call. In a very simple kernel, like the elementwise stage of a dot product of two long vectors, each thread is responsible for computing just one element in the output array. There’s no need for thread-local array storage in this case because each thread only has one element to deal with.

Implementing a dot product this way in the CPU-threading world would be insane, but on GPUs, the hardware is built to facilitate this approach to problem solving. More complex kernels use the shared memory to stage a chunk of data for all the threads in the block. Textures let you access arrays in a slightly less linear fashion when necessary. But even with these tools, you still generally think about solving problems by assigning a thread to the smallest practical chunk of work you want to accomplish.
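A minimal sketch of that one-thread-per-element style (names are illustrative): each thread computes a single product term, and a separate reduction step, not shown here, would sum the partial products into the final dot product.

```cuda
// One thread per output element: each thread writes one product term.
__global__ void elementwiseProduct(const float *a, const float *b,
                                   float *prod, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        prod[i] = a[i] * b[i];   // one element per thread
}
```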


That is an excellent explanation of something I’ve also found quite confusing. Thank you very much!

:) :) :)

Then you got the wrong impression: you can work on all of global memory in your kernel in one call. It is just that you should avoid per-thread local arrays, since those end up in slow local memory. You can use shared memory as a fast cache if that benefits your problem, but it is not required. Also, if you can split your problem into blocks that do not need to interact, chances are that you have enough shared memory to use as a cache for all your data.

It all depends on the program you are implementing, I have kernels in which I use no shared memory at all, since I do not have a need for it.


Glad that you found it useful. Interestingly, your post on cluster programming was highly enlightening for me. Thanks.

Best Regards,


Thanks for the help, guys! I think I finally have a handle on the memory model. At least, more so than before : )