Efficient way of reading dynamic array in kernel?


In my kernel, I need to read a dynamically allocated array (allocated in host code). However, it appears that a dynamic array can only be allocated in global memory, which is very inefficient for my kernel. Are there any solutions? I know we could use shared memory, but even there I would need to know in advance the size of the tile I want to read, which I do not.

Also, is there a way to allocate dynamic memory in device code? In every example I have seen, the dynamic array is allocated in host code only.


Only global and constant memory are accessible from the host. You can dynamically allocate memory (from outside the kernel) in global memory and in shared memory (for shared memory, via a parameter to the kernel launch). You can't dynamically allocate constant memory.

You can cache reads from global memory using textures (but note that the texture cache is read-only and non-coherent).

If you access each memory location only once and can perform coalesced accesses, then reading global memory directly is about the best you can do.
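As a rough illustration of the coalesced case described above, here is a minimal sketch in which the array length is known only at run time and each thread reads exactly one element. The names `scale` and `scale_on_device` are hypothetical, chosen for this example only, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Thread i reads element i, so consecutive threads of a warp touch
// consecutive addresses and the loads can coalesce.
__global__ void scale(const float *in, float *out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * factor;
}

// Hypothetical host helper: the array size comes from the caller at run time.
std::vector<float> scale_on_device(const std::vector<float> &host_in, float factor)
{
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0)
        return host_out;  // nothing to launch

    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));   // dynamic allocation in global memory
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    scale<<<blocks, threads>>>(d_in, d_out, n, factor);

    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Because every element is read once, staging it through shared memory first would not reduce the number of global loads in a pattern like this.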

Yes, that's correct.

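Here's the way to dynamically allocate shared memory: declare an unsized extern array in the kernel, and pass its size in bytes as the third launch-configuration parameter.

extern __shared__ float array[];

This allocates size bytes of shared memory for array, where size is the value passed at launch.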
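To make the launch side concrete, here is a minimal sketch assuming a single block and an `int` array whose length is known only at run time. The kernel `reverse_in_block` and the host helper `reverse_on_device` are hypothetical names for this example, and error checking is left out.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void reverse_in_block(int *data, int n)
{
    // Unsized extern array: its byte size is the third launch parameter.
    extern __shared__ int tile[];

    int t = threadIdx.x;
    if (t < n)
        tile[t] = data[t];
    __syncthreads();              // the whole tile must be loaded before reading it back
    if (t < n)
        data[t] = tile[n - 1 - t];
}

// Hypothetical host helper; single-block sketch, so n is limited to 1024 threads.
std::vector<int> reverse_on_device(std::vector<int> v)
{
    int n = static_cast<int>(v.size());
    if (n == 0 || n > 1024)
        return v;                 // outside the scope of this sketch

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Third parameter: bytes of dynamic shared memory per block.
    reverse_in_block<<<1, n, n * sizeof(int)>>>(d, n);

    cudaMemcpy(v.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

The tile size never appears as a compile-time constant here, which is the point raised in the original question.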

If you need more than one dynamically sized array, you have to place them at different offsets within this single allocation.

Thank you for the replies.

Could you explain in slightly more detail how I could allocate more than one variable dynamically in shared memory? An example would be really appreciated.

extern __shared__ double sh_base[];

// Suppose you want the equivalent of:
//   __shared__ int    int_base[50];
//   __shared__ float  float_base[25];
//   __shared__ double db_base[10];

__global__ void foo()
{
    int    *int_base   = (int *)sh_base;
    float  *float_base = (float *)&int_base[50];
    double *db_base    = (double *)&float_base[30];

    // Why 30 rather than 25? To satisfy the alignment requirement:
    // 50 ints + 25 floats = 300 bytes, which is not a multiple of 8,
    // while 50 ints + 30 floats = 320 bytes is, so db_base is 8-byte aligned.
}

So you need to dynamically allocate 50 * sizeof(int) + 30 * sizeof(float) + 10 * sizeof(double) = 400 bytes.
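Putting the offsets and the launch together, a complete version might look like the sketch below. It reuses the sizes from this thread (50 ints, 25 floats padded to 30, 10 doubles). The `result` parameter and the host helper `run_foo` are hypothetical additions so that the kernel has an observable output, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

__global__ void foo(double *result)
{
    // One dynamic allocation; declaring it as double gives an 8-byte aligned base.
    extern __shared__ double sh_base[];

    int    *int_base   = (int *)sh_base;               // bytes   0..199
    float  *float_base = (float *)&int_base[50];       // bytes 200..319 (25 used, 5 padding)
    double *db_base    = (double *)&float_base[30];    // bytes 320..399

    int t = threadIdx.x;
    if (t < 50) int_base[t]   = 1;
    if (t < 25) float_base[t] = 2.0f;
    if (t < 10) db_base[t]    = 3.0;
    __syncthreads();

    // Thread 0 sums all three arrays to show that they do not overlap.
    if (t == 0) {
        double sum = 0.0;
        for (int i = 0; i < 50; ++i) sum += int_base[i];
        for (int i = 0; i < 25; ++i) sum += float_base[i];
        for (int i = 0; i < 10; ++i) sum += db_base[i];
        *result = sum;
    }
}

// Hypothetical host helper: launches one block of 64 threads with 400 bytes
// of dynamic shared memory and returns the sum computed by thread 0.
double run_foo()
{
    size_t bytes = 50 * sizeof(int) + 30 * sizeof(float) + 10 * sizeof(double);

    double *d_result = nullptr;
    double h_result = 0.0;
    cudaMalloc(&d_result, sizeof(double));
    foo<<<1, 64, bytes>>>(d_result);
    cudaMemcpy(&h_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}
```

If the three arrays overlapped, later writes would clobber earlier ones and the sum would differ from 50 * 1 + 25 * 2 + 10 * 3 = 130.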

Yes, see the CUDA Programming Guide, p. 107, Appendix B (B.2.3), for a proper understanding of allocating multiple variables in shared memory.

The key fact is that every array declared as extern __shared__ starts at the same address, the beginning of the dynamically allocated shared memory, so we need to separate them by adding offsets.

If you have any doubts, let me know!