Dynamic Shared memory


I have 3 shared memory arrays in my kernel function, as follows.

extern __shared__ int As[]; // shared memory for array A

extern __shared__ int Bs[]; // shared memory for array B

extern __shared__ int Cs[]; // shared memory for array C

I have 3 global memory arrays, each of size 100000. I need to copy all the elements of A to As and B to Bs, and I want to do the calculation in shared memory to reduce the access time. So I wrote the following code.


__global__ void AddGPU(int *d_ainp,
                       int *d_binp,
                       int *d_Cadd,
                       int ARY_N)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x; // thread index
    const int THREAD_N = blockDim.x * gridDim.x; // total number of threads in the execution grid

    extern __shared__ int As[]; // shared memory for array A
    extern __shared__ int Bs[]; // shared memory for array B
    extern __shared__ int Cs[]; // shared memory for array C

    // Copy the array elements from global memory to shared memory
    for (int i = tid; i < ARY_N; i += THREAD_N)
    {
        As[i] = d_ainp[i];
        Bs[i] = d_binp[i];
    }

    // Do the addition of the arrays in shared memory, putting the result into shared memory
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        Cs[ar] = As[ar] + Bs[ar];

    // Copy the result to global memory
    for (int k = tid; k < ARY_N; k += THREAD_N)
        d_Cadd[k] = Cs[k];
}


I am calling the kernel function as follows.

AddGPU<<<Dg, Db, sharedMemSize>>>(d_ainp, d_binp, d_Cadd, ARY_N);
Here I want to take timings with different numbers of blocks and different numbers of threads.

I am confused about how to determine the size of the shared memory, and about how much memory is allocated to each of the 3 arrays. Can anyone help me make my code correct?

Hm… it seems you created two threads for the same or a similar problem.

First of all, all extern shared arrays will point to the same part of memory. This kinda sucks, but it is how it is done. There are two options:

a) use only one array and access it in a clumsy way:

array[i] for A

array[i+ARY_N] for B

array[i+ARY_N*2] for C

b) define pointers into a single array:

extern __shared__ int array[];

int *As = &array[0];

int *Bs = &array[ARY_N];

int *Cs = &array[ARY_N * 2];
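To make option (b) concrete, here is a sketch of how the original AddGPU kernel might look with a single dynamic shared array partitioned into the three sub-arrays. This assumes ARY_N is small enough that all three fit in one block's shared memory; note it also adds the __syncthreads() barriers the original code was missing, since without them one thread may read As/Bs elements another thread has not written yet:

```cuda
__global__ void AddGPU(int *d_ainp, int *d_binp, int *d_Cadd, int ARY_N)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;

    // One dynamic shared allocation, carved into three sub-arrays
    extern __shared__ int array[];
    int *As = &array[0];
    int *Bs = &array[ARY_N];
    int *Cs = &array[ARY_N * 2];

    // Load both inputs into shared memory
    for (int i = tid; i < ARY_N; i += THREAD_N)
    {
        As[i] = d_ainp[i];
        Bs[i] = d_binp[i];
    }
    __syncthreads(); // make the loads visible to all threads in the block

    // Add in shared memory
    for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        Cs[ar] = As[ar] + Bs[ar];
    __syncthreads();

    // Store the result back to global memory
    for (int k = tid; k < ARY_N; k += THREAD_N)
        d_Cadd[k] = Cs[k];
}
```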

Secondly, in the launch configuration you define the size of shared memory per block. So I would guess you want to call

AddGPU<<<Dg, Db, ARY_N * 3 * sizeof(int)>>>(...)

Thirdly, remember you can allocate a maximum of 16 KB of shared memory. If your ARY_N is too big, your kernel will fail to launch.

Finally, for a computation as simple as your example, you don’t need shared memory at all. Shared memory is useful when you want to access an element several times, usually from different threads.

Note that your algorithm has to:

  • load data from global memory

  • make the addition

  • store data to global memory

and no matter if you use shared memory or not, you still have to perform those 3 steps!
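For comparison, without shared memory the whole kernel reduces to exactly those three steps, one global load per input and one global store. A sketch, keeping the original parameter names:

```cuda
__global__ void AddGPU(int *d_ainp, int *d_binp, int *d_Cadd, int ARY_N)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;

    // Each element is touched exactly once, so staging it in
    // shared memory first buys nothing here
    for (int i = tid; i < ARY_N; i += THREAD_N)
        d_Cadd[i] = d_ainp[i] + d_binp[i];
}
```

This version also has no per-block size limit on ARY_N, since nothing has to fit in shared memory.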

You can’t copy all the elements of those global memory arrays to shared memory, because all three of those shared memory arrays must fit in the same 16 KB pool of shared memory per block. Irrespective of that, using shared memory the way you are proposing isn’t going to improve the execution speed of a trivial kernel like this one.

If you are just trying to add array A to array B and store the result in array C, I don’t think you need to use shared memory.

If you are using it for another purpose, you should look at the shared memory section of the NVIDIA CUDA Programming Guide for more information. I have a suggestion:

Assume that the maximum number of elements that can be stored in a shared memory array is N = sharedMemSize / sizeof(int).

1> Copy N elements from global memory to shared memory: A → As, B → Bs.

2> Calculate something with the data in As and Bs.

3> After calculating, write the result back to C (global memory); in this case you don’t need to write data to Cs.

4> Again, copy data from A to As at an offset of N elements (same for Bs),

and continue calculating…
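A sketch of that chunked scheme (a single block is shown for simplicity; the names A, B, C and the chunk size N are taken from the steps above, and the per-chunk "calculation" is just the addition as a placeholder):

```cuda
__global__ void AddChunkedGPU(int *A, int *B, int *C, int total, int N)
{
    extern __shared__ int shmem[]; // sized to 2 * N * sizeof(int) at launch
    int *As = &shmem[0];
    int *Bs = &shmem[N];

    // Walk over the input N elements at a time
    for (int base = 0; base < total; base += N)
    {
        int count = min(N, total - base);

        // 1> copy one chunk from global memory to shared memory
        for (int i = threadIdx.x; i < count; i += blockDim.x)
        {
            As[i] = A[base + i];
            Bs[i] = B[base + i];
        }
        __syncthreads();

        // 2> and 3> compute with As/Bs, writing the result straight to C
        for (int i = threadIdx.x; i < count; i += blockDim.x)
            C[base + i] = As[i] + Bs[i];
        __syncthreads(); // 4> now safe to overwrite As/Bs with the next chunk
    }
}
```

The second __syncthreads() is what makes step 4> safe: no thread starts reloading As/Bs until every thread has finished computing with the current chunk.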