Shared Memory Problem


I wrote a simple program that adds the elements of two arrays and stores the result in a third array. I am trying to use shared memory to make the calculation faster, but I am getting this error:

    sample.kernal.cuh: In function ‘void _Z6AddGPUPiS_S_i(int*, int*, int*, int)’:
    sample.kernal.cuh:21: error: ‘__vla_alloc’ was not declared in this scope
    sample.kernal.cuh:43: error: ‘__eh_curr_region’ was not declared in this scope
    sample.kernal.cuh:43: error: ‘__vla_dealloc’ was not declared in this scope

My kernel function looks like this:


    __global__ void AddGPU(
        int *d_ainp,
        int *d_binp,
        int *d_Cadd,
        const int ARY_N)
    {
        // Thread index
        const int tid = blockDim.x * blockIdx.x + threadIdx.x;

        // Total number of threads in execution grid
        const int THREAD_N = blockDim.x * gridDim.x;

        // Shared memory for the matrix A
        __shared__ int As[ARY_N];

        // Shared memory for the matrix B
        __shared__ int Bs[ARY_N];

        // Shared memory for the matrix C
        __shared__ int Cs[ARY_N];

        // Load the matrices from global memory to shared memory
        for (int i = tid; i < ARY_N; i += THREAD_N)
        {
            As[i] = d_ainp[i];
            Bs[i] = d_binp[i];
        }

        for (int ar = tid; ar < ARY_N; ar += THREAD_N)
        {
            Cs[ar] = As[ar] + Bs[ar];
        }

        // Copy the result back to global memory
        for (int k = tid; k < ARY_N; k += THREAD_N)
        {
            d_Cadd[k] = Cs[k];
        }
    }

Here ARY_N is a constant; its value is 100000.

Can anyone help me use shared memory? How can I copy elements from global memory to shared memory, do the calculation in shared memory, and copy the result back to global memory?

Thank you in advance.


This will not even compile in C/C++. You can't define a static array whose size is unknown at compile time.
If you want dynamic shared memory, look at the SDK samples.
Hint: myKernel<<< x, y, z >>>
where z is the amount of dynamic shared memory (in bytes) you want to use.
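To make the hint concrete, here is a minimal sketch of dynamic shared memory. The names addTiled and addTiledOnGpu and the block size of 256 are assumptions for illustration, not taken from the original post, and error checking is left out for brevity. Each block stages only its own tile of the inputs, so the shared memory request is 2 KB per block rather than the size of the whole array.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each block stages one tile of a and b in shared memory.
__global__ void addTiled(const int *a, const int *b, int *c, int n)
{
    // The size is not known at compile time; it comes from the third
    // launch parameter (in bytes).
    extern __shared__ int tile[];
    int *as = tile;               // first blockDim.x ints
    int *bs = tile + blockDim.x;  // next blockDim.x ints

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        as[threadIdx.x] = a[i];
        bs[threadIdx.x] = b[i];
        // Each thread reads back only the slots it wrote itself,
        // so no __syncthreads() is needed in this particular kernel.
        c[i] = as[threadIdx.x] + bs[threadIdx.x];
    }
}

// Hypothetical host wrapper: copies inputs to the GPU, launches, copies back.
std::vector<int> addTiledOnGpu(const std::vector<int> &a, const std::vector<int> &b)
{
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n);
    if (n == 0) return c;

    const size_t bytes = n * sizeof(int);
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                               // assumed block size
    const int blocks = (n + threads - 1) / threads;        // round up
    const size_t sharedBytes = 2 * threads * sizeof(int);  // two tiles per block
    addTiled<<<blocks, threads, sharedBytes>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

Calling addTiledOnGpu with two equally sized vectors returns their element-wise sum. This only demonstrates the mechanism; since every element is read exactly once, staging it in shared memory is not expected to make this particular computation faster.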


I think the best place to start is the samples that come with the CUDA SDK.
Try to understand them; that would make things much clearer.

Try the transpose example. It's one of the simplest and most straightforward to understand.

The total amount of shared memory per block is 16 KB. Leaving aside the attempt to size a static array at runtime (which can never work, as already pointed out), you are trying to allocate about 1200 KB of shared memory (3 arrays × 100000 ints × 4 bytes), which also can never work.

The best you can hope to do is coalesce the global memory reads, lose the looping, and launch many threads. Shared memory will not help in this case.
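A minimal sketch of that suggestion, with one thread per element and no shared memory, might look like this. The names addDirect and addDirectOnGpu and the block size of 256 are assumptions for illustration, and error checking is omitted. Consecutive threads touch consecutive addresses, which is the access pattern that lets the global memory reads and writes coalesce.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread per element, straight from global memory.
__global__ void addDirect(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {            // guard the last, partially filled block
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host wrapper: copies inputs to the GPU, launches, copies back.
std::vector<int> addDirectOnGpu(const std::vector<int> &a, const std::vector<int> &b)
{
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n);
    if (n == 0) return c;

    const size_t bytes = n * sizeof(int);
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                         // assumed block size
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    addDirect<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

With ARY_N = 100000 and 256 threads per block, this launches 391 blocks, and the if (i < n) guard keeps the extra threads in the last block from writing past the end of the arrays.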

I tried the dynamic memory approach. It helped a lot. Thank you.

Thank you.