# __shared__ memory - 1D stencil algorithm

Hello,

I am new to CUDA and I am trying to figure out how to solve a simple problem.

I built a kernel that computes a 1D stencil. It works
like this:

• input vector:

in=(0,1,2,3,4,5,6,7,8,9);

• As output (for the stencil algorithm with radius r=2) I would have:

out=(0,0,10,15,20,25,30,35,0,0)

My code works fine for small vectors!

The code for my kernel:

``````
// N and r are assumed to be compile-time constants, e.g.:
// #define N 10
// #define r 2

__global__ void stencil(float *in, float *out) {

    __shared__ float temp[N];

    int i = threadIdx.x;   // one thread per computed output element

    // --- Load the interior element this thread is responsible for
    temp[i + r] = in[i + r];

    // --- The first r threads also load the halo elements at both ends
    if (i < r) {
        temp[i] = in[i];
        temp[N - 1 - i] = in[N - 1 - i];
    }

    // --- Wait until every thread has finished loading shared memory
    __syncthreads();

    float sum = 0.0f;
    for (int k = -r; k <= r; k++)
        sum += temp[i + r + k];

    // --- Store the result
    out[i + r] = sum;
}
``````

To call this I am using 1 block and one thread per computed output
value, that is N-2*r threads. In my main code
I launch the kernel as:

``````
stencil<<<1, N - 2 * r>>>(d_in, d_out);
``````

My problem: how can I implement a similar code for N=1000000?
In a standard CUDA program I would simply increase the number of blocks, but
in this case I can't see how, because my algorithm shares memory between the threads of a single block.
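The multi-block version I have been trying to write would look something like the sketch below, where each block loads its own tile plus a halo of RADIUS elements on each side into shared memory, so shared memory only needs to cover one block's portion. This is only a sketch: BLOCK_SIZE and RADIUS are assumed compile-time constants, the number of interior points N-2*RADIUS is assumed to be a multiple of BLOCK_SIZE, and the kernel is launched on pointers offset past the left halo. I am not certain it is correct:

``````cuda
#define RADIUS 2
#define BLOCK_SIZE 256

// Assumes gridDim.x * BLOCK_SIZE == number of interior points,
// and that in/out point RADIUS elements past the start of the
// allocations, so in[-RADIUS] .. in[gridDim.x*BLOCK_SIZE + RADIUS - 1] are valid.
__global__ void stencil_tiled(const float *in, float *out) {
    __shared__ float temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = blockIdx.x * blockDim.x + threadIdx.x;  // global interior index
    int lindex = threadIdx.x + RADIUS;                   // index into this block's tile

    // --- Each thread loads one interior element of the tile
    temp[lindex] = in[gindex];

    // --- The first RADIUS threads also load the halo on both sides
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS]     = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // --- Wait until the whole tile (including halos) is loaded
    __syncthreads();

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; k++)
        sum += temp[lindex + k];

    out[gindex] = sum;
}

// Launch, offsetting the pointers past the left halo:
// stencil_tiled<<<(N - 2 * RADIUS) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
``````

Is this the right pattern, and how should the partial last block be handled when the interior size is not a multiple of BLOCK_SIZE?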

Thank you!