Hello,

I am new to CUDA and I am trying to figure out how to solve a simple problem.

I built a kernel that computes a 1D stencil. It works like this:

- Input vector:

in = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

- Expected output (for the stencil algorithm with radius r = 2):

out = (0, 0, 10, 15, 20, 25, 30, 35, 0, 0)

My code works fine for small vector sizes!
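For small sizes I check the result against a plain C++ host reference (`stencil_ref` is just a helper name I made up):

```
#include <cassert>
#include <vector>

// Host reference for the 1D stencil: out[i] is the sum of in[i-r] .. in[i+r]
// for the interior points; the r border values at each end stay 0.
std::vector<float> stencil_ref(const std::vector<float> &in, int r) {
    std::vector<float> out(in.size(), 0.0f);
    int n = static_cast<int>(in.size());
    for (int i = r; i + r < n; i++) {
        float sum = 0.0f;
        for (int k = -r; k <= r; k++)
            sum += in[i + k];
        out[i] = sum;
    }
    return out;
}
```

For the input above, `stencil_ref(in, 2)` reproduces (0, 0, 10, 15, 20, 25, 30, 35, 0, 0).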

The code for my kernel:

```
__global__ void stencil(float *in, float *out) {
    // N and r are compile-time constants: the vector length and the radius
    __shared__ float temp[N];
    int i = threadIdx.x;

    // Each thread stages one interior element in shared memory
    temp[i + r] = in[i + r];

    // The first r threads also load the r halo elements at each end
    if (i < r) {
        temp[i] = in[i];
        temp[N - 1 - i] = in[N - 1 - i];
    }
    __syncthreads();

    // Sum the 2*r + 1 elements of the stencil window
    float sum = 0.0f;
    for (int k = -r; k <= r; k++) {
        sum += temp[i + r + k];
    }

    // --- Store the result (the r border elements at each end are never written)
    out[i + r] = sum;
}
```

To call this I am using one block, with the number of threads equal to the number of computed output values, i.e. N - 2*r. In my main code I call the kernel as:

```
stencil<<<1, N-2*r>>>(d_in, d_out);
```
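For context, here is a sketch of the host side as I understand it has to look (`h_in` and `h_out` are my host buffers; note that `d_out` must be zero-initialized, e.g. with `cudaMemset`, because the kernel never writes the r border elements — that is where the zeros in my expected output come from):

```
// Host-side sketch. d_out must be zeroed, or the border values are undefined.
float *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(d_out, 0, N * sizeof(float));

stencil<<<1, N - 2 * r>>>(d_in, d_out);

cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
```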

**My problem:** how can I implement similar code for **N = 1000000**?

In a standard CUDA program I would simply increase the number of blocks. Unfortunately, I can't do that here, because my algorithm shares memory between the threads of a block.
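From what I have read, the usual fix seems to be tiling: each block stages only its own slice of the input, plus r halo cells on each side, in a fixed-size shared array, so shared memory never has to span blocks. This is my untested sketch (`BLOCK_SIZE` and `RADIUS` are constants I chose; `n` is passed at run time):

```
#define RADIUS 2
#define BLOCK_SIZE 256

__global__ void stencil_tiled(const float *in, float *out, int n) {
    // Tile for this block: BLOCK_SIZE elements + RADIUS halo cells per side
    __shared__ float temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = blockIdx.x * blockDim.x + threadIdx.x + RADIUS; // global index
    int lindex = threadIdx.x + RADIUS;                           // index in tile

    if (gindex < n - RADIUS) {
        temp[lindex] = in[gindex];
        // The first RADIUS threads of each block load both halos
        if (threadIdx.x < RADIUS) {
            temp[lindex - RADIUS] = in[gindex - RADIUS];
            if (gindex + BLOCK_SIZE < n)
                temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
        }
    }
    __syncthreads(); // reached by every thread, even out-of-range ones

    if (gindex < n - RADIUS) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; k++)
            sum += temp[lindex + k];
        out[gindex] = sum;
    }
}
```

which I would launch with enough blocks to cover the interior points, e.g. `stencil_tiled<<<(n - 2 * RADIUS + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, n)`. Is this the right approach, and am I handling the halos at the block boundaries correctly?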

Thank you!