__shared__ memory - 1D stencil algorithm


I am new to CUDA and I am trying to solve a simple problem.

I built a kernel that computes a 1D stencil. It works
like this:

  • As input I have a vector of N elements, e.g. [1, 2, 3, 4, 5, 6, 7].

  • As output, each element is the sum of the 2*r+1 neighbouring input
    elements (for the stencil algorithm with radius r=2), so the output
    for the example above would be [_, _, 15, 20, 25, _, _] (the first
    and last r positions have no full neighbourhood).
My code works fine for small vectors!

The code for my kernel:

__global__ void stencil(float *in, float *out) {

    __shared__ float temp[N];

    int i = threadIdx.x;

    // --- Each thread loads one element into shared memory; the
    //     first 2*r threads also load the trailing halo elements
    temp[i] = in[i];
    if (i < 2*r) temp[N - 2*r + i] = in[N - 2*r + i];
    __syncthreads();

    // --- Sum the 2*r+1 neighbours of position i+r
    float sum = 0.0f;
    for (int k = -r; k <= r; k++)
        sum += temp[i + r + k];

    // --- Store the result
    out[i+r] = sum;
}

To call this kernel I use 1 block and one thread per computed
output value, that is, N-2*r threads. In my main code I call
the kernel as:

stencil<<<1, N-2*r>>>(d_in, d_out);
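For completeness, the host side around that call looks roughly like this (a sketch with error checking omitted; I am assuming N and r are compile-time constants and that h_in / h_out are host arrays of N floats):

```
float *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

stencil<<<1, N - 2*r>>>(d_in, d_out);

cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_in);
cudaFree(d_out);
```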

My problem: how can I make this work for N = 1000000?
In a standard CUDA program I would simply increase the number of blocks,
but in this case I can't, because my algorithm shares memory between the threads of a single block.
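From what I understand, a multi-block version would need each block to load its own tile of the input plus a halo of r elements on each side, something like the sketch below (BLOCK_SIZE is a compile-time constant I am assuming, with N-2*r divisible by BLOCK_SIZE, launched as stencil_multiblock<<<(N-2*r)/BLOCK_SIZE, BLOCK_SIZE>>>) - but I am not sure this is the right approach:

```
__global__ void stencil_multiblock(float *in, float *out) {
    // Tile for this block plus r halo elements on each side
    __shared__ float temp[BLOCK_SIZE + 2*r];

    int gindex = blockIdx.x * blockDim.x + threadIdx.x; // global output index - r
    int lindex = threadIdx.x + r;                       // local index in the tile

    // Each thread loads its own element (shifted by r so that
    // valid output positions start at out[r])
    temp[lindex] = in[gindex + r];

    // The first r threads of each block also load the two halos
    if (threadIdx.x < r) {
        temp[lindex - r] = in[gindex];                            // left halo
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE + r];  // right halo
    }
    __syncthreads();

    // Sum the 2*r+1 neighbours
    float sum = 0.0f;
    for (int k = -r; k <= r; k++)
        sum += temp[lindex + k];

    out[gindex + r] = sum;
}
```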

Thank you!