__shared__ memory - 1D stencil algorithm

Hello,

I am new to CUDA and I am trying to figure out how to solve a simple problem.

I built a kernel that computes a 1D stencil. It works
like this:

  • input vector:

in=(0,1,2,3,4,5,6,7,8,9);

  • As output I would expect (for a stencil with radius r=2; see the CPU sketch just below):

out=(0,0,10,15,20,25,30,35,0,0)
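
Each output element is the sum of the 2*r+1 input elements centred on it, for example out[2] = 0+1+2+3+4 = 10, and the first and last r elements of the output are simply left at zero. On the CPU the computation I mean is roughly this (just a sketch for clarity):

void stencil_cpu(const float *in, float *out, int n, int r)
{
    // The first and last r outputs are not computed and stay at zero.
    for (int i = r; i < n - r; i++) {
        float sum = 0.0f;
        for (int k = -r; k <= r; k++)
            sum += in[i + k];
        out[i] = sum;
    }
}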

My code works fine for small vectors!

The code for my kernel:

// N is the length of the input vector and r is the stencil radius;
// both are compile-time constants (#define) in my code.
__global__ void stencil(float *in, float *out) {

    __shared__ float temp[N];   // for small N the whole input fits in shared memory

    int i = threadIdx.x;        // one thread per output element, i = 0 .. N-2*r-1

    // Each thread loads its own element ...
    temp[i + r] = in[i + r];

    // ... and the first r threads also load the r elements at each end of the vector.
    if (i < r) {
        temp[N - 1 - i] = in[N - 1 - i];
        temp[i]         = in[i];
    }

    __syncthreads();

    // Sum the 2*r+1 neighbours around element i+r.
    float sum = 0.0f;
    for (int k = -r; k <= r; k++) {
        sum += temp[i + r + k];
    }

    // --- Store the result
    out[i + r] = sum;
}

To call it I am using 1 block with one thread per computed output value,
that is N-2*r threads. In my main code
I call the kernel as:

stencil<<<1, N-2*r>>>(d_in, d_out);
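
For completeness, the relevant part of my main code looks roughly like this (a sketch; N and r are #defines, and I zero d_out so the boundary elements stay 0):

#define N 10
#define r 2

float h_in[N], h_out[N];
float *d_in, *d_out;

for (int i = 0; i < N; i++) h_in[i] = (float)i;      // in = (0, 1, 2, ..., N-1)

cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(d_out, 0, N * sizeof(float));             // so out[0..r-1] and out[N-r..N-1] stay 0

stencil<<<1, N - 2 * r>>>(d_in, d_out);

cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);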

My problem: how can I implement similar code for N=1000000?
In a standard CUDA program I would simply increase the number of blocks, but here I am not sure
how to do that, because my algorithm shares data between the threads of a block through shared memory,
and shared memory is private to each block (a single block also cannot have that many threads).
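
My guess is that each block would have to work on a tile of the input and load that tile, plus a halo of r elements on each side, into its own shared memory, roughly like the sketch below, but I am not sure this is the right way to do it (BLOCK_SIZE and the gindex/lindex names are just my attempt):

#define BLOCK_SIZE 256

__global__ void stencil_tiled(float *in, float *out, int n) {
    // Tile of BLOCK_SIZE elements plus a halo of r elements on each side.
    __shared__ float temp[BLOCK_SIZE + 2 * r];

    int gindex = blockIdx.x * blockDim.x + threadIdx.x + r;  // global index handled by this thread
    int lindex = threadIdx.x + r;                            // its position inside the tile

    // Each thread loads its own element (guarded for the last, partially filled block) ...
    if (gindex < n)
        temp[lindex] = in[gindex];
    // ... and the first r threads of the block also load the left and right halos.
    if (threadIdx.x < r) {
        temp[lindex - r] = in[gindex - r];
        if (gindex + BLOCK_SIZE < n)
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();  // reached by every thread, no early return before this point

    if (gindex < n - r) {
        float sum = 0.0f;
        for (int k = -r; k <= r; k++)
            sum += temp[lindex + k];
        out[gindex] = sum;
    }
}

// launched with enough blocks to cover all N - 2*r outputs:
// stencil_tiled<<<(N - 2*r + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out, N);

Is something like this the right direction, or is there a better way to handle the halo regions?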

Thank you!