Adjusting the 1D Stencil: can you explain a few details?

I have a moving-average problem to solve, and it is essentially a 1D stencil operation. What I am doing now is adjusting it so that the operator length is variable: shorter at the beginning and end of the array, and full length once the window fits entirely inside the array.

But my questions are about the shared memory operations in this particular implementation. I am pasting the code from https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf (slides 55 and 56), which I use as a base:

__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS)
    {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

I have already added array-length checks (otherwise cuda-memcheck reports errors), so my questions are:

  • Line 8 (temp[lindex] = in[gindex]) saves the “center element”, while lines 11 and 12 (the two loads inside the if) save one element to the left and one to the right, correct?
  • Lines 19 and 20 (the for loop over offset) apply the stencil, as the comment says. Can you explain the logic of this loop? What is it doing, and why does it work?

I was expecting one specific thread to sum the elements of the window in shared memory, take the average, multiply it by the input, and then save it to the output. Something like:

out[gindex] = in[gindex] * temp[AT_WINDOW_SUM_POSITION] / (RADIUS * 2);

But I don’t see anything similar to this, so I ask you to correct my understanding of it.
Any input is appreciated.

You’ll need to think carefully about the idea that each of these lines of code is being executed by possibly many threads. Line 8, executed by all participating threads, loads the “central” range of the data into shared memory.

Lines 11 and 12 load the left-side halo region and the right-side halo region. Each of lines 11 and 12 is executed by multiple threads, so together those participating threads load the halo region(s).
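
To make that concrete, here is a small host-side trace (not the kernel itself), assuming BLOCK_SIZE = 8, RADIUS = 3, and an interior block (blockIdx.x = 1, so none of the indices go negative). It prints which shared memory slot and which input element each thread touches:

// Minimal host-side trace of the cooperative loads, under the assumptions above.
#include <cstdio>

#define BLOCK_SIZE 8
#define RADIUS 3

int main()
{
    const int blockIdx_x = 1;  // an interior block, so gindex - RADIUS stays >= 0

    for (int tid = 0; tid < BLOCK_SIZE; ++tid)
    {
        int gindex = tid + blockIdx_x * BLOCK_SIZE;
        int lindex = tid + RADIUS;

        // Every thread loads one "center" element (line 8).
        printf("thread %d: temp[%2d] = in[%2d]\n", tid, lindex, gindex);

        // Only the first RADIUS threads also load one left-halo and one
        // right-halo element (lines 11 and 12).
        if (tid < RADIUS)
        {
            printf("thread %d: temp[%2d] = in[%2d]   (left halo)\n",
                   tid, lindex - RADIUS, gindex - RADIUS);
            printf("thread %d: temp[%2d] = in[%2d]   (right halo)\n",
                   tid, lindex + BLOCK_SIZE, gindex + BLOCK_SIZE);
        }
    }
    return 0;
}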

Once all of the data is loaded into shared memory, each thread computes its own value (the sum of the values in its stencil window), operating out of the data in shared memory. Since each output value requires a set of input values (the input values in that thread's stencil window), this is naturally realized as a loop.
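
To connect this to your moving-average expectation, here is a minimal sketch (my assumptions, not the slide code: float data, and in points RADIUS elements into a padded buffer so the halo reads stay in bounds) in which every thread, not one special thread, produces one averaged output element:

// Sketch only: the stencil example turned into a moving average, under the
// assumptions stated above.
#define BLOCK_SIZE 256
#define RADIUS 3

__global__ void moving_average_1d(const float *in, float *out)
{
    __shared__ float temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Same cooperative loading as the stencil example.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS)
    {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();

    // Each thread sums its own window of 2*RADIUS + 1 values...
    float sum = 0.0f;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        sum += temp[lindex + offset];

    // ...and writes its own average. There is no single "summing thread":
    // every thread produces one output element.
    out[gindex] = sum / (2 * RADIUS + 1);
}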

Thanks for the feedback, Robert.
I am working on it and, after understanding the logic a bit better, I found out that my entire indexing is incorrect (even though cuda-memcheck reports no errors), despite thinking I had it right.
After I finish dealing with the operator at full length, I will work on other kernels to handle the beginning and end of the array. A video by NVIDIA on CUDA Fortran mentions that separate functions are used for these parts of the array when using the stencil, which makes sense, as otherwise the function would become a small monster.
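
For reference, here is one possible shape for the edge handling (a sketch under my own assumptions, not the approach from the NVIDIA video): a small kernel that clamps the window to the valid range and divides by the number of elements actually inside it, reading global memory directly for simplicity:

// Sketch only: handles the shortened windows near the array edges.
// n is the array length; this version skips the shared-memory optimization.
#define RADIUS 3

__global__ void moving_average_edges(const float *in, float *out, int n)
{
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    if (gindex >= n)
        return;

    // Shrink the window near the edges of the array.
    int lo = max(gindex - RADIUS, 0);
    int hi = min(gindex + RADIUS, n - 1);

    float sum = 0.0f;
    for (int i = lo; i <= hi; i++)
        sum += in[i];

    // The divisor is the actual window length, not always 2*RADIUS + 1.
    out[gindex] = sum / (hi - lo + 1);
}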