I have a moving average problem to solve and it is essentially a 1D Stencil operation. What I am doing now is adjusting it so that the operator length is variable: shorter on the beginning and end of the array, and full length after some conditions are met.

But my questions are around the shared memory operations in this particular implementation. I am pasting the code from https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf, slides 55 and 56, which I use as a base:

```
__global__ void stencil_1d(int *in, int *out)
{
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + radius;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS)
{
temp[lindex – RADIUS] = in[gindex – RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
```

I have already added array length checks, otherwise cuda-memcheck reports errors, so my questions are:

- Line 8 saves the “center element”, while lines 11 and 12 save one element at left and right, correct?
- Lines 19 and 20, as said, apply the stencil. Can you explain the logic of this loop? What it is doing, why it works?

I was expecting one specific thread to do the sum of the elements of the window in shared memory, take the average and multiply the input, to then save in output. Something like:

```
out[gindex] = in[gindex] * temp[AT_WINDOW_SUM_POSITION] / (RADIUS * 2);
```

But I don’t see anything similar to this, so I ask you to correct my understanding of it.

Any input is appreciated.