Adjusting the 1D Stencil: can you explain a few details?

I have a moving-average problem to solve, and it is essentially a 1D stencil operation. What I am doing now is adjusting it so that the operator length is variable: shorter at the beginning and end of the array, and full length once the window fits entirely inside the array.

But my questions are about the shared memory operations in this particular implementation. I am pasting the code from https://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf (slides 55 and 56), which I use as a base:

__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS)
    {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

I have already added array-length checks (otherwise cuda-memcheck reports errors), so my questions are:

  • Line 8 (temp[lindex] = in[gindex]) saves the “center element”, while lines 11 and 12 (the two loads inside the if) save one element to the left and one to the right, correct?
  • Lines 19 and 20 (the for loop over offset) apply the stencil, as the comment says. Can you explain the logic of this loop? What is it doing, and why does it work?

I was expecting one specific thread to sum the elements of the window in shared memory, take the average, multiply it by the input, and then save it to the output. Something like:

out[gindex] = in[gindex] * temp[AT_WINDOW_SUM_POSITION] / (RADIUS * 2);

But I don’t see anything similar to this, so I ask you to correct my understanding of it.
Any input is appreciated.

You’ll need to think carefully about the idea that each of these lines of code is being executed by possibly many threads. Line 8, executed by all participating threads, loads the “central” range of the data into shared memory.

Lines 11 and 12 load the left-side halo region and the right-side halo region. Each of lines 11 and 12 is executed by multiple threads, so together those participating threads load the halo region(s).
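
To make that concrete, here is a small host-side trace (not the kernel itself), assuming BLOCK_SIZE = 8, RADIUS = 3, and an interior block (blockIdx.x = 1, so none of the indices go negative). It prints which shared memory slot and which input element each thread touches:

// Minimal host-side trace of the cooperative loads, under the assumptions above.
#include <cstdio>

#define BLOCK_SIZE 8
#define RADIUS 3

int main()
{
    const int blockIdx_x = 1;  // an interior block, so gindex - RADIUS stays >= 0

    for (int tid = 0; tid < BLOCK_SIZE; ++tid)
    {
        int gindex = tid + blockIdx_x * BLOCK_SIZE;
        int lindex = tid + RADIUS;

        // Every thread loads one "center" element (line 8).
        printf("thread %d: temp[%2d] = in[%2d]\n", tid, lindex, gindex);

        // Only the first RADIUS threads also load one left-halo and one
        // right-halo element (lines 11 and 12).
        if (tid < RADIUS)
        {
            printf("thread %d: temp[%2d] = in[%2d]   (left halo)\n",
                   tid, lindex - RADIUS, gindex - RADIUS);
            printf("thread %d: temp[%2d] = in[%2d]   (right halo)\n",
                   tid, lindex + BLOCK_SIZE, gindex + BLOCK_SIZE);
        }
    }
    return 0;
}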

Once all of the data is loaded into shared memory, each thread computes its own value (the sum of the values in its stencil window), operating out of the data in shared memory. Since each output value requires a set of input values (the input values in that thread's stencil window), this is naturally realized as a loop.
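
To connect this to your moving-average expectation, here is a minimal sketch (my assumptions, not the slide code: float data, and in points RADIUS elements into a padded buffer so the halo reads stay in bounds) in which every thread, not one special thread, produces one averaged output element:

// Sketch only: the stencil example turned into a moving average, under the
// assumptions stated above.
#define BLOCK_SIZE 256
#define RADIUS 3

__global__ void moving_average_1d(const float *in, float *out)
{
    __shared__ float temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Same cooperative loading as the stencil example.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS)
    {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();

    // Each thread sums its own window of 2*RADIUS + 1 values...
    float sum = 0.0f;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        sum += temp[lindex + offset];

    // ...and writes its own average. There is no single "summing thread":
    // every thread produces one output element.
    out[gindex] = sum / (2 * RADIUS + 1);
}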

Thanks for the feedback, Robert.
I am working on it and, after understanding the logic a bit better, I found out that my entire indexing is incorrect (even though cuda-memcheck reports no errors), despite thinking I had it right.
After I finish dealing with the operator at full length, I will work on other kernels to handle the beginning and end of the array. A video by NVIDIA on CUDA Fortran mentions that separate functions are used for these parts of the array when using the stencil, which makes sense, as otherwise the function would become a small monster.
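
For reference, here is one possible shape for the edge handling (a sketch under my own assumptions, not the approach from the NVIDIA video): a small kernel that clamps the window to the valid range and divides by the number of elements actually inside it, reading global memory directly for simplicity:

// Sketch only: handles the shortened windows near the array edges.
// n is the array length; this version skips the shared-memory optimization.
#define RADIUS 3

__global__ void moving_average_edges(const float *in, float *out, int n)
{
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    if (gindex >= n)
        return;

    // Shrink the window near the edges of the array.
    int lo = max(gindex - RADIUS, 0);
    int hi = min(gindex + RADIUS, n - 1);

    float sum = 0.0f;
    for (int i = lo; i <= hi; i++)
        sum += in[i];

    // The divisor is the actual window length, not always 2*RADIUS + 1.
    out[gindex] = sum / (hi - lo + 1);
}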