averaging 9 neighbors for each pixel

OK, please don’t laugh, but I need someone to tell me whether I am doing nonsense or am on the right path…
I am trying to average each pixel with its 8 neighbors (using shared memory), and I am using a statement like this:
shared [threadIdx.x ][threadIdx.y]= src[threadIdx.x][ threadIdx.y] +
src[threadIdx.x -1][ threadIdx.y -1]+
src[threadIdx.x -1][ threadIdx.y]+
src[threadIdx.x -1][ threadIdx.y + 1]+
src[threadIdx.x][ threadIdx.y - 1]+
src[threadIdx.x][ threadIdx.y + 1]+
src[threadIdx.x +1][ threadIdx.y - 1]+
src[threadIdx.x + 1][ threadIdx.y]+
src[threadIdx.x+ 1][ threadIdx.y +1];

I mean, is it possible to do this calculation? I am treating the pixels as a 3x3 neighborhood like this:

x-1,y-1   x-1,y   x-1,y+1
x,y-1     x,y     x,y+1
x+1,y-1   x+1,y   x+1,y+1

Please help!

So what exactly does not work? I do not see the averaging part of the computation in the above code snippet, just the summation. Other than that, it looks like it could work, assuming that you are using a two-dimensional array src to store the pixel values and a two-dimensional thread block.

You would want to pay attention to boundary conditions (consider x = 0, y = 0) to avoid out-of-bounds memory accesses, for example by adding padding around the region of interest. An alternative is to handle the edge cases with separate code.
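As a rough illustration of the "separate handling of edge cases" option, here is a minimal sketch that reads straight from global memory and only adds neighbors that exist. It assumes a row-major float image; the names boxAverageChecked and boxAverageHost are made up for this example, and error checking is left out for brevity. Note that threadIdx.x is unsigned, so threadIdx.x - 1 at index 0 wraps around to a huge value instead of -1, which is why the sketch copies the coordinates into signed ints first.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread averages one pixel with whichever of its
// 8 neighbors exist, so a corner uses 4 values and an edge pixel uses 6.
__global__ void boxAverageChecked(const float* src, float* dst, int width, int height)
{
    // Signed coordinates on purpose: x - 1 must be able to become -1.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // threads past the image do nothing

    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += src[ny * width + nx];
                ++count;
            }
        }
    }
    dst[y * width + x] = sum / count;   // divide by the number of pixels actually summed
}

// Hypothetical host wrapper (no error checking, for brevity).
std::vector<float> boxAverageHost(const std::vector<float>& img, int width, int height)
{
    size_t bytes = img.size() * sizeof(float);
    float *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, bytes);
    cudaMalloc(&dDst, bytes);
    cudaMemcpy(dSrc, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    boxAverageChecked<<<grid, block>>>(dSrc, dDst, width, height);

    std::vector<float> out(img.size());
    cudaMemcpy(out.data(), dDst, bytes, cudaMemcpyDeviceToHost);   // also waits for the kernel
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

Dividing by count instead of 9 keeps corners and edges from coming out darker than the interior; whether that is the right choice depends on what the result is used for.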

Well, I forgot the division by 9, you are right…
But is the concept correct? Would this work? I am not in front of my computer, but when I get there I will tell you what the compilation error says.
About boundary conditions, do you mean that some pixels (the borders) will have 5 neighbors and the corners only 3?

Yes, by “boundary conditions” I was referring to the fact that there is not a full set of neighbors available at edges and corners, so you will need to address this.

If there is an error message from the compiler, it should indicate pretty well what the problem is. CUDA is similar to basic C++ with a few extensions. You may want to review a general C++ reference, and the CUDA C Programming Guide for the CUDA-specific bits.

Once the program compiles, getting acquainted with the debugger may be a good idea. cuda-memcheck is another tool you would want to use to check for out-of-bounds accesses and race conditions. You can also use device-side printf() to support your debugging efforts (I use this a lot myself).
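To give an idea of what device-side printf() looks like in practice, here is a small sketch that prints the neighbor indices that would fall outside a block. The kernel and helper names are invented for this example. The output buffer is flushed at synchronization points such as cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical debugging kernel: report threads whose upper-left neighbor
// index would be negative, i.e. outside the block.
__global__ void reportNeighborIndices()
{
    // Signed copies: threadIdx.x is unsigned, so threadIdx.x - 1 would wrap at 0.
    int tx = threadIdx.x, ty = threadIdx.y;
    int left = tx - 1, up = ty - 1;
    if (left < 0 || up < 0) {
        printf("thread (%d,%d): neighbor (%d,%d) is out of range\n", tx, ty, left, up);
    }
}

// Hypothetical helper: launch one 4x4 block and return the CUDA status as an int.
int runReport()
{
    reportNeighborIndices<<<1, dim3(4, 4)>>>();
    // Synchronizing flushes the device-side printf buffer and surfaces launch errors.
    return static_cast<int>(cudaDeviceSynchronize());
}
```

For the memory checker, the executable is passed on the command line, for example cuda-memcheck ./app, where app is a placeholder for your own program. Newer CUDA toolkits ship compute-sanitizer as the successor to cuda-memcheck, and it is used the same way; with either tool, --tool racecheck looks for shared-memory race conditions.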

I would recommend going back and forth between hands-on experimentation with the code and consulting the relevant documentation to figure out how it is meant to work. You may also want to check whether there is an example program among the numerous SDK examples that demonstrates techniques relevant to your work. There is probably something in the SDK that uses stencil operations, but I do not have a good overview (maybe another forum participant has a useful link).

Without knowing exactly what you are looking for, this may be way off base, but have you thought about a 2D convolution? Like njuffa said, you will likely need to consider an “apron” to pad the edges to avoid “wrap around”. I have done something similar using cuFFT and it was very fast.

OK, can anyone give an example of how to average, for example, the top-left corner of the image, which has only 3 neighbors, or a border pixel, which has 5 neighbors?
I am so confused by the indexing (threadIdx.x and threadIdx.y).

Please advise, and thanks.

// define the shared memory with 1 extra element on each side (this can also help avoid bank conflicts)
// BlockDimX and BlockDimY must be compile-time constants
__shared__ int src[BlockDimX+2][BlockDimY+2];

Put the neighbor elements in the shared matrix; if you are on the edge of the image, just put zeros. Then add 1 to the indexes in the code from your first message.
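Putting those pieces together, a sketch of the whole kernel could look like the following. This is only one way to do it, under a few assumptions: the image is a row-major int array, the block is launched with exactly TILE_X x TILE_Y threads, and the names TILE_X, TILE_Y, boxAverageShared and boxAverageSharedHost are made up for the example. The tile is indexed [y][x] rather than [x][y] so that neighboring threads read neighboring shared-memory words. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed compile-time tile size; plays the role of BlockDimX / BlockDimY above.
constexpr int TILE_X = 16;
constexpr int TILE_Y = 16;

__global__ void boxAverageShared(const int* src, int* dst, int width, int height)
{
    // One extra row/column on every side holds the neighbors of the tile's border pixels.
    __shared__ int tile[TILE_Y + 2][TILE_X + 2];

    int x0 = blockIdx.x * TILE_X - 1;   // image coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE_Y - 1;

    // Cooperative load: the block strides over the padded tile, so some
    // threads load more than one element.
    for (int ty = threadIdx.y; ty < TILE_Y + 2; ty += TILE_Y) {
        for (int tx = threadIdx.x; tx < TILE_X + 2; tx += TILE_X) {
            int gx = x0 + tx, gy = y0 + ty;
            bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
            tile[ty][tx] = inside ? src[gy * width + gx] : 0;   // zeros outside the image
        }
    }
    __syncthreads();   // all loads must finish before any thread reads its neighbors

    int x = blockIdx.x * TILE_X + threadIdx.x;
    int y = blockIdx.y * TILE_Y + threadIdx.y;
    if (x >= width || y >= height) return;

    int sx = threadIdx.x + 1, sy = threadIdx.y + 1;   // the "add 1" shift into the padded tile
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += tile[sy + dy][sx + dx];
    dst[y * width + x] = sum / 9;   // integer division truncates
}

// Hypothetical host wrapper (no error checking, for brevity).
std::vector<int> boxAverageSharedHost(const std::vector<int>& img, int width, int height)
{
    size_t bytes = img.size() * sizeof(int);
    int *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, bytes);
    cudaMalloc(&dDst, bytes);
    cudaMemcpy(dSrc, img.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_X, TILE_Y);   // must match the tile size assumed by the kernel
    dim3 grid((width + TILE_X - 1) / TILE_X, (height + TILE_Y - 1) / TILE_Y);
    boxAverageShared<<<grid, block>>>(dSrc, dDst, width, height);

    std::vector<int> out(img.size());
    cudaMemcpy(out.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return out;
}
```

Because the padding is zero and the sum is always divided by 9, corner and edge pixels come out darker than a true average of their existing neighbors would be. For a 3x3 image filled with 9s, for instance, the center stays 9 while a corner becomes 4.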

You have some neat projects on your GitHub, thanks for posting…