Shared memory & overlapping tiles for image processing: optimizing computation and maximizing data reuse


I want to perform convolution/morphological operators using shared memory, for obvious reasons.
My operators are a bit complex, but behave like, let's say, two 3x3 convolutions or one 5x5 convolution.
I want to implement the 5x5 convolution (for example, a standard 5x5 erosion) as two cascaded 3x3 convolutions:
the two 3x3 convolutions have a lower complexity than the single 5x5 (2x9 = 18 operations per pixel instead of 25). That's the reason why I want to process the data this way.

my original 5x5 convolution looks like
Y[i][j] = X[i-2][j-2] AND X[i-2][j-1] AND … AND X[i-2][j+2]
          AND …
          AND X[i+2][j-2] AND … AND X[i+2][j+2]

each of the two 3x3 convolutions looks like
Y[i][j] = X[i-1][j-1] AND X[i-1][j] AND X[i-1][j+1]
          AND X[i][j-1] AND X[i][j] AND X[i][j+1]
          AND X[i+1][j-1] AND X[i+1][j] AND X[i+1][j+1]

but while the iteration space of the second convolution is (i,j) in [0…h-1]x[0…w-1], with h and w the tile's size,
the iteration space of the first one is [-1…h]x[-1…w].
Let's assume there is a hidden offset-addressing scheme that maps this onto zero-based addressing compatible with shared memory (that is: [0…h+2-1]x[0…w+2-1]).

In order to save time, the data are stored in shared memory before, during, and after processing.

My problem is how to tell the thread block to use an iteration space of [-1…h]x[-1…w],
so that the processing scheme is
first step: (h+4)x(w+4) tile -> (h+2)x(w+2) tile [first convolution]
second step: (h+2)x(w+2) tile -> (h)x(w) tile [second convolution]

This processing scheme is different from the "separable convolution" paper from NVIDIA, as I want to store the intermediate results in shared memory.

So, is it possible to implement such a scheme with CUDA?

Thanks in advance