Non-Separable and Non-Linear Image Filter


I’m quite new to CUDA, so I’m hoping for some advice on how to get started :)

I’d like to do some 2D image processing with CUDA.
I had a look at the different SDK examples regarding this topic and
I also read the paper “Image Convolution With Cuda” by Victor Podlozhnyuk.

My focus is on non-linear and non-separable image filters with a maximum radius of 16 pixels.

I was thinking of creating 16x16 thread blocks, with every thread calculating one pixel.
During the loading stage, every thread copies data from global memory into shared memory.
Then the entire image processing would take place in shared memory.
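
To put rough numbers on this plan, here is a small host-side C++ sketch (it should also compile unchanged inside a .cu file) that estimates how much shared memory one block would need once the filter radius is added as a border around the 16x16 block. The names `TileFootprint` and `tile_footprint` are made up for illustration, and the 4 bytes per pixel in the usage note is an assumption (single-channel float), not something fixed by my setup.

```cpp
#include <cassert>
#include <cstddef>

// Shared-memory footprint of one thread block when every thread produces
// one output pixel and the filter reaches `radius` pixels in each direction.
// Hypothetical helper type, for illustration only.
struct TileFootprint {
    std::size_t tile_width;       // block width plus the border on both sides
    std::size_t tile_height;      // block height plus the border on both sides
    std::size_t tile_pixels;      // pixels staged in shared memory
    std::size_t tile_bytes;       // shared memory needed by one block
    std::size_t loads_per_thread; // tile pixels each thread copies (rounded up)
};

inline TileFootprint tile_footprint(std::size_t block_w, std::size_t block_h,
                                    std::size_t radius,
                                    std::size_t bytes_per_pixel) {
    assert(block_w > 0 && block_h > 0);
    TileFootprint f{};
    f.tile_width = block_w + 2 * radius;
    f.tile_height = block_h + 2 * radius;
    f.tile_pixels = f.tile_width * f.tile_height;
    f.tile_bytes = f.tile_pixels * bytes_per_pixel;
    const std::size_t threads = block_w * block_h;
    // Ceiling division: the last round of loads may be only partly used.
    f.loads_per_thread = (f.tile_pixels + threads - 1) / threads;
    return f;
}
```

Calling `tile_footprint(16, 16, 16, 4)` gives a 48x48 tile, i.e. 2304 pixels or 9216 bytes, so each of the 256 threads would have to load about 9 pixels to produce a single output pixel. That ratio is part of why I’m unsure about the block size.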

The filter kernel (if there is one) would be copied to constant memory.

Image size is usually around 1024 x 1024.
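
For reference, the launch size and the filter storage implied by these numbers can be worked out with the small C++ sketch below. The helper names `blocks_for` and `kernel_bytes` are hypothetical, and the 4 bytes per coefficient used in the usage note is again an assumption (float coefficients).

```cpp
#include <cstddef>

// Number of blocks needed to cover `extent` pixels with blocks of `block`
// pixels along one dimension (ceiling division, so partial blocks count).
inline std::size_t blocks_for(std::size_t extent, std::size_t block) {
    return (extent + block - 1) / block;
}

// Bytes taken by a dense square filter kernel that reaches `radius` pixels
// in each direction, i.e. (2 * radius + 1) coefficients per side.
inline std::size_t kernel_bytes(std::size_t radius,
                                std::size_t bytes_per_coeff) {
    const std::size_t side = 2 * radius + 1;
    return side * side * bytes_per_coeff;
}
```

With `blocks_for(1024, 16)` the grid comes out at 64 blocks per dimension, so 64 x 64 = 4096 blocks in total. `kernel_bytes(16, 4)` gives 33 x 33 x 4 = 4356 bytes, which, as far as I understand, would fit comfortably within the 64 KB of constant memory.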

My questions are:

Is this a good approach?
Is a 16x16 thread block in particular a good solution?
Should I use texture memory? Maybe in addition to shared memory?

Best Regards,