Image Convolution

Good day, ladies and gentlemen!

I am not sure I am going about this the right way, so I wanted to ask the people here who know more about this than I do.

I have a large float-valued image in device memory (~ 3000 x 2000 px). I need to convolve this image repeatedly with small kernels like .

What is the fastest way to achieve this?

What I would like to do is:

  • load 3 rows of the image into shared memory (if possible without exceeding the 16 KB of shared memory)
  • convolve them with the small kernel
  • do it all again
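
One common variation on the steps above is to stage a small 2D tile plus its halo in shared memory instead of whole rows. The following is only a minimal sketch under stated assumptions, not a tuned implementation: the 3x3 filter size (`RADIUS = 1`), the 16x16 tile, zero padding at the image border, and the names `convolve2d`, `launchConvolve2d` and `d_filter` are all illustrative choices that do not come from the question.

```cuda
#include <cuda_runtime.h>

// Illustrative sizes (assumptions): 3x3 filter, 16x16 output pixels per block.
constexpr int RADIUS = 1;
constexpr int KSIZE  = 2 * RADIUS + 1;
constexpr int TILE   = 16;
constexpr int SMEM_W = TILE + 2 * RADIUS;   // tile width including the halo

// Filter coefficients in constant memory, row-major, KSIZE x KSIZE.
__constant__ float d_filter[KSIZE * KSIZE];

__global__ void convolve2d(const float* in, float* out, int width, int height)
{
    // 18 x 18 floats = 1296 bytes per block, far below 16 KB.
    __shared__ float tile[SMEM_W][SMEM_W];

    const int x0 = blockIdx.x * TILE - RADIUS;  // image coordinates of tile[0][0]
    const int y0 = blockIdx.y * TILE - RADIUS;

    // Cooperative load: the block's threads stride over tile plus halo.
    // Pixels outside the image are read as zero (zero padding, an assumption).
    for (int i = threadIdx.y * TILE + threadIdx.x; i < SMEM_W * SMEM_W; i += TILE * TILE) {
        const int sy = i / SMEM_W, sx = i % SMEM_W;
        const int gx = x0 + sx,    gy = y0 + sy;
        tile[sy][sx] = (gx >= 0 && gx < width && gy >= 0 && gy < height)
                           ? in[gy * width + gx] : 0.0f;
    }
    __syncthreads();

    const int x = blockIdx.x * TILE + threadIdx.x;
    const int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;   // safe: no barrier after this point

    // True convolution: the filter is indexed flipped relative to the image.
    float sum = 0.0f;
    for (int ky = 0; ky < KSIZE; ++ky)
        for (int kx = 0; kx < KSIZE; ++kx)
            sum += d_filter[(KSIZE - 1 - ky) * KSIZE + (KSIZE - 1 - kx)]
                 * tile[threadIdx.y + ky][threadIdx.x + kx];
    out[y * width + x] = sum;
}

// Host-side launcher. d_in and d_out are device buffers of width*height floats;
// h_filter is a host array of KSIZE*KSIZE coefficients.
cudaError_t launchConvolve2d(const float* d_in, float* d_out,
                             int width, int height, const float* h_filter)
{
    cudaError_t err = cudaMemcpyToSymbol(d_filter, h_filter,
                                         sizeof(float) * KSIZE * KSIZE);
    if (err != cudaSuccess) return err;
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    convolve2d<<<grid, block>>>(d_in, d_out, width, height);
    return cudaGetLastError();
}
```

Each block reads every input pixel of its tile from global memory once and reuses it from shared memory for all the outputs that depend on it. The input and output buffers must be different, since neighbouring blocks read each other's border pixels.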

Is this the right way to do it, has something like this already been done, or is there a better way?
I think cuFFT is pretty slow for kernels this small, and the kernel is not separable either (unlike the one in the NVIDIA tutorials).

Thanks for any advice!

The first convolution will need 3 rows, the second one 5 rows, and so on. This is because each convolution spreads the information from each source pixel to its neighbors. So there’s clearly a limit to what you can do with the limited shared memory in a multi-pass approach.
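
The row counts above can be checked with a little host-side arithmetic. This is a rough sketch: the helper names `rowsNeeded` and `bytesNeeded` are made up for illustration, a radius of 1 corresponds to the 3 and 5 rows mentioned, and the width of 3000 floats is taken from the approximate image size in the question.

```cuda
#include <cstddef>

// Input rows that one output row depends on after `passes` successive
// convolutions with a filter of radius `radius` (radius 1 = 3x3 filter).
constexpr int rowsNeeded(int passes, int radius)
{
    return 2 * passes * radius + 1;
}

// Bytes needed to stage that many full-width rows of floats.
constexpr std::size_t bytesNeeded(int passes, int radius, int width)
{
    return static_cast<std::size_t>(rowsNeeded(passes, radius)) * width * sizeof(float);
}

constexpr std::size_t kSharedBytes = 16 * 1024;  // the 16 KB limit from the question

static_assert(rowsNeeded(1, 1) == 3, "first pass needs 3 rows");
static_assert(rowsNeeded(2, 1) == 5, "second pass needs 5 rows");
// Assuming 4-byte floats: 3 full rows of 3000 pixels are 36,000 bytes.
static_assert(bytesNeeded(1, 1, 3000) == 36000, "3 rows * 3000 px * 4 bytes");
static_assert(bytesNeeded(1, 1, 3000) > kSharedBytes, "3 full rows exceed 16 KB");
```

Under these assumptions even a single pass over full-width rows does not fit in 16 KB, so staging narrower tiles rather than whole rows looks like the more practical route.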