Good day, ladies and gentlemen!
I am not sure whether I am approaching this the right way, so I wanted to ask people who know more about this than I do.
What I have is a large float-valued image in device memory (~3000 x 2000 px) that I need to convolve frequently with small kernels.
What is the fastest way to achieve this?
What I would like to do is the following (a rough sketch is below the list):
- load 3 rows of the image into shared memory (if possible without exceeding the 16 KB limit)
- convolve that strip with the small kernel
- repeat for the next rows
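Something along these lines is what I have in mind. This is only a minimal sketch of the tiled idea, using a 2D tile with a one-pixel apron instead of exactly three rows, and assuming a 3x3 kernel in constant memory with clamp-to-edge borders; the real kernel size, coefficients, and border handling may differ.

// Sketch of a tiled 3x3 convolution (assumptions: 3x3 kernel, clamp-to-edge borders).
#include <cuda_runtime.h>

#define TILE_W 16
#define TILE_H 16
#define RADIUS 1   // 3x3 kernel -> radius 1

// Kernel coefficients in constant memory (placeholder; filled from the host via cudaMemcpyToSymbol).
__constant__ float d_kernel[2 * RADIUS + 1][2 * RADIUS + 1];

__global__ void convolve3x3(const float* in, float* out, int width, int height)
{
    // Shared tile with a one-pixel apron on each side (18 x 18 floats ~ 1.3 KB, well under 16 KB).
    __shared__ float tile[TILE_H + 2 * RADIUS][TILE_W + 2 * RADIUS];

    int x = blockIdx.x * TILE_W + threadIdx.x;
    int y = blockIdx.y * TILE_H + threadIdx.y;

    // Cooperatively load the tile plus apron; each thread loads up to four pixels,
    // clamping coordinates to the image borders.
    for (int dy = threadIdx.y; dy < TILE_H + 2 * RADIUS; dy += TILE_H) {
        for (int dx = threadIdx.x; dx < TILE_W + 2 * RADIUS; dx += TILE_W) {
            int gx = min(max(blockIdx.x * TILE_W + dx - RADIUS, 0), width  - 1);
            int gy = min(max(blockIdx.y * TILE_H + dy - RADIUS, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    }
    __syncthreads();

    if (x >= width || y >= height) return;

    // Accumulate the 3x3 neighborhood entirely from shared memory.
    float sum = 0.0f;
    for (int ky = -RADIUS; ky <= RADIUS; ++ky)
        for (int kx = -RADIUS; kx <= RADIUS; ++kx)
            sum += d_kernel[ky + RADIUS][kx + RADIUS]
                 * tile[threadIdx.y + RADIUS + ky][threadIdx.x + RADIUS + kx];

    out[y * width + x] = sum;
}

On the host I would launch it with a 16x16 block and a grid of ceil(width/16) x ceil(height/16) blocks; the constant-memory kernel keeps the coefficient reads cheap since every thread in a warp reads the same value.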
Is this a reasonable way to do it, or does something like this already exist, or is there a better approach?
CUFFT seems pretty slow for such small kernels, I think, and my kernel is not separable (unlike the ones in the NVIDIA tutorials I have seen).
Thanks for any advice!