2D CUDA convolution

Do you have the patience to answer a novice?

I need to convolve a 10x10 float kernel over many 2K x 2K float images. Is there something already in cuBLAS or cuFFT for doing this? (For cuFFT I assume I would have to transform both the image and the kernel to Fourier space first.) Let's assume I can't use OpenCV, unless it is only to copy from its source.

Or should I roll my own along the lines of: https://www.evl.uic.edu/sjames/cs525/final.html

It might be that I can get by with a smaller kernel. Yes, it is separable. For the moment our notion is that it is Gaussian, but a matched filter is probably even more to our liking (we are looking for small star-like objects in a black field).

One image-processing guy suggested first creating an integral image and then doing a box filter. What about that? (He has no idea about CUDA.) Would that be cheaper than a Fourier transform?

If it is separable, then it is rather easy to implement in CUDA, and will run very quickly.

You will have an issue with how to deal with the margins, and there are a number of approaches to the problem.

If the kernel length is less than 128, then rolling your own will probably be the fastest approach.

As pointed out in your link, the NVIDIA separable convolution sample code is pretty fast, and includes a whitepaper:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-separable-convolution

Hello,

The paper cuDNN: Efficient Primitives for Deep Learning suggests that using cuBLAS's GEMM routine for general 2D convolution is faster than the direct convolution of a mask over an image.


The GEMM approach uses more memory to rearrange the image into a form ready for a matrix operation, which is highly parallelizable.

I am wondering about newer hardware: the GTX TITAN family has 48 KB of shared memory per block, and new Pascal systems will have even more shared memory per block.
Once the shared memory is big enough that we can load an entire image (stride = 1), the filter mask, and the output matrix into it, wouldn't direct mask-image convolution be both faster and more memory-efficient than the GEMM approach?

My question is whether the direct convolution approach is more future-proof than the GEMM route. The direct method uses less memory and is easy to code, and the GPU handles the caching.

Thank you for your insight.