2D CUDA convolution

Do you have the patience to answer a novice?

I need to convolve a 10x10 float kernel over many 2K x 2K float images. Is there something already in cuBLAS or cuFFT for doing this? (For cuFFT I assume I would have to transform both the image and the kernel to Fourier space first.) Let's assume I can't use OpenCV, unless it is only to copy from its source.

Or should I roll my own along the lines of: https://www.evl.uic.edu/sjames/cs525/final.html

It might be that I can get by with a smaller kernel. Yes, it is separable. For the moment our notion is that it is Gaussian, but a matched filter is probably even more to our liking (we are looking for small star-like objects in a black field).

One image-processing guy suggested first creating an integral image and then doing a box filter. What about that? (He has no idea about CUDA.) Would that be cheaper than a Fourier transform?

If it is separable, then it is rather easy to implement in CUDA, and will run very quickly.

You will have an issue with how to deal with the margins, and there are a number of approaches to the problem.

If the kernel length is less than 128, then rolling your own will probably be the fastest approach.

As pointed out in your link, the NVIDIA separable convolution sample code is pretty fast, and includes a whitepaper:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-separable-convolution

Hello,

The paper cuDNN: Efficient Primitives for Deep Learning suggests that using cuBLAS's GEMM routine for general 2D convolution is faster than the direct convolution of a mask over an image.


The GEMM approach uses more memory to rearrange the image into a form ready for a matrix operation, which is highly parallelizable.

I am wondering about newer hardware: the GTX TITAN family has 48 KB of shared memory per block, and new Pascal systems will have even more shared memory per block.
Once the shared memory is big enough that we can load an entire image (stride = 1), the filter mask, and the output matrix into it, wouldn't direct mask-image convolution be both faster and more memory-efficient than the GEMM approach?

My question is whether the direct convolution approach is more future-proof than the GEMM route. The direct method uses less memory and is easy to code, and the GPU handles the caching.

Thank you for your insight.