Arbitrary 2D convolution

What is currently the best open-source arbitrary 2D convolution implementation in CUDA?

The kernel is non-separable.

Thanks for any suggestions.

Could you post the math formula for the 2D convolution you have in mind? I guess you could use an optimized library such as CUBLAS for that kind of operation. I guess a 2D convolution could also be seen as equivalent to a 2D reduction over an array?

Pascal
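
For reference, the discrete 2D convolution under discussion is usually written as below. This is the standard textbook definition rather than anything posted in the thread; the symbols $f$ (image), $k$ (kernel) and the half-widths $r_x$, $r_y$ are notation chosen here for illustration.

$$
(f * k)(x, y) = \sum_{j=-r_y}^{r_y} \; \sum_{i=-r_x}^{r_x} k(i, j)\, f(x - i,\; y - j)
$$

Each output pixel is a weighted sum over a $(2r_x + 1) \times (2r_y + 1)$ neighbourhood, so the cost per pixel grows with the kernel area. A separable kernel factors as $k(i, j) = a(i)\, b(j)$ and can be applied as two 1D passes; a non-separable kernel, as in the question, needs the full double sum.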

There is a great presentation on convolutions by Joe Stam:

I just don’t know whether this is the current state of the art
or whether there is something better out there.

When you say ‘best open source arbitrary 2D convolution implementation,’ you have to be careful. The ‘best’ arbitrary convolution solution that handles all kernel sizes will certainly be worse than one that can, say, fit into shared memory. Also, at some point the number of operations pushes you to do the convolution in frequency space via an FFT. There is no “best … arbitrary” implementation unless it looks at the kernel size, checks your compute capability to determine what it can store in shared memory, then possibly plans and runs a sample FFT for your specific size and compares the timing against the shared-memory and texture-based paths before deciding which one to take. I’m not saying this can’t be done, but there’s no single ‘best’ solution that handles all sizes. The relative performance of the different methods will vary widely (and some may not work at all) with different-sized convolutants (a Bush-ism?).
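
The kind of dispatcher described above could be sketched roughly as follows. This is only an illustration under loud assumptions: it replaces the timing runs with a crude operation count, the constants (5 N log2 N flops per FFT, three transforms, 6 flops per complex multiply) are textbook estimates rather than measurements, it only checks whether the kernel coefficients fit in shared memory (a real implementation would also need room for an image tile), and every name in it (`ConvPath`, `choosePath`, and so on) is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical labels for the three strategies mentioned in the post.
enum class ConvPath { SharedMemoryDirect, TextureDirect, FFT };

// Smallest power of two that is >= n (FFT sizes are padded up to this).
std::size_t nextPow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Direct convolution: one multiply-add per kernel tap per output pixel.
double directOps(std::size_t w, std::size_t h, std::size_t kw, std::size_t kh) {
    return static_cast<double>(w) * h * kw * kh;
}

// FFT convolution, assumed cost: three transforms (image, kernel, inverse)
// at ~5 N log2 N flops each, plus a pointwise complex multiply (~6 flops).
double fftOps(std::size_t w, std::size_t h, std::size_t kw, std::size_t kh) {
    const double n = static_cast<double>(nextPow2(w + kw - 1)) *
                     static_cast<double>(nextPow2(h + kh - 1));
    return 3.0 * 5.0 * n * std::log2(n) + 6.0 * n;
}

// Pick a path from the estimated op counts and the shared memory available.
// A real dispatcher would time each candidate instead of trusting estimates.
ConvPath choosePath(std::size_t w, std::size_t h, std::size_t kw, std::size_t kh,
                    std::size_t sharedMemBytes) {
    if (fftOps(w, h, kw, kh) < directOps(w, h, kw, kh)) {
        return ConvPath::FFT;
    }
    const std::size_t kernelBytes = kw * kh * sizeof(float);
    return kernelBytes <= sharedMemBytes ? ConvPath::SharedMemoryDirect
                                         : ConvPath::TextureDirect;
}
```

For example, `choosePath(512, 512, 3, 3, 48 * 1024)` selects the shared-memory path under these assumptions, while a 64x64 kernel on the same image tips the estimate toward the FFT. The exact crossover depends entirely on the assumed constants, so it should not be read as a measured threshold.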

I found an interesting paper that delves into this problem and draws the line at kernel sizes of 31-41 pixels as the transition between a texture-based image-space convolution and a frequency-space FFT solution. Unfortunately, the paper is 6 years old, so it’s most likely out of date (especially since it’s talking about 6000 and 7000 series chips). However, the discussion inside is still relevant.

Here’s another paper that compares CPUs (with SSE), GPUs, and some Xilinx FPGAs, but it’s also old (5000/6000 series and Spartan/Virtex II series). This one directly compares the throughput of each system in MP/s for various square kernels (when you say ‘arbitrary,’ did you mean ‘arbitrary-sized square kernels’, or can the kernels be rectangular?), but only from 2x2 to 11x11 (this is probably limited by memory somewhere, or by resource availability in the FPGA).

Do you plan to do it in real space or in inverse (frequency) space? Short-range kernels are efficient in real space, while long-range kernels are better handled in inverse space with an FFT.
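
As a baseline for the real-space route, here is a minimal CPU reference in C++ that a CUDA kernel could be checked against. It is a sketch, not an optimized or authoritative implementation: the function name `convolve2d`, the row-major layout, the zero padding at the borders and the centred kernel origin are all assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Direct ("real space") 2D convolution of a row-major image with a
// non-separable kernel. Assumptions: pixels outside the image are zero,
// the output has the same size as the input, and the kernel origin is
// its centre tap (kw / 2, kh / 2).
std::vector<float> convolve2d(const std::vector<float>& image, int width, int height,
                              const std::vector<float>& kernel, int kw, int kh) {
    std::vector<float> out(static_cast<std::size_t>(width) * height, 0.0f);
    const int cx = kw / 2;
    const int cy = kh / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int j = 0; j < kh; ++j) {
                for (int i = 0; i < kw; ++i) {
                    // True convolution flips the kernel: tap (i, j) reads
                    // the pixel at offset -(i - cx), -(j - cy).
                    const int sx = x - (i - cx);
                    const int sy = y - (j - cy);
                    if (sx >= 0 && sx < width && sy >= 0 && sy < height) {
                        acc += kernel[j * kw + i] * image[sy * width + sx];
                    }
                }
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}
```

The work is width x height x kw x kh multiply-adds, which is why this route suits short-range kernels. A quick sanity check is to convolve an image holding a single 1 at its centre with any kernel of the same size: the output should reproduce the kernel.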