The best algorithm for the Gaussian filter in CUDA

Hello everyone:
Happy new year!
This is my first time posting to the NVIDIA forum, and of course I am quite new to GPGPU (just a few weeks).
I was wondering: has anybody tried to implement a recursive Gaussian filter the way Ian T. Young did in his paper? (Not Deriche's algorithm from the SDK examples, which has two disadvantages: it is not circularly symmetric in 2-D, and it is not the target filter, the Gaussian.) In my mind, and in my experiments with the C code, it is the best algorithm for the Gaussian filter to date.
Ian T. Young's paper:

the improved parameters from another of his papers:

the boundary conditions for the filter:[attachment=8170:boundary…e_filter.pdf]

Also, certain anisotropic (non-symmetric) Gaussian filters with arbitrary orientation are of great importance in image processing. Stanley and Bertram Shi proposed an algorithm that decomposes the anisotropic Gaussian filter into three one-dimensional Gaussian filters (the ordinary separable filter plus a 45-degree or -45-degree filter), which do not need interpolation. This makes the algorithm more parallel and better suited to GPGPU processing.
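To make the "extra axis" concrete, here is a minimal sketch (function names and the `box3` example filter are mine, not from the paper) of the diagonal pass that the triple-axis decomposition needs: running a 1-D filter along every 45-degree (down-right) diagonal of a w x h image. In the real decomposition, the callback would be a 1-D Gaussian with the sigma assigned to the diagonal axis.

```c
#include <stddef.h>

typedef void (*filter_1d_fn)(double *line, size_t n);

/* example callback: 3-tap box blur with edge replication */
static void box3(double *line, size_t n)
{
    if (n < 2) return;
    double prev = line[0];                       /* replicated left edge */
    for (size_t i = 0; i < n; i++) {
        double next = (i + 1 < n) ? line[i + 1] : line[n - 1];
        double cur  = line[i];
        line[i] = (prev + cur + next) / 3.0;
        prev = cur;                              /* original value, not the blurred one */
    }
}

/* scratch must hold at least min(w, h) doubles (the longest diagonal) */
static void filter_diagonals_45(double *img, size_t w, size_t h,
                                filter_1d_fn filter_1d, double *scratch)
{
    for (size_t d = 0; d < w + h - 1; d++) {
        /* each diagonal starts on the top row or on the left column */
        size_t x0 = (d < w) ? d : 0;
        size_t y0 = (d < w) ? 0 : d - w + 1;
        size_t n = 0;
        for (size_t x = x0, y = y0; x < w && y < h; x++, y++)
            scratch[n++] = img[y * w + x];       /* gather the diagonal */
        filter_1d(scratch, n);
        n = 0;
        for (size_t x = x0, y = y0; x < w && y < h; x++, y++)
            img[y * w + x] = scratch[n++];       /* scatter it back */
    }
}
```

Every pixel lies on exactly one such diagonal, so this pass visits each pixel once, just like the row and column passes. For the -45-degree variant, walk the diagonals in the (+1, -1) direction instead.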
The paper by Stanley and Bertram E. Shi:

I am wondering whether someone with sophisticated CUDA programming skills would be interested in this code. If it turned out to be much more efficient than the C code, it could become the standard Gaussian GPU code, and it would be very useful for me and for other researchers who want to use NVIDIA graphics cards to accelerate their whole pipeline.

The C code I composed: (51.8 KB)

boundary_recursive_filter.pdf (72.1 KB)
Stanley___TripleAxis_Decomposition.pdf (497 KB)
Recursive_implementation_of_the_Gaussian_filter.pdf (562 KB)
Recursive_Gabor_Filtering.pdf (341 KB)

no one interested?! :(

the first thing i programmed in cuda was a 3d gaussian filter. i used the naïve approach with the 3 separated 1d filter kernels and it was up to 150x faster on a gtx260 than on my dual-opteron. that was way more than fast enough for my application, so i didn't even care to get the last bit of performance out of it. ;-)
so, for me, it just isn't that appealing to get a faster version, as i already have one that's faster than i need it to be.

In my mind, the advantage of the above algorithm is that the speed of the Gaussian operation is independent of the smoothing level of the filter (the sigma), whereas the width of the ordinary Gaussian kernel grows with sigma (larger sigma, larger width, more operations per pixel, slower speed). The cost per pixel in the above algorithm is fixed: 6 MADDs (multiply-adds).

just 6 madds per pixel? that's fast… could be too fast. if it reaches a throughput of more than 2.5 GB/s, or 5 GB/s if you have more than 1 image, with SSE on a nehalem, it won't be any faster on a gpu, because you need the same time just to get your data to the gpu and back. it would only benefit from the gpu if you already have your data there, or if you can leave it there for further processing.
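That argument can be made concrete with a back-of-the-envelope helper (the numbers below are illustrative, not measured): if the CPU can filter the image in less time than the PCIe round trip, shipping it to the GPU and back cannot win unless the data already lives on the GPU.

```c
/* round-trip transfer time for an image at a given effective link rate;
   image_mb in MB, link_gb_per_s in GB/s (1 GB = 1024 MB here) */
static double pcie_roundtrip_ms(double image_mb, double link_gb_per_s)
{
    double seconds_one_way = image_mb / (link_gb_per_s * 1024.0);
    return 2.0 * seconds_one_way * 1000.0;   /* host->device + device->host */
}
```

For example, a 1024x1024 single-channel float image is 4 MB; at an effective 5 GB/s the round trip alone is about 1.56 ms, which is the budget the GPU kernel plus transfers must beat.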

ps: i'll definitely take a look at this paper next week, as i simply can't imagine doing gaussian filtering with 6 madds ;-)

The memory bandwidth is the problem! In my expectation, the C code of the above algorithm will be much faster than the ordinary algorithm, but the CUDA code of the above algorithm may not be much faster than the same C code.

the speed of c code and cuda code is not comparable anyway; the hardware isn't the same!

how can the convolution be implemented faster?

I want to use the GPU to accelerate the program. If the CUDA code for the algorithm is not much faster than the C code for the algorithm, there is no point to the project.

Does anyone know how to implement the boundary conditions described in "Boundary conditions for Young - van Vliet recursive filtering" for the filters given in "Recursive implementation of the Gaussian filter"? >.<
It's killing me at this point… :wacko:
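While waiting for someone who has implemented the closed-form initialization from that paper, one brute-force reference (my own sketch, not the paper's method) is useful for testing any boundary implementation: extend the signal explicitly (constant extension here), run zero-initialized forward and backward recursions over the padded copy, and crop. With enough padding the zero initialization has decayed away, so the cropped result is what the exact boundary formulas should reproduce. B and b1..b3 are the b0-normalized recursive-Gaussian coefficients.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* returns 0 on success, -1 on allocation failure;
   pad should be generous (e.g. a hundred samples or ~10*sigma) */
static int gauss_ref_constant_pad(const double *in, double *out, size_t n,
                                  double B, double b1, double b2, double b3,
                                  size_t pad)
{
    size_t m = n + 2 * pad;
    double *buf = malloc(m * sizeof *buf);
    if (!buf) return -1;
    for (size_t i = 0; i < pad; i++)     buf[i] = in[0];       /* left extension  */
    memcpy(buf + pad, in, n * sizeof *in);
    for (size_t i = pad + n; i < m; i++) buf[i] = in[n - 1];   /* right extension */

    /* forward pass with zero history (transient dies inside the padding) */
    double w1 = 0, w2 = 0, w3 = 0;
    for (size_t i = 0; i < m; i++) {
        double w = B * buf[i] + b1 * w1 + b2 * w2 + b3 * w3;
        w3 = w2; w2 = w1; w1 = w;
        buf[i] = w;
    }
    /* backward pass with zero history */
    double o1 = 0, o2 = 0, o3 = 0;
    for (size_t i = m; i-- > 0; ) {
        double o = B * buf[i] + b1 * o1 + b2 * o2 + b3 * o3;
        o3 = o2; o2 = o1; o1 = o;
        buf[i] = o;
    }
    memcpy(out, buf + pad, n * sizeof *out);   /* crop back to n samples */
    free(buf);
    return 0;
}
```

An efficient boundary implementation should match this reference to within rounding; it is too slow for production (it filters 2*pad extra samples per line), but it settles arguments about what the correct answer at the edges is.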

Sure there is. An algorithm like Gaussian filtering doesn't always need to run in isolation. If it's part of a larger pipeline, an efficient GPU implementation lets you keep the data on the GPU longer, which increases your chances of an end-to-end pipeline that outperforms a CPU-only implementation.

I agree with that. One use of a good, really fast Gaussian filter could be generating thumbnails from RAW or JPEG files: the GPU would transform RAW (SLR sensor raw data) or JPEG files into a 2-D image array, then apply filters (including Gaussian) and generate the thumbnails.

This is useful in tools like Aperture or Adobe Photoshop Lightroom, as well as the Mac OS X Finder, Windows Explorer, GNOME or KDE, and for our picture-storage website :-)