I ran the benchmark of the CUDA 6.5 sample boxFilter and found that it prints “Time = 0.12124s”.
That is much slower than the OpenCV function blur(), which takes 47 ms with the same filter size and input image.
NVIDIA also provides an OpenCL version, the oclboxFilter sample, which prints “Time = 0.00295s” with the same filter size and input image.
Generally speaking, the sample apps that ship with CUDA are designed as vehicles that demonstrate the use of some particular functionality. They are not functions from a high-performance library, so comparing their performance against such libraries does not make sense. You might want to check the NPP library to see whether it offers box filter functionality.
Thank you HannesF99 and njuffa.
I have checked the NPP library, which does offer a box filter function, but it is much slower than the sample code.
It took 20 milliseconds, while the sample code needs only 1.3 milliseconds.
I’ll check ArrayFire.
I can’t believe that the NPP box filter is more than one order of magnitude slower than the sample code from the CUDA SDK (20 milliseconds is an eternity on the GPU). Are you sure you measured the time correctly, and in Release configuration? What are the size and data type of the image, and what size is the box kernel?
Actually, I still see the same problem in CUDA 9.1. I use a GTX 1060 for nppiFilterBoxBorder. It takes about 300 ms for a 5471 x 1000 image and a 131 x 31 kernel, while OpenCV takes only about 30 ms even on an i5-6300HQ CPU. Is the function suited to such a large kernel, or should I use convolution or FFT instead?
For such big kernel sizes you should definitely employ an FFT for the convolution. Is such a big box filter kernel size really necessary? Furthermore, on the CPU one can use some strategies to speed up the box filter calculation that are not really easy to map onto the GPU.
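The post above does not say which CPU strategies it has in mind. One commonly used candidate is the separable running sum, sketched below as an illustration only. A naive 131 x 31 box filter touches 4061 input samples per output pixel, whereas a running sum does one add and one subtract per pixel per pass, independent of the kernel size. The function names (`runningSum1D`, `boxFilterRunningSum`) and the truncated-window border handling are assumptions made for this sketch; they are not taken from OpenCV or NPP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum over the window [i - r, i + r] for each of n samples on a strided line.
// Windows are truncated at the ends (an assumption; real libraries often
// replicate or reflect the border instead).
static void runningSum1D(const float* src, float* dst, int n,
                         std::size_t stride, int r) {
    double acc = 0.0;
    // Prime the accumulator with the window of sample 0.
    for (int i = 0; i <= r && i < n; ++i) acc += src[i * stride];
    for (int i = 0; i < n; ++i) {
        dst[i * stride] = static_cast<float>(acc);
        const int entering = i + r + 1;  // sample that joins the next window
        const int leaving = i - r;       // sample that drops out of it
        if (entering < n) acc += src[entering * stride];
        if (leaving >= 0) acc -= src[leaving * stride];
    }
}

// Mean over a (2*rx+1) x (2*ry+1) box, computed as a horizontal pass
// followed by a vertical pass. Cost per pixel does not depend on rx or ry.
std::vector<float> boxFilterRunningSum(const std::vector<float>& img,
                                       int w, int h, int rx, int ry) {
    assert(w > 0 && h > 0 && rx >= 0 && ry >= 0);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    const std::size_t W = static_cast<std::size_t>(w);
    std::vector<float> tmp(img.size()), out(img.size());

    for (int y = 0; y < h; ++y)   // horizontal pass, row by row
        runningSum1D(img.data() + y * W, tmp.data() + y * W, w, 1, rx);
    for (int x = 0; x < w; ++x)   // vertical pass, column by column
        runningSum1D(tmp.data() + x, out.data() + x, h, W, ry);

    // Normalize by the number of samples actually inside each truncated window.
    for (int y = 0; y < h; ++y) {
        const int ny = std::min(h - 1, y + ry) - std::max(0, y - ry) + 1;
        for (int x = 0; x < w; ++x) {
            const int nx = std::min(w - 1, x + rx) - std::max(0, x - rx) + 1;
            out[y * W + x] /= static_cast<float>(nx * ny);
        }
    }
    return out;
}
```

For the 131 x 31 kernel discussed in this thread the call would be `boxFilterRunningSum(img, w, h, 65, 15)`. The serial dependency inside each running sum is what makes this approach awkward to port directly to a GPU.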
Yeah, that’s right. But I have to use it in the middle of the GPU process.
Also, I tested several different ways to calculate a box filter for a large mask (at least larger than 100 x 50). The best way I know is the integral image. In my tests, nppiBoxFilter takes about 300 ms, convolution takes about 150 ms, and the integral image takes only 20-30 ms, which is much faster. However, it is really complicated to use the integral image with the nppi library. I’ll file a request to see if it can be improved.
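To illustrate why the integral image scales so well with mask size, here is a minimal CPU sketch in C++. It is not the NPP code path used for the measurements above. Once the summed-area table is built, each output pixel needs only four table lookups, whatever the mask dimensions. The function names (`integralImage`, `boxFilterIntegral`), the 8-bit single-channel input, and the truncated-window border handling are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with one extra row and column of zeros:
// sat[y * (w + 1) + x] holds the sum of img over rows [0, y) and columns [0, x).
std::vector<std::uint64_t> integralImage(const std::vector<std::uint8_t>& img,
                                         int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    const std::size_t W = static_cast<std::size_t>(w);
    const std::size_t S = W + 1;  // row stride of the table
    std::vector<std::uint64_t> sat(S * (static_cast<std::size_t>(h) + 1), 0);
    for (std::size_t y = 0; y < static_cast<std::size_t>(h); ++y) {
        std::uint64_t rowSum = 0;  // running sum of the current image row
        for (std::size_t x = 0; x < W; ++x) {
            rowSum += img[y * W + x];
            sat[(y + 1) * S + (x + 1)] = sat[y * S + (x + 1)] + rowSum;
        }
    }
    return sat;
}

// Mean over a (2*rx+1) x (2*ry+1) box using four table lookups per pixel.
// Windows are truncated at the image border (an assumption; OpenCV and NPP
// offer other border modes).
std::vector<float> boxFilterIntegral(const std::vector<std::uint8_t>& img,
                                     int w, int h, int rx, int ry) {
    assert(rx >= 0 && ry >= 0);
    const std::vector<std::uint64_t> sat = integralImage(img, w, h);
    const std::size_t W = static_cast<std::size_t>(w);
    const std::size_t S = W + 1;
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y) {
        const std::size_t y0 = static_cast<std::size_t>(std::max(0, y - ry));
        const std::size_t y1 = static_cast<std::size_t>(std::min(h, y + ry + 1));
        for (int x = 0; x < w; ++x) {
            const std::size_t x0 = static_cast<std::size_t>(std::max(0, x - rx));
            const std::size_t x1 = static_cast<std::size_t>(std::min(w, x + rx + 1));
            // Inclusion-exclusion over the four corners of the window.
            const std::uint64_t sum = sat[y1 * S + x1] + sat[y0 * S + x0]
                                    - sat[y0 * S + x1] - sat[y1 * S + x0];
            const std::size_t area = (x1 - x0) * (y1 - y0);
            out[static_cast<std::size_t>(y) * W + x] =
                static_cast<float>(sum) / static_cast<float>(area);
        }
    }
    return out;
}
```

For a 131 x 31 mask the call would be `boxFilterIntegral(img, w, h, 65, 15)`. The 64-bit accumulator is a deliberately conservative choice; a 32-bit table is enough for 8-bit images of up to roughly 16 million pixels.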