CUDA 6.5 Sample boxFilter benchmark runs slower than OpenCV

Hi all.

I ran the benchmark of the CUDA 6.5 boxFilter sample and found that it prints “Time = 0.12124s”.
That is much slower than the OpenCV function blur(), which takes 47 ms with the same filter size and input image.

NVIDIA also provides an OpenCL version, the oclboxFilter sample, which prints “Time = 0.00295s” with the same filter size and input image.

Does anyone know why it is so slow?

I have not changed anything in the sample code.

Thank you very much in advance!

I’ve found the answer.
Use Release mode rather than Debug mode.
In Release mode, it took 1.3 milliseconds.

Use the ArrayFire library. I suppose it is significantly faster than the SDK sample.

Generally speaking, the sample apps that ship with CUDA are designed as vehicles that demonstrate the use of some particular functionality. They are not functions from a high-performance library, therefore performance comparison with other such libraries does not make sense. You might want to check the NPP library to see whether it offers box filter functionality.

Thank you HannesF99 and njuffa.
I have checked the NPP library, which does offer a box filter function, but it is much slower than the sample code.
It took 20 milliseconds, while the sample code needs only 1.3 milliseconds.
I’ll check ArrayFire.

I can’t believe that the NPP box filter is more than one order of magnitude slower than the sample code from the CUDA SDK (20 milliseconds is an eternity on the GPU). Are you sure you are measuring the time correctly, and in the Release configuration? What are the size and data type of the image, and what size is the box kernel?

I believe there will be some performance improvements in NPP in the next CUDA release, as a result of discussions here:

I don’t know what level of impact that would have on NPP box filter, but if you want to post your test code I can take a look.

Actually, I still see the same problem in CUDA 9.1. I use a GTX 1060 with nppiFilterBoxBorder. It takes about 300 ms for a 5471 x 1000 image and a 131 x 31 kernel, while OpenCV takes only about 30 ms even on an i5-6300HQ CPU. Is nppiFilterBoxBorder suited to such a large kernel, or should I use convolution or FFT instead?

Here’s the code for CUDA:

#include "HalconCpp.h"
#include <stdio.h>
#include <iostream>
#include <string.h>
#include <fstream>

#include <nppi.h>
#include <npp.h>
#include <helper_string.h>
#include <helper_cuda.h>

#include <ImageIO.h>
#include <ImagesNPP.h>
#include <ImagesCPU.h>

#include <cuda_runtime.h>

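int main()
{
	// Load the image on the host and upload it to the device.
	// The backslash is escaped so that "\t" is not read as a tab.
	npp::ImageCPU_8u_C1 Host_Dst;
	npp::loadImage(".\\test3_test.pgm", Host_Dst);
	npp::ImageNPP_8u_C1 Device_src(Host_Dst);
	npp::ImageNPP_8u_C1 Device_dst(Host_Dst.size());
	NppiSize roi_size = { (int)Device_src.width(), (int)Device_src.height() };
	NppiPoint offset = { 0, 0 };
	NppiSize blur_mask_size = { 131, 31 };
	NppiPoint mask_anchor = { (blur_mask_size.width - 1) / 2, (blur_mask_size.height - 1) / 2 };

	// Events must be created before they are recorded.
	cudaEvent_t time1, time2;
	float time = 0.0f;
	cudaEventCreate(&time1);
	cudaEventCreate(&time2);

	// Warm-up call, not timed.
	// The pointer arguments were missing from the posted code;
	// Device_src.data() and Device_dst.data() are assumed here.
	nppiFilterBoxBorder_8u_C1R(Device_src.data(), Device_src.pitch(),
		roi_size, offset, Device_dst.data(), Device_dst.pitch(),
		roi_size, blur_mask_size, mask_anchor, NPP_BORDER_REPLICATE);

	// Timed call.
	cudaEventRecord(time1, 0);
	nppiFilterBoxBorder_8u_C1R(Device_src.data(), Device_src.pitch(),
		roi_size, offset, Device_dst.data(), Device_dst.pitch(),
		roi_size, blur_mask_size, mask_anchor, NPP_BORDER_REPLICATE);
	cudaEventRecord(time2, 0);

	// Wait for the stop event before reading the elapsed time.
	cudaEventSynchronize(time2);
	cudaEventElapsedTime(&time, time1, time2);
	printf("time: %.2f ms\n", time);

	cudaEventDestroy(time1);
	cudaEventDestroy(time2);
	return 0;
}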

In OpenCV, I just use blur() to get the result:

double t1 = clock();
blur(input, conv_mean, Size(131, 31));
double t2 = clock();
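
A note on the CPU timing: clock() reports processor time on POSIX systems, which can differ from wall-clock time when a library runs on several threads, and a single cold call can include one-time setup cost. A minimal sketch of a wall-clock timer with one warm-up run is shown below. The helper name mean_wall_ms is hypothetical and is not part of OpenCV or the CUDA samples.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not from OpenCV or the CUDA samples).
// Runs `work` once untimed as a warm-up, then returns the mean wall-clock
// time in milliseconds over `iterations` timed runs.
double mean_wall_ms(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    work();  // warm-up run, excluded from the measurement
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

It could be used as mean_wall_ms([&] { blur(input, conv_mean, Size(131, 31)); }, 10), assuming input and conv_mean are the matrices from the snippet above.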

For such big kernel sizes you should definitely use FFT for the convolution. Is such a big box filter kernel really necessary? Furthermore, on the CPU one can use some strategies to speed up box filter calculation that are not easy to map onto the GPU.
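
One CPU strategy of this kind (an assumption about what is meant here, not something stated in the reply) is the separable running sum: the box filter is split into a horizontal and a vertical pass, and each pass updates a sliding window sum by adding the sample that enters and subtracting the sample that leaves. The work per pixel is then constant instead of growing with the 131 x 31 mask. A minimal sketch, with hypothetical function names, odd mask sizes, and replicated borders:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Box sum along one line of n samples spaced `stride` elements apart.
// Out-of-range samples replicate the nearest border sample.
static void box_sum_line(const double* src, double* dst, int n,
                         std::ptrdiff_t stride, int k) {
    const int r = k / 2;  // window radius, k assumed odd
    auto at = [&](int i) {
        return src[std::max(0, std::min(i, n - 1)) * stride];
    };
    double sum = 0.0;
    for (int i = -r; i <= r; ++i) sum += at(i);  // window centred on sample 0
    for (int i = 0; i < n; ++i) {
        dst[i * stride] = sum;
        sum += at(i + r + 1) - at(i - r);  // slide the window by one sample
    }
}

// Mean filter with a kw x kh box, computed as two running-sum passes.
std::vector<float> box_mean_running(const std::vector<float>& img,
                                    int width, int height, int kw, int kh) {
    assert(width > 0 && height > 0 && kw % 2 == 1 && kh % 2 == 1);
    assert(img.size() == static_cast<std::size_t>(width) * height);
    std::vector<double> in(img.begin(), img.end());
    std::vector<double> tmp(img.size()), out(img.size());
    for (int y = 0; y < height; ++y) {  // horizontal pass over each row
        const std::size_t row = static_cast<std::size_t>(y) * width;
        box_sum_line(in.data() + row, tmp.data() + row, width, 1, kw);
    }
    for (int x = 0; x < width; ++x) {  // vertical pass over each column
        box_sum_line(tmp.data() + x, out.data() + x, height, width, kh);
    }
    std::vector<float> result(img.size());
    const double area = static_cast<double>(kw) * kh;
    for (std::size_t i = 0; i < out.size(); ++i) {
        result[i] = static_cast<float>(out[i] / area);
    }
    return result;
}
```

The sliding update is inherently sequential along each line, which is one reason it does not map directly onto a one-thread-per-pixel GPU kernel.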

Yeah, that’s right. But I have to use it in the middle of a GPU pipeline.

Also, I tested several different ways to compute the box filter for a large mask (at least larger than 100 x 50). The best way I know of is the integral image. In my test, the NPP box filter takes about 300 ms, convolution takes about 150 ms, and the integral image takes only 20-30 ms, which is much faster. However, using the integral image through the nppi library is really complicated. I’ll file a request to see whether it can be improved.

Hope this helps someone.
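
For readers who have not used the integral image, here is a minimal CPU reference sketch of the idea. It is not the NPP code from my test, and the function names are hypothetical. After one pass that builds the summed-area table, every box sum needs only four table lookups regardless of the mask size. As a simplification, the window is clipped at the image edge and the mean is taken over the pixels that remain, which differs from NPP_BORDER_REPLICATE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with an extra zero row and column:
// sat[(y + 1) * (w + 1) + (x + 1)] = sum of img over rows 0..y, columns 0..x.
std::vector<std::uint64_t> integral_image(const std::vector<std::uint8_t>& img,
                                          int w, int h) {
    assert(w > 0 && h > 0);
    assert(img.size() == static_cast<std::size_t>(w) * h);
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint64_t> sat(stride * (static_cast<std::size_t>(h) + 1), 0);
    for (std::size_t y = 0; y < static_cast<std::size_t>(h); ++y) {
        std::uint64_t row_sum = 0;  // sum of the current row up to column x
        for (std::size_t x = 0; x < static_cast<std::size_t>(w); ++x) {
            row_sum += img[y * w + x];
            sat[(y + 1) * stride + (x + 1)] = sat[y * stride + (x + 1)] + row_sum;
        }
    }
    return sat;
}

// Mean over a kw x kh window centred on each pixel (odd sizes assumed).
// The window is clipped to the image; the mean uses the clipped pixel count.
std::vector<std::uint8_t> box_mean_integral(const std::vector<std::uint8_t>& img,
                                            int w, int h, int kw, int kh) {
    assert(kw % 2 == 1 && kh % 2 == 1);
    const std::vector<std::uint64_t> sat = integral_image(img, w, h);
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    const int rx = kw / 2, ry = kh / 2;
    std::vector<std::uint8_t> out(img.size());
    for (int y = 0; y < h; ++y) {
        const std::size_t y0 = std::max(y - ry, 0);
        const std::size_t y1 = std::min(y + ry + 1, h);  // exclusive
        for (int x = 0; x < w; ++x) {
            const std::size_t x0 = std::max(x - rx, 0);
            const std::size_t x1 = std::min(x + rx + 1, w);  // exclusive
            // Four lookups give the sum over rows [y0, y1) and columns [x0, x1).
            const std::uint64_t sum = sat[y1 * stride + x1] + sat[y0 * stride + x0]
                                    - sat[y0 * stride + x1] - sat[y1 * stride + x0];
            const std::uint64_t area = (x1 - x0) * (y1 - y0);
            out[static_cast<std::size_t>(y) * w + x] =
                static_cast<std::uint8_t>((sum + area / 2) / area);  // rounded mean
        }
    }
    return out;
}
```

On the GPU, the table is typically built with parallel prefix sums along rows and then along columns, after which each output pixel can be computed independently.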