CUDA 6.5 Sample boxFilter benchmark runs slower than OpenCV

RQi · November 19, 2015, 1:36am

Hi all.

I ran the benchmark of the CUDA 6.5 boxFilter sample, and it prints “Time = 0.12124s”.
That is much slower than the OpenCV function blur(), which takes 47 ms with the same filter size and input image.

Also, NVIDIA provides an OpenCL version, the oclboxFilter sample, which prints “Time = 0.00295s” with the same filter size and input image.

Does anyone know why it is so slow?

I have not changed anything in the sample code.

Thank you very much in advance!

RQi · November 24, 2015, 5:11am

I’ve found the answer.
Use Release mode rather than Debug mode.
In Release mode, it took 1.3 milliseconds.

HannesF99 · November 24, 2015, 7:38am

Use the ArrayFire library; I suppose it’s significantly faster than the SDK sample.

njuffa · November 24, 2015, 7:52am

Generally speaking, the sample apps that ship with CUDA are designed as vehicles that demonstrate the use of some particular functionality. They are not functions from a high-performance library, therefore performance comparison with other such libraries does not make sense. You might want to check the NPP library to see whether it offers box filter functionality.

RQi · December 3, 2015, 1:43am

Thank you HannesF99 and njuffa.
I have checked the NPP library, which does offer a box filter function, but it is much slower than the sample code.
It took 20 milliseconds, while the sample code needs only 1.3 milliseconds.
I’ll check ArrayFire.

HannesF99 · December 3, 2015, 8:47am

I can’t believe that the NPP box filter is more than one order of magnitude slower than the sample code from the CUDA SDK (20 milliseconds is an eternity on the GPU). Are you sure you measured the time correctly, and in the Release configuration? What size and data type is the image, and what size is the box kernel?

Robert_Crovella · December 3, 2015, 12:34pm

I believe there will be some performance improvements in NPP in the next CUDA release, as a result of discussions here:

NPP libray fucntions call speed issue - GPU-Accelerated Libraries - NVIDIA Developer Forums

I don’t know what level of impact that would have on NPP box filter, but if you want to post your test code I can take a look.

564810049 · August 2, 2018, 3:08am

Actually, I still see the same problem in CUDA 9.1. I use a GTX 1060 with nppiFilterBoxBorder. It takes about 300 ms for a 5471 x 1000 image and a 131 x 31 kernel, while OpenCV takes only about 30 ms even on an i5-6300HQ CPU. Is NPP suitable for such a large kernel, or should I use convolution or FFT instead?

Here’s the code for CUDA:

#include "HalconCpp.h"
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <fstream>

#include <nppi.h>
#include <npp.h>
#include <helper_string.h>
#include <helper_cuda.h>

#include <ImageIO.h>
#include <ImagesNPP.h>
#include <ImagesCPU.h>

#include <cuda_runtime.h>

int main()
{
	cudaEvent_t time1, time2;
	float time;
	cudaEventCreate(&time1);
	cudaEventCreate(&time2);

	//-----------------------------Halcon_declare-----------------------------------//


	//-----------------------------Halcon_pre_process-----------------------------------//
	
	
	
	npp::ImageCPU_8u_C1 Host_Dst;
	// Backslash escaped: ".\t..." would embed a tab character in the path.
	npp::loadImage(".\\test3_test.pgm", Host_Dst);
	npp::ImageNPP_8u_C1 Device_src(Host_Dst);
	npp::ImageNPP_8u_C1 Device_dst(Host_Dst.size());
	NppiSize roi_size = { (int)Device_src.width(), (int)Device_src.height() };
	NppiPoint offset = { 0, 0 };
	NppiSize blur_mask_size = { 131, 31 };
	NppiPoint mask_anchor = { (blur_mask_size.width - 1) / 2, (blur_mask_size.height - 1) / 2 };


	
	nppiFilterBoxBorder_8u_C1R(
		Device_src.data(), Device_src.pitch(),
		roi_size, offset,
		Device_dst.data(), Device_dst.pitch(),
		roi_size, blur_mask_size, mask_anchor, NPP_BORDER_REPLICATE
	);


	cudaEventRecord(time1, 0);
	nppiFilterBoxBorder_8u_C1R(
		Device_src.data(), Device_src.pitch(),
		roi_size, offset,
		Device_dst.data(), Device_dst.pitch(),
		roi_size, blur_mask_size, mask_anchor, NPP_BORDER_REPLICATE
	);
	cudaEventRecord(time2, 0);

	cudaEventSynchronize(time2);
	cudaEventElapsedTime(&time, time1, time2);
	printf("time: %.2f\n", time);
	system("pause");
}

In OpenCV, I just use blur() to get the result:

clock_t t1 = clock();
blur(input, conv_mean, Size(131, 31));
clock_t t2 = clock();
// Elapsed milliseconds: 1000.0 * (t2 - t1) / CLOCKS_PER_SEC

HannesF99 · August 2, 2018, 1:22pm

For such big kernel sizes you should definitely employ FFT for the convolution. Is such a big box filter kernel size really necessary? Furthermore, on the CPU one can use some strategies to speed up box filter calculation that are not really easy to map onto the GPU.
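
The post does not say which CPU strategies are meant. One common candidate is a separable running sum: a naive 131 x 31 box filter touches 131 * 31 = 4061 pixels per output, whereas a running sum needs one addition and one subtraction per output in each of two 1D passes, regardless of kernel size. The following is only a minimal C++ sketch of the horizontal pass under that assumption. The function name boxFilterRow and the replicate-style edge clamping are illustrative choices, not taken from OpenCV or NPP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: mean over a window of (2*radius + 1) samples centered
// on each element, computed with a running sum so the per-element cost does
// not depend on the radius. Out-of-range samples are clamped to the nearest
// edge sample (a replicate-style border).
std::vector<std::uint8_t> boxFilterRow(const std::vector<std::uint8_t>& src, int radius)
{
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<std::uint8_t> dst(src.size());
    if (n == 0) return dst;

    // Read a sample with the index clamped into [0, n - 1].
    auto at = [&](int i) {
        return static_cast<long long>(src[std::min(std::max(i, 0), n - 1)]);
    };

    const int window = 2 * radius + 1;
    long long sum = 0;
    for (int i = -radius; i <= radius; ++i) sum += at(i);  // window centered on element 0

    for (int x = 0; x < n; ++x) {
        dst[x] = static_cast<std::uint8_t>((sum + window / 2) / window);  // rounded mean
        sum += at(x + radius + 1) - at(x - radius);  // slide the window right by one
    }
    return dst;
}
```

Running this over every row and then over every column of the intermediate result gives a full 2D box filter. For the 131 x 31 kernel discussed here, the radii would be 65 horizontally and 15 vertically.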

564810049 · August 6, 2018, 9:28am

Yeah, that’s right. But I have to use it in the middle of the GPU process.

Also, I tested several different ways to calculate the box filter for a large mask (at least for sizes larger than 100 x 50). The best way I know is the integral image. In my test, the NPP box filter takes about 300 ms, convolution takes about 150 ms, and the integral image takes only 20-30 ms, which is much faster. However, it is truly complicated to use an integral image with the nppi library. I’ll make a request to see if it can be improved.

Hope this can help someone.
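
The poster's integral-image code is not shown in the thread. As a rough illustration of the idea only, here is a CPU-side C++ sketch. The function names integralImage and boxFilterIntegral are hypothetical. The sketch also clips the window at the image border and divides by the clipped area, which is a simplifying assumption and not the same as NPP_BORDER_REPLICATE. Once the summed-area table is built, each output pixel needs four table lookups no matter how large the mask is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Summed-area table with one extra leading row and column of zeros:
// sat[(y + 1) * (w + 1) + (x + 1)] = sum of img over rows 0..y, columns 0..x.
std::vector<std::uint64_t> integralImage(const std::vector<std::uint8_t>& img, int w, int h)
{
    assert(w >= 0 && h >= 0);
    assert(img.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint64_t> sat(stride * (static_cast<std::size_t>(h) + 1), 0);
    for (int y = 0; y < h; ++y) {
        std::uint64_t rowSum = 0;  // sum of the current row up to column x
        for (int x = 0; x < w; ++x) {
            rowSum += img[static_cast<std::size_t>(y) * w + x];
            sat[(y + 1) * stride + (x + 1)] = sat[y * stride + (x + 1)] + rowSum;
        }
    }
    return sat;
}

// Box filter with a (2*rx + 1) x (2*ry + 1) window centered on each pixel.
// Assumption: the window is clipped to the image and the mean is taken over
// the clipped area.
std::vector<std::uint8_t> boxFilterIntegral(const std::vector<std::uint8_t>& img,
                                            int w, int h, int rx, int ry)
{
    assert(rx >= 0 && ry >= 0);
    const std::vector<std::uint64_t> sat = integralImage(img, w, h);
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint8_t> dst(img.size());
    for (int y = 0; y < h; ++y) {
        const int y0 = std::max(y - ry, 0);
        const int y1 = std::min(y + ry + 1, h);  // rows [y0, y1)
        for (int x = 0; x < w; ++x) {
            const int x0 = std::max(x - rx, 0);
            const int x1 = std::min(x + rx + 1, w);  // columns [x0, x1)
            // Four lookups give the window sum (unsigned wraparound cancels out).
            const std::uint64_t sum = sat[y1 * stride + x1] - sat[y0 * stride + x1]
                                    - sat[y1 * stride + x0] + sat[y0 * stride + x0];
            const std::uint64_t area = static_cast<std::uint64_t>(x1 - x0) * (y1 - y0);
            dst[static_cast<std::size_t>(y) * w + x] =
                static_cast<std::uint8_t>((sum + area / 2) / area);  // rounded mean
        }
    }
    return dst;
}
```

For the 131 x 31 mask from this thread, the call would be boxFilterIntegral(img, width, height, 65, 15). A GPU version would replace the table construction with parallel prefix sums along rows and columns, which this sketch does not attempt.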
