How to improve the performance of the NPP resize function?

Hi,

I have written code to downscale an RGB24 image using nppiResizeSqrPixel_8u_C3R, shown below.
I run it to resize an RGB24 image (1280x720) to half size (nScaleFactor == 0.5f).
The result looks as expected, but it takes much longer than the FFmpeg scaler.

npp resizer : 248.52 ms (excluding the time for cudaMemcpy and cudaMalloc)
ffmpeg scaler : 4.25 ms

How can I improve the performance of the NPP resize function?

/////////////////////////////////////////////////////////
cudaMalloc( (void**)&devSrc,nSrcSize);
cudaMalloc( (void**)&devDst,nDstSize);

QueryPerformanceCounter(&swStart);

cudaMemcpy((void*)devSrc,(void*)dc.GetImageData(),nSrcSize,cudaMemcpyHostToDevice);

NppiSize oSrcSize; oSrcSize.width = nSrcW; oSrcSize.height = nSrcH;
NppiRect oSrcROI = {0,0,nSrcW,nSrcH};
NppiRect oDstROI = {0,0,nDstW,nDstH};

nppiResizeSqrPixel_8u_C3R(devSrc, //RGB24 image data
oSrcSize,
nSrcW*3, // stride in bytes
oSrcROI,
devDst,
nDstW*3, // stride in bytes
oDstROI,
nScaleFactor, // nXFactor
nScaleFactor, // nYFactor
0, // nXShift
0, // nYShift
NPPI_INTER_LINEAR
);

cudaMemcpy((void*)hostDst,(void*)devDst,nDstSize,cudaMemcpyDeviceToHost);

QueryPerformanceCounter(&swEnd);
fTimeElapsed = ((swEnd.QuadPart-swStart.QuadPart)/(float)swFreq.QuadPart)*1000;

printf("image Npp scaling completed!! elapsed time = %f ms \n",fTimeElapsed);

cudaFree(devSrc);
cudaFree(devDst);
/////////////////////////////////////////////////

You might be doing something wrong. FFmpeg does in fact include a CUDA scaler, called scale_npp, which uses nppiResizeSqrPixel_8u_C1R(). It is noticeably faster than the software scaler. I suggest you look at their code and see what you could do differently.

The FFmpeg scaler I used is the software scaler; I just wanted to compare the two scalers.
I modified my code as shown below, and performance improved noticeably.
Elapsed time is 0.98745 ms (excluding the time for nppiMalloc_8u_C3()).

However, the resized image is not correct.
https://www.dropbox.com/s/9xersbbpst4dad4/scaled.png?dl=0

What is wrong with the code?

/////////////////////////////////////////////
int srcStep = 0;
int dstStep = 0;
devSrc = nppiMalloc_8u_C3(nSrcW,nSrcH,&srcStep);
devDst = nppiMalloc_8u_C3(nDstW,nDstH,&dstStep);

QueryPerformanceCounter(&swStart);

cudaMemcpy2D(devSrc,srcStep,(void*)dc.GetImageData(),nSrcW*3,nSrcW,nSrcH,cudaMemcpyHostToDevice);
//cudaMemcpy((void*)devSrc,(void*)dc.GetImageData(),nSrcSize,cudaMemcpyHostToDevice);
NppiSize oSrcSize; oSrcSize.width = nSrcW; oSrcSize.height = nSrcH;
NppiRect oSrcROI = {0,0,nSrcW,nSrcH};
NppiRect oDstROI = {0,0,nDstW,nDstH};

status = nppiResizeSqrPixel_8u_C3R(devSrc,
oSrcSize,
srcStep, // nSrcW*3, // stride
oSrcROI,
devDst,
dstStep, // nDstW*3,
oDstROI,
nScaleFactor, // nXFactor
nScaleFactor, // nYFactor
0.5, // nXShift
0.5, // nYShift
NPPI_INTER_LINEAR);

cudaMemcpy2D(hostDst, nDstW*3,(void*)devDst,dstStep,nDstW,nDstH,cudaMemcpyDeviceToHost);

QueryPerformanceCounter(&swEnd);

fTimeElapsed = ((swEnd.QuadPart-swStart.QuadPart)/(float)swFreq.QuadPart)*1000;
printf("image Npp scaling completed!! elapsed time = %f ms \n",fTimeElapsed);
nppiFree(devSrc);
nppiFree(devDst);
//////////////////////////////////////////////////////////////////////////

I figured out what was wrong in my code.

cudaMemcpy2D(devSrc,srcStep,(void*)dc.GetImageData(),nSrcW*3,nSrcW,nSrcH,cudaMemcpyHostToDevice);
change to ==>
cudaMemcpy2D(devSrc,srcStep,(void*)dc.GetImageData(),nSrcW*3,nSrcW*3,nSrcH,cudaMemcpyHostToDevice);

The width argument of cudaMemcpy2D is in bytes, and each row of the source RGB image is width*3 bytes.
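To illustrate why the pixel width produced a broken image, here is a minimal host-only sketch. It does not call CUDA: copy2D is a hypothetical stand-in that mimics the row-by-row behaviour of cudaMemcpy2D, copiedBytes is a hypothetical helper, and the 64-byte row padding is an assumed value chosen only for illustration (the real step returned by nppiMalloc_8u_C3 may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-only stand-in for cudaMemcpy2D (hypothetical helper, not a CUDA call).
// Both pitches and widthBytes are measured in bytes, as in the real API.
void copy2D(std::uint8_t* dst, std::size_t dpitch,
            const std::uint8_t* src, std::size_t spitch,
            std::size_t widthBytes, std::size_t height) {
    assert(widthBytes <= dpitch && widthBytes <= spitch);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dpitch, src + y * spitch, widthBytes);
    }
}

// Copies a packed RGB24 image (every byte 0xFF) into a zeroed, padded buffer
// and returns how many bytes actually arrived in the destination.
std::size_t copiedBytes(std::size_t widthPixels, std::size_t height,
                        std::size_t widthBytes) {
    const std::size_t srcPitch = widthPixels * 3;            // packed RGB24 row
    const std::size_t dstPitch = (srcPitch + 63) / 64 * 64;  // assumed padding
    std::vector<std::uint8_t> src(srcPitch * height, 0xFF);
    std::vector<std::uint8_t> dst(dstPitch * height, 0x00);
    copy2D(dst.data(), dstPitch, src.data(), srcPitch, widthBytes, height);
    std::size_t n = 0;
    for (std::uint8_t b : dst) n += (b != 0);  // count bytes that were written
    return n;
}
```

For a 4x2 image, copiedBytes(4, 2, 4) transfers only 8 of the 24 image bytes, while copiedBytes(4, 2, 4 * 3) transfers all 24. The device-to-host cudaMemcpy2D in the code above also passes nDstW as the width, so the same change (nDstW*3) would presumably be needed there as well.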