NPP performance problems: not impressive at all...


Well, I wanted to test my GTX 460 against my Core i5-750. For the CPU I compiled OpenCV with all the optimizations available for this processor (except TBB). The timing is done around this code segment:

#pragma omp parallel for

for(int j = 0; j<nIterations; j++){

    cvSmooth( img, dst, CV_MEDIAN, 5, 5);

}

For the GTX 460 I used NPP. Since kernel launches in CUDA are asynchronous, I'm timing this code, including the final copy back to the host:

for(int j = 0; j<nIterations; j++){

                eStatusNPP = nppiFilterMax_8u_C1R(, oDeviceSrc.pitch(), 

                                        , oDeviceDst.pitch(), 

                                                  oSizeROI, oMaskSize, oAnchor);

}

oDeviceDst.copyTo(, oHostDst.pitch());
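
The measurement around both loops can be sketched roughly as below. This is only an illustration, not my exact code: `timeLoop` and its two callbacks are hypothetical names, and I'm assuming a plain wall-clock timer from `std::chrono`. The point is that the clock stops only after the final blocking step (the copy back to the host in the GPU case), so any work still queued on the device is included.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` nIterations times, then `finish` once,
// and returns the elapsed wall-clock time in seconds.
// `finish` stands in for a blocking step such as copying the result
// back to the host, so asynchronously queued work is counted too.
double timeLoop(int nIterations,
                const std::function<void()>& work,
                const std::function<void()>& finish) {
    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < nIterations; j++) {
        work();
    }
    finish();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

In the CPU case `work` would wrap the `cvSmooth` call and `finish` would do nothing; in the GPU case `work` would wrap the NPP call and `finish` the `copyTo`.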

In the CPU case all 4 cores go up to 100% as expected. In the GPU case one core goes to 100%; I think this is because it is constantly sending commands to the GPU, right?

But to my surprise, both libraries take the same time (~260 seconds) to do the processing!

nIterations = 100000

Image Size: 2048x1024

I’m using 64-bit Ubuntu Linux with the 260.19.06 driver.

nppGetGpuName() returns GeForce GTX 460, and in nvidia-settings I can see that the GPU is at performance level 3 while processing.

Am I missing something with NPP?