We’re investigating whether NPP can give any performance gain over IPP for image processing, and part of that is comparing nppiCrossCorrValid_NormLevel_8u32f_C1R with ippiCrossCorrValid_NormLevel_8u32f_C1R.
The GPU is a GTX 750 Ti, and the CPU is an i7 3770. The source image is 409600 bytes, the template image is 64640 bytes, and the destination image is 153840 bytes. The source ROI is 1280x160 and the template ROI is 640x101.
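For reference, the destination size quoted above follows directly from the two ROI sizes, because a "valid" cross-correlation only produces output where the template fits entirely inside the source. The following is a minimal sketch of that arithmetic; `Size`, `validOutputSize`, and `validOutputBytes` are illustrative names, not part of NPP or IPP, and the byte count assumes tightly packed 32-bit float rows with no pitch padding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type for this sketch (not an NPP/IPP type).
struct Size {
    int width;
    int height;
};

// Output dimensions of a "valid" cross-correlation: one result per
// position where the template lies fully inside the source ROI.
Size validOutputSize(Size src, Size tpl) {
    assert(src.width >= tpl.width && src.height >= tpl.height);
    return Size{src.width - tpl.width + 1, src.height - tpl.height + 1};
}

// Bytes needed for a 32f destination, assuming no row padding.
std::size_t validOutputBytes(Size src, Size tpl) {
    const Size out = validOutputSize(src, tpl);
    return static_cast<std::size_t>(out.width) *
           static_cast<std::size_t>(out.height) * sizeof(float);
}
```

With a 1280x160 source and a 640x101 template this gives a 641x60 result, which at 4 bytes per float is the 153840 bytes mentioned above.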
On the 750 Ti, the Npp call is taking around 800ms. On the CPU (exact same image and configuration), execution takes 9ms.
I ran the code with the Nvidia Visual Profiler, and three kernels are launched: SignalReductionKernel, ImageReductionKernel, and ForEachPixelNaive.
Each execution of ImageReductionKernel takes about 25us (microseconds). SignalReductionKernel typically takes around 5-6us.
The monster in the room is ForEachPixelNaive, which takes a whopping 800ms to run. Against 9ms on the CPU, that’s nearly a 90x slowdown. The fact that the kernel has “naive” right in its name, along with “ForEach”, suggests this may not be the optimal way to perform cross correlation.
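A back-of-envelope count shows how much work a direct, per-pixel approach would imply for these ROI sizes. The sketch below assumes ForEachPixelNaive evaluates a full template-sized sum for every output pixel, which the kernel name hints at but nothing here confirms; `directCorrelationMacs` is a hypothetical helper, not an NPP or IPP function.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-accumulate count for a direct (non-FFT) valid cross-correlation:
// every output pixel sums over every template pixel.
// Assumption: this models a naive per-pixel kernel; the real NPP kernel
// internals are not known from the profiler output alone.
std::uint64_t directCorrelationMacs(int srcW, int srcH, int tplW, int tplH) {
    assert(srcW >= tplW && srcH >= tplH);
    const std::uint64_t outW = static_cast<std::uint64_t>(srcW - tplW + 1);
    const std::uint64_t outH = static_cast<std::uint64_t>(srcH - tplH + 1);
    const std::uint64_t tplPixels =
        static_cast<std::uint64_t>(tplW) * static_cast<std::uint64_t>(tplH);
    return outW * outH * tplPixels;
}
```

For a 1280x160 source and a 640x101 template this comes to roughly 2.5 billion multiply-accumulates per call, before counting any normalization terms.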
This is on CUDA 6.5, 32-bit. I’m going to try 7.0, 64-bit, and see if that brings any improvement.
Is there a way to speed up the NPP cross correlation?
Not quite sure how I can pull that off, seeing as I don’t know what OS you’re running, where you keep headers and libraries, etc., but this code should run on Windows or Linux. It needs to be linked against ippi.lib, nppi.lib, and cudart.lib.
The GPU (GTX 750 Ti) takes 1600ms to run. The CPU takes around 9-10ms.
Apparently, for smaller template sizes, NPP can be faster than IPP for this function. However, this isn’t one of those cases. Anyway, I’ve filed a performance bug with the team responsible for this library. They are aware of the issue. I don’t have any further details on it at this time.
One note about NPP benchmarking: the first call to any CUDA library function may involve significant start-up overhead. Whether this factors into your comparison is your decision, of course. In this case I took your code and added an extra “warm-up” call to the main nppi function immediately before the timing area, and it resulted in a significant (~2x) reduction in the execution time of the code in the timing area. That was not enough to swing the balance in favor of NPP, however.
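The warm-up pattern described above can be sketched generically as follows. This is only an illustration of the idea, not the code that was actually benchmarked: `timeWithWarmup` is a hypothetical helper, and the callable passed in stands for whatever is being measured.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` `warmups` times without timing it (to absorb one-time
// start-up cost), then returns the average wall-clock time in
// milliseconds over `iterations` timed runs.
template <typename Fn>
double timeWithWarmup(Fn&& work, int warmups, int iterations) {
    assert(warmups >= 0 && iterations > 0);
    for (int i = 0; i < warmups; ++i) {
        work();  // untimed warm-up call
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();  // timed call
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

When timing GPU work with a host-side clock like this, the callable would presumably need to wrap the nppi call followed by cudaDeviceSynchronize(), so that the timer does not stop before the kernels have finished.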