Very poor performance with NPP CrossCorrValid

We’re investigating whether NPP can offer any performance advantage over IPP for image processing, and part of that is comparing nppiCrossCorrValid_NormLevel_8u32f_C1R against ippiCrossCorrValid_NormLevel_8u32f_C1R.

The GPU is a GTX 750 Ti, and the CPU is an i7-3770. The source image is 409600 bytes, the template image is 64640 bytes, and the destination image is 153840 bytes. The source ROI is 1280x160; the template ROI is 640x101.

On the 750 Ti, the NPP call takes around 800 ms. On the CPU, with the exact same images and configuration, execution takes 9 ms.

I ran the code with the NVIDIA Visual Profiler, and there are three kernels launched: SignalReductionKernel, ImageReductionKernel, and ForEachPixelNaive.

Each execution of ImageReductionKernel takes about 25 us (microseconds). The SignalReductionKernel is typically around 5-6 us.

The monster in the room is ForEachPixelNaive, which takes a whopping 800 ms to run, roughly a 90x slowdown relative to the CPU. The fact that the kernel name contains both "Naive" and "ForEach" suggests this may not be an optimal way to perform cross-correlation.

This is on CUDA 6.5, 32-bit. I’m going to try CUDA 7.0, 64-bit, and see if that brings any improvement.

Is there a way to speed up the NPP cross correlation?

I tried CUDA 7.0, 64-bit. NPP execution time went up to 3,000 ms.

If you want to provide a complete test code, I will take a look.

It needs to be code that I can copy, paste, compile, and run without having to add or change anything. Likewise, it should include the complete code for the CPU comparison.

If that’s a problem, perhaps someone else will have some suggestions.

Not quite sure how I can pull that off, seeing as I don’t know what OS you’re running, where you keep your headers and libraries, etc., but this code should run on Windows or Linux. It needs to be linked against ippi.lib, nppi.lib, and cudart.lib.

The GPU takes 1600 ms to run (GTX 750 Ti). The CPU takes around 9-10 ms.

#include "ippi.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "npp.h"
#include <sstream>
#include <iostream>

#include <stdio.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
static double second(void)
{
   LARGE_INTEGER t;
   static double oofreq;
   static int checkedForHighResTimer;
   static BOOL hasHighResTimer;

   if (!checkedForHighResTimer) {
      hasHighResTimer = QueryPerformanceFrequency(&t);
      oofreq = 1.0 / (double)t.QuadPart;
      checkedForHighResTimer = 1;
   }
   if (hasHighResTimer) {
      QueryPerformanceCounter(&t);
      return (double)t.QuadPart * oofreq;
   }
   else {
      return (double)GetTickCount() / 1000.0;
   }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
static double second(void)
{
   struct timeval tv;
   gettimeofday(&tv, NULL);
   return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
#else
#error unsupported platform
#endif

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
   if (code != cudaSuccess) {
      std::cout << "GPUassert: " << cudaGetErrorString(code) << " " << file << ", line:" << line << std::endl;
      if (abort) exit(code);
   }
}

void GPUCorrelation(const Npp8u *offsetSourceBuffer, const size_t offsetSourceBufferSize, int offsetSourceBufferStep,
   const NppiSize srcBufferROISize,
   const Npp8u *refBuffer, const size_t refBufferSize, int refBufferStep,
   const NppiSize refBufferROISize,
   const size_t dstBufferSize, int dstBufferStep)
{
   Npp8u *d_offsetSourceBuffer, *d_refBuffer, *d_scratchBuffer;
   Npp32f *d_dstBuffer;

   gpuErrchk(cudaMalloc((void**)(&d_offsetSourceBuffer), offsetSourceBufferSize));
   gpuErrchk(cudaMemcpy((void*)d_offsetSourceBuffer, (void*)offsetSourceBuffer, offsetSourceBufferSize, cudaMemcpyHostToDevice));

   gpuErrchk(cudaMalloc((void**)(&d_refBuffer), refBufferSize));
   gpuErrchk(cudaMemcpy((void*)d_refBuffer, (void*)refBuffer, refBufferSize, cudaMemcpyHostToDevice));

   gpuErrchk(cudaMalloc((void**)(&d_dstBuffer), dstBufferSize));

   // Scratch buffer required by the NormLevel variant.
   int scratchSize = 0;
   nppiValidNormLevelGetBufferHostSize_8u32f_C1R(srcBufferROISize, &scratchSize);
   gpuErrchk(cudaMalloc((void**)(&d_scratchBuffer), scratchSize));

   cudaEvent_t start, stop;
   gpuErrchk(cudaEventCreate(&start));
   gpuErrchk(cudaEventCreate(&stop));

   gpuErrchk(cudaEventRecord(start));
   NppStatus status = nppiCrossCorrValid_NormLevel_8u32f_C1R(
      d_offsetSourceBuffer,
      offsetSourceBufferStep, //1280
      srcBufferROISize,       //1280x160
      d_refBuffer,
      refBufferStep,          //640
      refBufferROISize,       //640x101
      d_dstBuffer,
      dstBufferStep,          //2592
      d_scratchBuffer);
   gpuErrchk(cudaEventRecord(stop));
   gpuErrchk(cudaEventSynchronize(stop));

   if (status != NPP_SUCCESS)
      std::cout << "NPP status: " << status << std::endl;

   float milliseconds = 0;
   cudaEventElapsedTime(&milliseconds, start, stop);
   std::cout << "GPU Execution time: " << milliseconds << "ms" << std::endl;

   cudaFree(d_offsetSourceBuffer);
   cudaFree(d_refBuffer);
   cudaFree(d_dstBuffer);
   cudaFree(d_scratchBuffer);
}

void CPUCorrelation(const Ipp8u *offsetSourceBuffer, int offsetSourceBufferStep,
   const IppiSize &sourceBufferROISize, const Ipp8u *refBuffer, int refBufferStep,
   const IppiSize &refBufferSize, Ipp32f *result, int resultStep)
{
   double cpuStart = second();

   IppStatus intelStatus = ippiCrossCorrValid_NormLevel_8u32f_C1R(offsetSourceBuffer,
      offsetSourceBufferStep, sourceBufferROISize,
      refBuffer, refBufferStep, refBufferSize,
      result, resultStep);
   if (intelStatus != ippStsNoErr)
      std::cout << "IPP status: " << intelStatus << std::endl;

   double cpuEnd = second();
   std::cout << "CPU Execution time: " << (cpuEnd - cpuStart) * 1000 << "ms" << std::endl;
}

const size_t sourceBufferSize = 409600;
const int srcBufferWidth = 1280;
const int srcBufferHeight = 160;

const int refBufferROIWidth = 640;
const int refBufferROIHeight = 101;
const size_t refBufferSize = 64640;

int offsetSourceBufferStep = 1280;
int refBufferStep = 640;

const size_t dstBufferSize = 153840;
const int dstBufferWidth = 641;
const int dstBufferHeight = 60;
int dstBufferStep = 2592;

void RunNpp()
{
   NppiSize srcBufferROISize;
   srcBufferROISize.width = srcBufferWidth;
   srcBufferROISize.height = srcBufferHeight;
   int ofstCount = sourceBufferSize / sizeof(Npp8u);
   Npp8u *offsetSourceBuffer = new Npp8u[ofstCount];
   for (int i = 0; i < ofstCount; i++)
      offsetSourceBuffer[i] = (Npp8u)i;

   NppiSize refBufferROISize;
   refBufferROISize.width = refBufferROIWidth;
   refBufferROISize.height = refBufferROIHeight;
   int refBufferCount = refBufferSize / sizeof(Npp8u);
   Npp8u *refBuffer = new Npp8u[refBufferCount];
   for (int i = 0; i < refBufferCount; i++)
      refBuffer[i] = (Npp8u)(i * 5);

   GPUCorrelation(offsetSourceBuffer, sourceBufferSize, offsetSourceBufferStep,
      srcBufferROISize,
      refBuffer, refBufferSize, refBufferStep,
      refBufferROISize,
      dstBufferSize, dstBufferStep);

   delete[] offsetSourceBuffer;
   delete[] refBuffer;
}

void RunIpp()
{
   IppiSize srcBufferROISize;
   srcBufferROISize.width = srcBufferWidth;
   srcBufferROISize.height = srcBufferHeight;
   int ofstCount = sourceBufferSize / sizeof(Ipp8u);
   Ipp8u *offsetSourceBuffer = new Ipp8u[ofstCount];
   for (int i = 0; i < ofstCount; i++)
      offsetSourceBuffer[i] = (Ipp8u)i;

   IppiSize refBufferROISize;
   refBufferROISize.width = refBufferROIWidth;
   refBufferROISize.height = refBufferROIHeight;
   int refBufferCount = refBufferSize / sizeof(Ipp8u);
   Ipp8u *refBuffer = new Ipp8u[refBufferCount];
   for (int i = 0; i < refBufferCount; i++)
      refBuffer[i] = (Ipp8u)(i * 5);

   Ipp32f *dstBuffer = ippiMalloc_32f_C1(dstBufferWidth, dstBufferHeight, &dstBufferStep);

   CPUCorrelation(offsetSourceBuffer, offsetSourceBufferStep,
      srcBufferROISize, refBuffer, refBufferStep,
      refBufferROISize, dstBuffer, dstBufferStep);

   delete[] offsetSourceBuffer;
   delete[] refBuffer;
   ippiFree(dstBuffer);
}

int main(void)
{
   RunNpp();
   RunIpp();
   return 0;
}

I wish I could attach a photo of the profiler. The longest-running kernel is

void ForEachPixelNaive<float, int=1, ValidCrossCorrNormLevel<Pixel<unsigned char, int=1>, Pixel<double, int=1>, AssignOperation<double, float>>>(Image<float, int=1>, NppiSize, unsigned char)

Apparently, for smaller template sizes NPP can be faster than IPP for this function; however, this isn’t one of those cases. Anyway, I’ve filed a performance bug with the team responsible for this library. They are aware of the issue. I don’t have any further details on it at this time.

One note about NPP benchmarking: the first call to any CUDA library function may involve significant start-up overhead. Whether this factors into your comparison is your decision, of course. In this case I took your code and added an extra “warm-up” call to the main nppi function immediately before the timed region, and it resulted in a significant (~2x) reduction in the execution time of the timed region. Not enough to swing the balance in favor of NPP, however.

Thanks for your help.

Is the “warm-up” call just any nppi function, or is there a special / recommended function to get the initialization done?

For benchmarking purposes, call the function you intend to benchmark. For other uses, be aware that the first time you call a function, it may take longer to execute.