Very poor performance with NPP CrossCorrValid

curtishx · May 13, 2015, 7:16pm

We’re investigating if Npp can give any performance over Ipp on image processing, and part of it is to compare nppiCrossCorrValid_NormLevel_8u32f_c1r and ippiCrossCorrValid_NormLevel_8u32f_C1R.

The GPU is a GTX 750 Ti, and the CPU is a i7 3770. Source image is 409600 bytes. Template image is 64640 bytes, and the destination image is 153840 bytes. Source ROI is 1280x160. Template ROI is 640x101.

On the 750 Ti, the Npp call is taking around 800ms. On the CPU (exact same image and configuration), execution takes 9ms.

I ran the code with the Nvidia Visual Profiler, and there’s 3 kernels that are launched: SignalReductionKernel, ImageReductionKernel, and ForEachPixelNaive.

Each execution of ImageRecutionKernel takes about 25us (microseconds). The SignalReductionKernel is typically around 5-6us.

The monster in the room is the ForEachPixelNaive, which takes a whopping 800ms to run. That’s an 80x slowdown over the CPU. The fact the kernel name has “naive” right in the name, along with “ForEach”, indicates this may not be optimal way to perform cross correlation.

This is on Cuda 6.5 32 bit. I’m going to try 7.0 64 bit, and see if that has any improvements.

Is there a way to speed up the NPP cross correlation?

curtishx · May 13, 2015, 9:08pm

I tried cuda 7.0 64 bit. Npp execution time went up to 3,000ms.

Robert_Crovella · May 14, 2015, 2:09am

If you want to provide a complete test code, I will take a look.

It needs to be a code that I can copy, paste, compile, and run, without having to add anything or change anything. Likewise, it should include the complete code for the CPU comparison.

If that’s a problem, perhaps someone else will have some suggestions.

curtishx · May 14, 2015, 7:33pm

Not quite sure how I can pull that off, seeing as I don’t know what OS you’re running, where you keep headers / libraries, etc, but this code should run on Windows or Linux. It needs to be linked against ippi.lib, nppi.lib, and cudart.lib.

The GPU takes 1600ms to run (GTX 750 ti). CPU is taking around 9-10ms.

#include "ippi.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "npp.h"
#include <sstream>
#include <iostream>

#include <stdio.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
static double second(void)
{
   LARGE_INTEGER t;
   static double oofreq;
   static int checkedForHighResTimer;
   static BOOL hasHighResTimer;

   if (!checkedForHighResTimer) {
      hasHighResTimer = QueryPerformanceFrequency(&t);
      oofreq = 1.0 / (double)t.QuadPart;
      checkedForHighResTimer = 1;
   }
   if (hasHighResTimer) {
      QueryPerformanceCounter(&t);
      return (double)t.QuadPart * oofreq;
   }
   else {
      return (double)GetTickCount() / 1000.0;
   }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
static double second(void)
{
   struct timeval tv;
   gettimeofday(&tv, NULL);
   return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
#else
#error unsupported platform
#endif

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
   if (code != cudaSuccess)
   {
      std::cout << "GPUassert: " << cudaGetErrorString(code) << " " << file << ", line:" << line << std::endl;
      if (abort) exit(code);
   }
}

void GPUCorrelation(const Npp8u *offsetSourceBuffer, const size_t offsetSourceBufferSize, int offsetSourceBufferStep,
   const NppiSize srcBufferROISize,
   const Npp8u *refBuffer, const size_t refBufferSize, int refBufferStep,
   const NppiSize refBufferROISize,
   const size_t dstBufferSize, int dstBufferStep)
{
   Npp8u *d_offsetSourceBuffer,
      *d_refBuffer,
      *d_scratchBuffer;
   Npp32f *d_dstBuffer;

   gpuErrchk(cudaMalloc((void**)(&d_offsetSourceBuffer), offsetSourceBufferSize));
   gpuErrchk(cudaMemcpy((void*)d_offsetSourceBuffer, (void*)offsetSourceBuffer, offsetSourceBufferSize, cudaMemcpyHostToDevice));

   gpuErrchk(cudaMalloc((void**)(&d_refBuffer), refBufferSize));
   gpuErrchk(cudaMemcpy((void*)d_refBuffer, (void*)refBuffer, refBufferSize, cudaMemcpyHostToDevice));

   gpuErrchk(cudaMalloc((void**)(&d_dstBuffer), dstBufferSize));

   int scratchSize = 0;
   nppiValidNormLevelGetBufferHostSize_8u32f_C1R(srcBufferROISize, &scratchSize);
   gpuErrchk(cudaMalloc((void**)(&d_scratchBuffer), scratchSize));

   cudaEvent_t start, stop;
   cudaEventCreate(&start);
   cudaEventCreate(&stop);
   cudaEventRecord(start);

   nppiCrossCorrValid_NormLevel_8u32f_C1R(d_offsetSourceBuffer,
      offsetSourceBufferStep, //1280
      srcBufferROISize,
      d_refBuffer,
      refBufferStep,//640
      refBufferROISize,
      d_dstBuffer,
      dstBufferStep,//2592
      d_scratchBuffer);

   cudaEventRecord(stop);
   cudaEventSynchronize(stop);

   float milliseconds = 0;
   cudaEventElapsedTime(&milliseconds, start, stop);
   std::cout << "GPU Execution time: " << milliseconds <<"ms"<< std::endl;

   cudaEventDestroy(start);
   cudaEventDestroy(stop);

   gpuErrchk(cudaFree(d_offsetSourceBuffer));
   gpuErrchk(cudaFree(d_refBuffer));
   gpuErrchk(cudaFree(d_scratchBuffer));
   gpuErrchk(cudaFree(d_dstBuffer));
}

void CPUCorrelation(const Ipp8u *offsetSourceBuffer, int offsetSourceBufferStep,
   const IppiSize &sourceBufferROISize, const Ipp8u *refBuffer, int refBufferStep,
   const IppiSize &refBufferSize, Ipp32f *result, int resultStep)
{
   double cpuStart = second();

   IppStatus intelStatus = ippiCrossCorrValid_NormLevel_8u32f_C1R(offsetSourceBuffer,
      offsetSourceBufferStep,
      sourceBufferROISize,
      refBuffer,
      refBufferStep,
      refBufferSize,
      result,
      resultStep);

   double cpuEnd = second();
   std::cout << "CPU Execution time: " << (cpuEnd - cpuStart) * 1000 <<"ms" <<  std::endl;
}


const size_t sourceBufferSize = 409600;
const int srcBufferWidth = 1280;
const int srcBufferHeight = 160;

const int refBufferROIWidth = 640;
const int refBufferROIHeight = 101;
const size_t refBufferSize = 64640;

int offsetSourceBufferStep = 1280;
int refBufferStep = 640;

const size_t dstBufferSize = 153840;
const int dstBufferWidth = 641;
const int dstBufferHeight = 60;
int dstBufferStep = 2592;

void RunNpp()
{
   NppiSize srcBufferROISize;
   srcBufferROISize.width = srcBufferWidth;
   srcBufferROISize.height = srcBufferHeight;
   int ofstCount = sourceBufferSize / sizeof(Ipp8u);
   Npp8u *offsetSourceBuffer = new Npp8u[ofstCount];
   for (int i = 0; i < ofstCount; i++)
   {
      offsetSourceBuffer[i] = (Npp8u)i;
   }

   NppiSize refBufferROISize;
   refBufferROISize.width = refBufferROIWidth;
   refBufferROISize.height = refBufferROIHeight;
   int refBufferCount = refBufferSize / sizeof(Ipp8u);
   Npp8u *refBuffer = new Npp8u[refBufferCount];
   for (int i = 0; i < refBufferCount; i++)
   {
      refBuffer[i] = (Npp8u) i * 5;
   }

   GPUCorrelation(offsetSourceBuffer, sourceBufferSize, offsetSourceBufferStep,
      srcBufferROISize,
      refBuffer, refBufferSize, refBufferStep,
      refBufferROISize,
      dstBufferSize, dstBufferStep);

   delete[] offsetSourceBuffer;
   delete[] refBuffer;
}

void RunIpp()
{
   IppiSize srcBufferROISize;
   srcBufferROISize.width = srcBufferWidth;
   srcBufferROISize.height = srcBufferHeight;
   int ofstCount = sourceBufferSize / sizeof(Ipp8u);
   Ipp8u *offsetSourceBuffer = new Ipp8u[ofstCount];
   for (int i = 0; i < ofstCount; i++)
   {
      offsetSourceBuffer[i] = (Ipp8u)i;
   }

   IppiSize refBufferROISize;
   refBufferROISize.width = refBufferROIWidth;
   refBufferROISize.height = refBufferROIHeight;
   int refBufferCount = refBufferSize / sizeof(Ipp8u);
   Ipp8u *refBuffer = new Ipp8u[refBufferCount];
   for (int i = 0; i < refBufferCount; i++)
   {
      refBuffer[i] = (Ipp8u) i * 5;
   }

   Ipp32f *dstBuffer = ippiMalloc_32f_C1(dstBufferWidth, dstBufferHeight, &dstBufferStep);

   CPUCorrelation(offsetSourceBuffer, offsetSourceBufferStep,
      srcBufferROISize, refBuffer, refBufferStep,
      refBufferROISize, dstBuffer, dstBufferStep);

   delete[] offsetSourceBuffer;
   delete[] refBuffer;
   ippiFree(dstBuffer);
}

int main(void)
{
   RunNpp();

   RunIpp();
   std::cin.get();
}

I wish I could attach a photo of the profiler. The longest running kernel is

void ForEachPixelNaive<float, int=1, ValidCrossCorrNormLeve<Pixel<unsigned char, int=1>, Pixel<double, int=1>, AssignOperation<double, float>>>(Image<float, int=1>, NppiSize, unsigned char)

Robert_Crovella · May 16, 2015, 10:30pm

Apparently for smaller template sizes, NPP can be faster than IPP for this function. However this isn’t one of those cases. Anyway I’ve filed a performance bug with the team responsible for this library. They are aware of the issue. I don’t have any further details on it at this time.

One note about NPP benchmarking: The first call to any CUDA library function may involve a significant start-up overhead. Whether or not this factors into your comparison or not is your decision of course, however in this case I took your code and added an extra “warm-up” call to the main nppi function, immediately prior to the timing area, and it resulted in a significant (~2x) reduction in the execution time of the code in the timing area. Not enough to swing the balance in favor of NPP, however.

curtishx · May 18, 2015, 11:15am

Thanks for you help.

Is the “warm-up” call just any nnpi function, or is there a special / recommended function to get the initialization done?

Robert_Crovella · May 18, 2015, 2:36pm

For benchmarking purposes, call the function you intend to benchmark. For other use, be aware that the first time you call a function it may take longer to execute.

eightbit11 · May 25, 2022, 12:46am

Hi,
Is there any update?

cuda 11.2 + RTX3070 system still slow.

image size is 1754x6981
and template size is 1744x6971
result size is 11x11

function name is nppiCrossCorrValid_NormLevel_8u32f_C1R

IPP takes 200ms but NPP takes 600ms.

Robert_Crovella · May 25, 2022, 1:10am

The issue reported in this thread was addressed in the CUDA 8.0 timeframe. Note the statement about very large template sizes.

If you wish to report an issue, and have a complete test case, you can file a bug.

Topic		Replies	Views
NPP function nppiCrossCorrFull_NormLevel_8u32f_C1R too slow??? CUDA Programming and Performance	8	1609	March 7, 2015
NPP NPP cross correlation CUDA Programming and Performance	3	5172	April 24, 2014
About inaccuracy in nppiCrossCorrValid_NormLevel() GPU-Accelerated Libraries npp	0	617	February 7, 2022
Are Nppi CrossCorrelation functions working? GPU-Accelerated Libraries	0	482	September 11, 2020
NPP_STEP_ERROR when using Cross correlation function from NPP CUDA Programming and Performance	0	1114	April 24, 2014
How to realise 1D Crosscorrelation between tow signals on GPU GPU-Accelerated Libraries npp	0	817	February 9, 2022
Normalized correlation image v.s. template CUDA Programming and Performance	1	398	January 24, 2022
2D cross correlation CUDA Programming and Performance	11	26253	May 19, 2011
help me help you with modern CMake and CUDA: mwe for NPP GPU-Accelerated Libraries	17	3072	February 11, 2021
Could someone give one code example - how to use nppiSqrDistanceValid_Norm_8u32f_C1R_Ctx? GPU-Accelerated Libraries npp	13	1364	October 21, 2021

Very poor performance with NPP CrossCorrValid

Related topics