Debayering in NPPI slows down after ~900 frames when using cudaHostAlloc.

I’ve been using the following program to measure nppi performance on a Jetson TX2.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <chrono>

#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#include <cuda_runtime_api.h>
#include <cuda.h>
#include <nppi.h>

int
main(int argc, char *argv[])
{
  const int NFRAMES = 10000;
  const int w = 4024;
  const int h = 3036;
  const int bayer_size = w * h * 1;
  const int rgb_size = w * h * 4;
  NppStatus ret;
  char *src;
  char *cuda_src;
  char *cuda_dst;
  cudaError_t cudaerr;
  int src_line_step = w * 1;
  NppiSize src_size = {w, h};
  NppiRect src_roi = {0, 0, w, h};
  int dst_line_step = w * 4;

  // /dev/urandom cannot be mmap()ed, so read a random test frame into an
  // ordinary buffer instead.
  int fd = open("/dev/urandom", O_RDONLY);
  src = (char *) malloc(bayer_size);
  if (fd < 0 || src == NULL) {
    printf("Failed to set up the source frame.\n");
    exit(1);
  }
  for (int off = 0; off < bayer_size; ) {
    ssize_t n = read(fd, src + off, bayer_size - off);
    if (n <= 0) {
      printf("Failed to read random data.\n");
      exit(1);
    }
    off += n;
  }
  close(fd);

  cudaerr = cudaHostAlloc((void **) &cuda_src, bayer_size, cudaHostAllocMapped);
  if (cudaerr) {
    printf("cudaHostAlloc 1 failed with error code %d.\n", cudaerr);
    exit(1);
  }
  cudaerr = cudaHostAlloc((void **) &cuda_dst, rgb_size, cudaHostAllocMapped);
  if (cudaerr) {
    printf("cudaHostAlloc 2 failed with error code %d.\n", cudaerr);
    exit(1);
  }

  // One initial copy of the random frame into the pinned source buffer.
  cudaerr = cudaMemcpy(cuda_src, src, bayer_size, cudaMemcpyDefault);
  if (cudaerr) {
    printf("cudaMemcpy failed with error code %d.\n", cudaerr);
    exit(1);
  }

  printf("Processing %d frames...\n", NFRAMES);
  auto start = std::chrono::steady_clock::now();
  decltype(start) last = std::chrono::steady_clock::now();
  for (int i = 0; i < NFRAMES; ++i) {
    ret = nppiCFAToRGBA_8u_C1AC4R((const Npp8u *) cuda_src,
                                  src_line_step,
                                  src_size,
                                  src_roi,
                                  (Npp8u *) cuda_dst,
                                  dst_line_step,
                                  NPPI_BAYER_RGGB,
                                  NPPI_INTER_UNDEFINED,
                                  255);
    if (ret) {
      printf("Error %d in NPP\n", ret);
      exit(1);
    }

    if (i % 100 == 0) {
      auto now = std::chrono::steady_clock::now();
      std::chrono::duration<double> d = now-last;
      printf("i: %d (time: %f ms)\n", i, d.count() * 1000);
      last=now;
    }
  }
  auto end = std::chrono::steady_clock::now();

  std::chrono::duration<double> diff = end - start;
  printf("Time: %f\n", diff.count());
  printf("Frame Time: %f\n", diff.count() / NFRAMES);
  printf("Effective Frame Rate: %f\n", 1 / (diff.count() / NFRAMES));

  return 0;
}

If you run this, you will see that after around 900 frames, the process slows down immensely (like thousands of times slower). If I use cudaMalloc instead of cudaHostAlloc, performance is constant.

To give you some context: the reason I started trying cudaHostAlloc is that in my application, copying data to and from the GPU was the main bottleneck; most of the program’s time was actually spent copying rather than debayering. I was hoping I could use memory-mapped (zero-copy) buffers to get around the extra copies.
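The zero-copy pattern I had in mind looks roughly like this (a minimal sketch, assuming mapped pinned memory and reusing the sizes and NPP parameters from the program above; the cudaHostGetDevicePointer() step is how I understand the mapped-buffer API):

// Sketch of the zero-copy idea on Tegra: the host writes the Bayer frame
// straight into a mapped pinned buffer and NPP reads it through the
// corresponding device pointer, so no cudaMemcpy is needed.
cudaSetDeviceFlags(cudaDeviceMapHost);   // before any CUDA context is created

Npp8u *host_src, *host_dst;
cudaHostAlloc((void **) &host_src, bayer_size, cudaHostAllocMapped);
cudaHostAlloc((void **) &host_dst, rgb_size, cudaHostAllocMapped);

Npp8u *dev_src, *dev_dst;
cudaHostGetDevicePointer((void **) &dev_src, host_src, 0);
cudaHostGetDevicePointer((void **) &dev_dst, host_dst, 0);

// ... fill host_src with a Bayer frame on the CPU side ...

nppiCFAToRGBA_8u_C1AC4R(dev_src, src_line_step, src_size, src_roi,
                        dev_dst, dst_line_step,
                        NPPI_BAYER_RGGB, NPPI_INTER_UNDEFINED, 255);
cudaDeviceSynchronize();                 // after this, host_dst is valid on the CPU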

You’re observing the effect of the asynchronous launch queue.

Your loop will spin through the first launch operations without waiting for the GPU to complete its work. These launches go into a queue. When the queue fills, the launches are no longer asynchronous, and they wait for a queue slot to open up. At this point you witness the slowdown.
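You can see this directly by timing a single call two ways (a rough sketch using the variables from your program):

// The launch returns almost immediately; the GPU finishes the work later.
auto t0 = std::chrono::steady_clock::now();
nppiCFAToRGBA_8u_C1AC4R((const Npp8u *) cuda_src, src_line_step, src_size, src_roi,
                        (Npp8u *) cuda_dst, dst_line_step,
                        NPPI_BAYER_RGGB, NPPI_INTER_UNDEFINED, 255);
auto t1 = std::chrono::steady_clock::now();   // t1 - t0: launch/queue overhead only

cudaDeviceSynchronize();                      // wait for the GPU to finish the work
auto t2 = std::chrono::steady_clock::now();   // t2 - t0: actual time to debayer a frame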

On an ordinary discrete GPU, when you use cudaHostAlloc for the data arrays, the nppi function call runs much more slowly. When you use cudaMalloc for the data arrays, the nppi function call runs much faster. On a TX2, however, due to the physically unified memory, there should be less difference in execution time between the two.

If you don’t want to see the slowdown at ~900 frames, put a cudaDeviceSynchronize() in your loop. That will force the previous work to complete. Then the loop time will accurately reflect the time it takes to process a frame, right from the start.
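Something like this (a sketch of your timing loop with the synchronize added; each printed interval then reflects completed GPU work rather than queued launches):

for (int i = 0; i < NFRAMES; ++i) {
  ret = nppiCFAToRGBA_8u_C1AC4R((const Npp8u *) cuda_src, src_line_step,
                                src_size, src_roi,
                                (Npp8u *) cuda_dst, dst_line_step,
                                NPPI_BAYER_RGGB, NPPI_INTER_UNDEFINED, 255);
  if (ret) {
    printf("Error %d in NPP\n", ret);
    exit(1);
  }

  cudaDeviceSynchronize();   // force this frame's work to finish before timing it

  if (i % 100 == 0) {
    auto now = std::chrono::steady_clock::now();
    std::chrono::duration<double> d = now - last;
    printf("i: %d (time: %f ms)\n", i, d.count() * 1000);
    last = now;
  }
}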

Another way to think about this is to simply compare the overall time your code takes to complete in each case. There is no magic: your code as shown completes much more quickly when you use cudaMalloc instead of cudaHostAlloc.

Thanks for the explanation. What I have trouble understanding is why copying to CUDA memory and back takes so much more time than a normal memcpy, when on a TX2 the GPU and the CPU share the same physical memory; in the end, copying to and from the device should take about as long as a normal memcpy.

Also, can you tell me what exactly the difference is between memory allocated using cudaHostAlloc and memory allocated using cudaMalloc?

Yes, sorry, I forgot you were on a TX2. In that case, cudaHostAlloc should provide “fast” memory. I’ve edited my previous comments to try and correct my statements.

However, the behavior I described is still what is happening. You are measuring the time it takes to fill the queue (in the first 900 iterations) rather than the time it takes to complete the algorithm.

On a TX2 the time it takes to complete the nppi call will be longer than 220us.

I suggest you put a cudaDeviceSynchronize() in the loop. This will force the loop to proceed only at the rate that the GPU can complete execution of kernels. You can then easily compare the behavior with different scenarios to see what works best.

A pointer returned by cudaMalloc is only accessible in device code. A pointer returned by cudaHostAlloc is usable/accessible in both host and device code. On the TX2, that is the key difference. There are other differences with respect to GPU caching behavior, but if you’d like a full explanation of those, I suggest googling or posting your question on the TX2 forum.
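In code, the accessibility difference looks like this (a minimal sketch; the mapped-allocation behaviour is what I understand applies on the TX2):

char *d_buf, *h_buf;

cudaMalloc((void **) &d_buf, 1024);
// d_buf[0] = 0;   // NOT valid: host code cannot dereference a cudaMalloc pointer,
                   // even on the TX2 where CPU and GPU share physical memory

cudaHostAlloc((void **) &h_buf, 1024, cudaHostAllocMapped);
h_buf[0] = 0;      // valid: the host can read/write this buffer directly,
                   // and the same mapped allocation is also accessible from kernels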