Efficient VPI and NPP interop

hbalint · April 27, 2021, 8:02am

Hello everybody,

I’m getting really stuck with building an Argus → VPI → NPP interop chain, as I’m trying to implement [0].

My main question is what to do after wrapping CUDA-allocated memory in a VPIImage, like this:

VPIStatus initCudaBackedImage(VPIImage& image, Size2D<uint32_t> frame_size) {
  // Allocate pitched device memory; NPP writes the row pitch (in bytes) to image_data_pitch.
  int image_data_pitch = -1;
  Npp8u* image_cuda_data = nppiMalloc_8u_C1(frame_size.width(), frame_size.height(), &image_data_pitch);
  if (image_cuda_data == nullptr) {
    throw std::runtime_error{"Cannot allocate CUDA memory"};
  }

  // Describe the single 8-bit plane so that VPI can wrap the existing buffer.
  VPIImageData image_data;
  memset(&image_data, 0, sizeof(image_data));
  image_data.numPlanes = 1;
  image_data.format = VPI_IMAGE_FORMAT_U8;
  image_data.planes[0].width = frame_size.width();
  image_data.planes[0].height = frame_size.height();
  image_data.planes[0].pixelType = VPI_PIXEL_TYPE_U8;
  image_data.planes[0].pitchBytes = image_data_pitch;
  image_data.planes[0].data = image_cuda_data;

  return vpiImageCreateCUDAMemWrapper(&image_data, 0, &image);
}

How can you then execute an NPP function on that image?

So far, I can see two feasible options:

Lock the VPI image to get the CUDA pointer out - but that involves quite a bit of boilerplate.
Store the pointer next to the VPIImage - but then lose synchronization with the VPI functions further down the line.

My main processing loop looks like this, after I’ve grabbed the frame from Argus:

    if (input_fd == -1) {
      input_fd = iImageNativeBuffer->createNvBuffer(
          iOutputStream->getResolution(),
          NvBufferColorFormat_NV12_ER,
          NvBufferLayout_Pitch
      );
    } else {
      iImageNativeBuffer->copyToNvBuffer(input_fd);
    }

    VPI_SAFE_CALL(vpiImageCreateNvBufferWrapper(input_fd, nullptr, 0, &input_image));
    VPI_SAFE_CALL(vpiSubmitConvertImageFormat(vpi_stream, VPI_BACKEND_CUDA, input_image, gray_image, nullptr));

    // Call an NPP function here, e.g. nppiAlphaCompC_8u_C1R_Ctx

    // Continue with the VPI rescale algorithms

    VPI_SAFE_CALL(vpiStreamSync(vpi_stream));

What’s the suggested way to interop these pieces?

Can we expect that the samples and documentation will be expanded with more interop examples (as VPI is not as featureful as VisionWorks)?

Thanks!

[0] - Background subtraction and object detection - ScienceDirect

AastaLLL · April 27, 2021, 9:18am

Hi,

You should be able to launch the NPP function directly on the image_cuda_data buffer.
Are you running into any issues when doing this?

Thanks.

hbalint · April 27, 2021, 7:13pm

I think there are two main questions here:

Do you need to hold on to both the VPIImage and the CUDA-allocated memory at the same time, or is there an easier way to access the underlying memory (e.g. host functions)?

Do you need to execute the pipeline in segments, like the following:

// Execute VPI function 1
// VPI Synchronization

// Execute CUDA function 1
// CUDA Synchronization

// Execute VPI function 2
// VPI Synchronization

// ad nauseam...

Or can VPI (which, as I saw, can wrap CUDA streams) execute mixed VPI/CUDA functions and synchronize only once?

AastaLLL · May 7, 2021, 5:25am

Hi,

1.
The buffer pointer of a VPIImage can be accessed as follows:

VPIImage image;
...

VPIImageData data;
CHECK_STATUS(vpiImageLock(image, VPI_LOCK_READ_WRITE, &data));
std::cout << data.planes[0].data << std::endl;
CHECK_STATUS(vpiImageUnlock(image));

2.
You can mix VPI and CUDA functions.
But as with plain CUDA tasks, please attach all the jobs to the same CUDA stream.

Below is the way to wrap a cudaStream_t in a VPIStream, for your reference:
https://docs.nvidia.com/vpi/group__VPI__CUDAInterop.html#gad17561e640b20a13e47ef27687749982
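
Regarding the lock/unlock boilerplate from point 1: one way to keep it in a single place is a small RAII guard. The sketch below is not VPI code. FakeImage, FakeImageData, fakeLock and fakeUnlock are hypothetical stand-ins that only mimic the shape of vpiImageLock and vpiImageUnlock, so that the example compiles without the SDK. In a real pipeline the constructor and destructor would presumably call the VPI functions instead.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-ins (NOT part of VPI) that mimic the lock/unlock shape.
struct FakeImage {
  std::uint8_t* pixels = nullptr;
  int lock_count = 0;
};

struct FakeImageData {
  std::uint8_t* data = nullptr;
};

// Returns 0 on success and non-zero on failure, like a status code.
inline int fakeLock(FakeImage* img, FakeImageData* out) {
  if (img == nullptr || out == nullptr) return 1;
  ++img->lock_count;
  out->data = img->pixels;
  return 0;
}

inline int fakeUnlock(FakeImage* img) {
  if (img == nullptr || img->lock_count == 0) return 1;
  --img->lock_count;
  return 0;
}

// RAII guard: locks in the constructor, unlocks in the destructor.
class ScopedImageLock {
 public:
  explicit ScopedImageLock(FakeImage* img) : img_(img) {
    if (fakeLock(img_, &data_) != 0) {
      throw std::runtime_error{"Cannot lock image"};
    }
  }
  ~ScopedImageLock() { fakeUnlock(img_); }

  // Non-copyable, so the image is unlocked exactly once per guard.
  ScopedImageLock(const ScopedImageLock&) = delete;
  ScopedImageLock& operator=(const ScopedImageLock&) = delete;

  // The pointer should only be used while the guard is alive.
  std::uint8_t* data() const { return data_.data; }

 private:
  FakeImage* img_;
  FakeImageData data_;
};

// Usage: read one value while the lock is held; the lock is released on return.
inline std::uint8_t firstPixel(FakeImage* img) {
  ScopedImageLock lock(img);
  return lock.data()[0];
}
```

Because the destructor always unlocks, an early return or an exception inside the scope cannot leave the image locked.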

Thanks.