Efficient VPI and NPP interop

Hello everybody,

I’m getting really stuck with building an Argus → VPI → NPP interop chain, as I’m trying to implement [0].

Main question is that after allocating a VPIImage with CUDA memory, just like this:

VPIStatus initCudaBackedImage(VPIImage& image, Size2D<uint32_t> frame_size) {
  int image_data_pitch = -1;
  Npp8u* image_cuda_data = nppiMalloc_8u_C1(frame_size.width(), frame_size.height(), &image_data_pitch);
  if (image_cuda_data == nullptr) {
    throw std::runtime_error{"Cannot allocate CUDA memory"};
  }

  VPIImageData image_data;
  memset(&image_data, 0, sizeof(image_data));
  image_data.numPlanes = 1;
  image_data.format = VPI_IMAGE_FORMAT_U8;
  image_data.planes[0].width = frame_size.width();
  image_data.planes[0].height = frame_size.height();
  image_data.planes[0].pixelType = VPI_PIXEL_TYPE_U8;
  image_data.planes[0].pitchBytes = image_data_pitch;
  image_data.planes[0].data = image_cuda_data;

  return vpiImageCreateCUDAMemWrapper(&image_data, 0, &image);
}

how can I execute an NPP function on that image?

So far, I can see two feasible options:

  • Lock the VPI image to get the CUDA pointer out - but that involves quite a bit of boilerplate.
  • Store the pointer alongside the VPIImage - but then I lose synchronization with the VPI functions down the line.
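For reference, here is roughly what I mean by option 1 - a minimal sketch, not tested, assuming VPI 1.x's vpiImageLock/vpiImageUnlock; nppiSet_8u_C1R is just a stand-in for whatever NPP call is actually needed:

```cpp
#include <vpi/Image.h>
#include <npp.h>

// Sketch of option 1: lock the VPIImage to expose its CUDA pointer,
// run an NPP call on it, then unlock so VPI can use the image again.
void fillWithNpp(VPIImage image, Npp8u value)
{
  VPIImageData data;
  // VPI_LOCK_READ_WRITE gives us the device pointer and pitch.
  CHECK_STATUS(vpiImageLock(image, VPI_LOCK_READ_WRITE, &data));

  NppiSize roi{data.planes[0].width, data.planes[0].height};
  nppiSet_8u_C1R(value,
                 static_cast<Npp8u*>(data.planes[0].data),
                 data.planes[0].pitchBytes,
                 roi);

  // Unlock before submitting any further VPI work on this image.
  CHECK_STATUS(vpiImageUnlock(image));
}
```

Note that the non-_Ctx NPP calls run on NPP's current stream (the default stream unless changed via nppSetStream), which is part of why I worry about losing synchronization with the VPI stream.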

My main processing loop looks like this, after I’ve grabbed a frame from Argus:

    if (input_fd == -1) {
      input_fd = iImageNativeBuffer->createNvBuffer(/* ... */);
    } else {
      // ...
    }
    VPI_SAFE_CALL(vpiImageCreateNvBufferWrapper(input_fd, nullptr, 0, &input_image));
    VPI_SAFE_CALL(vpiSubmitConvertImageFormat(vpi_stream, VPI_BACKEND_CUDA, input_image, gray_image, nullptr));

    // Do here an e.g. nppiAlphaCompC_8u_C1R_Ctx call

    // Continue with the VPI rescale algorithms


What’s the suggested way to interop these pieces?

Can we expect that the samples and documentation will be expanded with more interop examples (as VPI is not as featureful as VisionWorks)?


[0] - Background subtraction and object detection - ScienceDirect


You should be able to launch the NPP function with the image_cuda_data buffer directly.
Do you hit any issues doing this?


I think there are two main questions here:

Do you need to keep hold of both the VPIImage and your CUDA-allocated memory at the same time, or are there easier ways to access the underlying memory (e.g. host functions)?

Do you need to execute the parts of the pipeline in segments, so you do the following:

// Execute VPI function 1
// VPI Synchronization

// Execute CUDA function 1
// CUDA Synchronization

// Execute VPI function 2
// VPI Synchronization

// ad nauseam...

Or can VPI (which, as I saw, can wrap CUDA streams) execute mixed VPI/CUDA functions and synchronize only once?


The buffer pointer from a VPIImage can be accessed as follows:

VPIImage image;

VPIImageData data;
CHECK_STATUS(vpiImageLock(image, VPI_LOCK_READ_WRITE, &data));
std::cout << data.planes[0].data << std::endl;
CHECK_STATUS(vpiImageUnlock(image));

You can mix VPI and CUDA functions.
But as with any CUDA tasks, please attach all the jobs to the same CUDA stream.
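To spell that out, here is a sketch (not tested) of pointing NPP at the same CUDA stream - it assumes you still hold the cudaStream_t you wrapped, and uses the _Ctx NPP variants (nppiSet_8u_C1R_Ctx is just a stand-in for the call you need):

```cpp
#include <npp.h>
#include <cuda_runtime.h>

// Sketch: run an NPP call on the same cudaStream_t that backs the VPIStream,
// so VPI and NPP work serialize on one stream and you synchronize once.
void runNppOnStream(cudaStream_t stream, Npp8u* dev_ptr, int pitch, NppiSize roi)
{
  NppStreamContext ctx;
  nppGetStreamContext(&ctx);  // fills in the device attributes
  ctx.hStream = stream;       // redirect NPP to the shared stream

  // Any *_Ctx NPP call now runs asynchronously on `stream`.
  nppiSet_8u_C1R_Ctx(0, dev_ptr, pitch, roi, ctx);
}
```

After queuing the NPP work this way, a single vpiStreamSync (or cudaStreamSynchronize on that stream) at the end should cover both the VPI and the NPP work.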

Below is the way to wrap a cudaStream_t into a VPIStream for your reference: