Performance issues using vpiImageCreate*Wrapper


I’m having performance issues on JetPack 4.6.3 with the following VPI wrapper and buffer functions:

vpiImageCreateCUDAMemWrapper, vpiImageCreateNvBufferWrapper, and also NvBufferCreate

For vpiImageCreateCUDAMemWrapper, if I fill a VPIImageData with existing CUDA GPU memory, the call takes around 5-7 ms. However, if I use pinned memory as the input, it takes around 500-700 us. Should the performance be the same in both cases? Also, 500-600 us seems like a long time considering it should just be wrapping the memory. When I inspect the pointer held by the VPIImage that vpiImageCreateCUDAMemWrapper returns by reference, it appears to be a new pointer rather than my GPU pointer, whereas with pinned memory it is the same pointer I passed in. Does this function only want pinned memory, so that you can’t keep everything in raw CUDA GPU memory? If I give it a GPU pointer that is not pinned, does it allocate pinned memory and copy the GPU data into it? This all seems odd to me.

For vpiImageCreateNvBufferWrapper, the call always seems to take 4-7 ms, which is a lot of time considering that it should just be wrapping the memory.

For NvBufferCreate, the call can take around 1 ms or more.

Why is the performance of these functions so poor? According to the performance charts for the VPI algorithms, VPI performs a lot better than OpenCV, but there is no point in using VPI if wrapping a pointer into a VPIImage object takes this long.


Which backend do you use? Is it CUDA?
And which algorithm do you use?

It would be helpful if you could share some sample code so that we can learn more about your use case.



Here’s some sample code I put together. The main file has comments describing the issues and questions I have:
vpiPerformance.tar.gz (2.5 KB)

But let’s ignore the algorithms for the moment. I’m only talking about the cost of wrapping existing cudaMalloc pointers, CPU pinned pointers, and DMA buffer FDs with the VPI wrapper functions. After that, we can talk about the performance of specific algorithms.

First, as mentioned in my original post, when I wrap a cudaMalloc pointer, VPI seems to create a new pointer under the hood instead of using the existing one. If I use pinned memory, it uses my pointer as expected, but the call still takes almost 1 ms.

I’m using the ALL backend. I have also tried the CUDA backend, and both produce the same results.

Could you reread my original post, answer the questions I listed, and tell me whether the behavior I described is expected?

Wanted to reply to see if there was any update on this?


Thanks for the source.

Since VPI has newer releases (2.x), could you check whether the same issue also occurs on JetPack 5.0.2?

Please note that we have integrated all the image wrappers into vpiImageCreateWrapper.
Some changes are required for your source to run with VPI 2.


@AastaLLL Unfortunately, I cannot upgrade the JetPack version on my devices at the moment. Some of my hardware needs the older version, and drivers for the newer versions haven’t been released yet.

Unfortunately, I don’t have any spare Jetson devices that I can flash the newer JetPack to test. If you run this example code on JetPack, do you get the same results as I do? If it ends up being a bug, is there any way to fix or get around it?


We will test your source on VPI 2.1 to see if this issue occurs on our latest software first.



Sorry for the late update.

Have you tried to set the flag value when wrapping the buffer?

If a buffer is created by cudaMalloc, only the GPU has access to it.
So if you create the wrapper with the default flags (all backends enabled), VPI also has to support the CPU backend and might need to copy the buffer.

However, pinned memory can be accessed via both CPU and GPU.
So no memory copy is required.
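
One way to check this on the device is to time the wrapper call once with the default flags and once with only the CUDA backend enabled. The sketch below is a minimal, generic timing helper in standard C++. The name `elapsedMicroseconds` is hypothetical and not part of VPI, and the VPI call shown in the comment assumes the VPI 1.x argument order of vpiImageCreateCUDAMemWrapper (image data, flags, output image).

```cpp
#include <chrono>
#include <thread>
#include <utility>

// Hypothetical helper (not part of VPI): runs `fn` once and returns the
// elapsed wall-clock time in microseconds, measured with a monotonic clock.
template <typename Fn>
double elapsedMicroseconds(Fn &&fn)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

// Intended use on the Jetson (assumption: `data` is a filled VPIImageData
// describing the cudaMalloc buffer and `img` is a VPIImage handle):
//
//   double cudaOnly = elapsedMicroseconds([&] {
//       vpiImageCreateCUDAMemWrapper(&data, VPI_BACKEND_CUDA, &img);
//   });
//
// Repeating the measurement with flags = 0 (all backends) would show
// whether restricting the backends changes the cost of wrapping.
```

If the CUDA-only measurement is much lower than the all-backends one, that would be consistent with the copy described above, though the timings alone would not prove where the time goes.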

