I’m having performance issues on JetPack 4.6.3 with the VPI wrapper functions vpiImageCreateCUDAMemWrapper and vpiImageCreateNvBufferWrapper.
For the case of vpiImageCreateCUDAMemWrapper: if I build a VPIImageData from existing CUDA GPU memory, this function takes around 5-7 ms to run… However, if I use pinned memory as the input, it takes around 500-700 us… Shouldn’t the performance be the same in both cases? Even 500-600 us seems like a LONG time, considering the function should just be wrapping the memory… When I inspect the data pointer of the VPIImage that vpiImageCreateCUDAMemWrapper returns by reference, it appears to allocate a new pointer for my GPU memory, whereas it reuses my pointer when I pass pinned memory. Does this function only want pinned memory, so you can’t just keep everything in raw CUDA device memory? In other words, if I give it a GPU pointer that is NOT pinned, will it allocate pinned memory and copy the GPU data into it? This all seems odd to me…
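For reference, here is a minimal sketch of the call pattern I’m describing, assuming the VPI 1.x API that ships with JetPack 4.6 (single-plane U8 image; struct field names per the VPI 1.x headers, and wrapDeviceBuffer is just an illustrative helper — error checking omitted):

```c
#include <string.h>
#include <cuda_runtime.h>
#include <vpi/Image.h>

/* Sketch (VPI 1.x / JetPack 4.6): wrap an existing cudaMalloc'd buffer
 * as a single-plane U8 VPIImage. Error checking omitted for brevity. */
VPIImage wrapDeviceBuffer(void *devPtr, int width, int height, int pitchBytes)
{
    VPIImageData data;
    memset(&data, 0, sizeof(data));
    data.format               = VPI_IMAGE_FORMAT_U8;
    data.numPlanes            = 1;
    data.planes[0].width      = width;
    data.planes[0].height     = height;
    data.planes[0].pitchBytes = pitchBytes;
    data.planes[0].data       = devPtr;   /* existing cudaMalloc pointer */

    VPIImage img = NULL;
    /* The flags argument (here restricted to the CUDA backend rather
     * than 0 = all backends) is the knob discussed later in this thread. */
    vpiImageCreateCUDAMemWrapper(&data, VPI_BACKEND_CUDA, &img);
    return img;
}
```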
For the case of vpiImageCreateNvBufferWrapper, the performance is always around 4-7 ms, which is a lot of time considering that it should just be wrapping the memory… As for NvBufferCreate, it can take upwards of around 1 ms…
Why is the performance of these functions so poor? Based on the performance charts for the various algorithms VPI provides, it performs a lot better than OpenCV, but there’s no point in using VPI if wrapping a pointer into a VPIImage object takes this long…
Which backend do you use? Do you use CUDA?
And which algorithm do you use?
It would be good if you could share some sample code with us so we can learn more about your use case.
Here’s some sample code I produced; there are comments in the main file describing the issues I have and my questions:
vpiPerformance.tar.gz (2.5 KB)
But let’s ignore the algorithms for the moment. I’m just talking about the performance issue of wrapping existing cudaMalloc pointers, pinned CPU pointers, and DMA buffer FDs using the VPI wrapper functions. After that, we can talk about specific algorithm performance.
First, as mentioned in my opening paragraph: when I wrap a cudaMalloc pointer, VPI seems to create a new pointer under the hood instead of using the existing one. If I use pinned memory, it uses my pointer properly, but the function still takes almost 1 ms.
I’m using the ALL backend, and I have also tried the CUDA backend; both produce the same results.
Could you reread my question, address the questions and problems I listed, and tell me whether this behavior is expected?
I wanted to follow up to see if there is any update on this.
Thanks for the source.
Since VPI has newer releases (2.x), could you check whether the same issue also occurs on JetPack 5.0.2?
Please note that all the image wrappers have been integrated into a single function, vpiImageCreateWrapper, so some changes are required for your source to run with VPI 2.
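As an illustration of the migration, here is a rough sketch of the equivalent wrap under the VPI 2.x API, where the memory kind is selected through VPIImageData::bufferType instead of a per-type wrapper function (field names per the VPI 2.x headers; wrapDeviceBufferV2 is an illustrative helper, error checking omitted):

```c
#include <string.h>
#include <vpi/Image.h>

/* Sketch (VPI 2.x): the per-type wrappers are replaced by a single
 * vpiImageCreateWrapper() call; pitch-linear CUDA memory is indicated
 * via the bufferType field. */
VPIImage wrapDeviceBufferV2(void *devPtr, int width, int height, int pitchBytes)
{
    VPIImageData data;
    memset(&data, 0, sizeof(data));
    data.bufferType = VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR;
    data.buffer.pitch.format               = VPI_IMAGE_FORMAT_U8;
    data.buffer.pitch.numPlanes            = 1;
    data.buffer.pitch.planes[0].width      = width;
    data.buffer.pitch.planes[0].height     = height;
    data.buffer.pitch.planes[0].pitchBytes = pitchBytes;
    data.buffer.pitch.planes[0].data       = devPtr;

    VPIImage img = NULL;
    /* params (2nd argument) can be NULL for default wrapping behavior. */
    vpiImageCreateWrapper(&data, NULL, VPI_BACKEND_CUDA, &img);
    return img;
}
```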
@AastaLLL Unfortunately I cannot upgrade my Jetpack version on the devices at the moment. I have some hardware that needs the older version and they haven’t released drivers for the newer versions yet.
Unfortunately, I don’t have any spare Jetson devices that I can flash the newer JetPack to test. If you run this example code on JetPack, do you get the same results as I do? If it ends up being a bug, is there any way to fix or get around it?
We will first test your source on VPI 2.1 to see if this issue occurs with our latest software.
Sorry for the late update.
Have you tried setting the flags value when wrapping the buffer?
If a buffer is created by cudaMalloc, only the GPU has access to it.
So if you create the wrapper with the default all-backends flag, VPI also needs to handle the CPU backend and might need to copy the buffer.
However, pinned memory can be accessed by both the CPU and the GPU, so no memory copy is required.
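The distinction between the two allocation kinds can be sketched as follows (requires a CUDA-capable device; buffer size is arbitrary for illustration):

```c
#include <cuda_runtime.h>

int main(void)
{
    void *devPtr = NULL, *pinnedPtr = NULL;
    const size_t size = (size_t)1920 * 1080;

    /* Device memory: only the GPU can dereference this pointer, so a
     * wrapper created for all backends may have to stage a CPU-visible
     * copy to serve the CPU backend. */
    cudaMalloc(&devPtr, size);

    /* Pinned (page-locked) host memory: accessible from both CPU and
     * GPU, so the wrapper can reuse the pointer directly with no copy. */
    cudaHostAlloc(&pinnedPtr, size, cudaHostAllocDefault);

    cudaFreeHost(pinnedPtr);
    cudaFree(devPtr);
    return 0;
}
```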