I’m having performance issues on JetPack 4.6.3 with the VPI wrapper functions vpiImageCreateCUDAMemWrapper and vpiImageCreateNvBufferWrapper.
For the case of vpiImageCreateCUDAMemWrapper: if I build a VPIImageData from existing CUDA GPU memory, this function takes around 5-7 ms to run… However, if I use pinned memory as the input, it takes around 500-700 us… Shouldn’t the performance be the same in both cases? Even 500-600 us seems like a LONG time, considering the function should just be wrapping the memory… When I inspect the data pointer of the VPIImage that vpiImageCreateCUDAMemWrapper returns by reference, it appears to allocate a new pointer for my GPU memory, whereas it reuses my pointer when I pass pinned memory. Does this function only want pinned memory, so you can’t just keep everything in raw CUDA device memory? In other words, if I give it a GPU pointer that is NOT pinned, will it allocate pinned memory and copy the GPU data into it? This all seems odd to me…
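For reference, here is a minimal sketch of the call pattern I’m describing, assuming the VPI 1.x API that ships with JetPack 4.6 (single-plane U8 image; struct field names per the VPI 1.x headers, and wrapDeviceBuffer is just an illustrative helper — error checking omitted):

```c
#include <string.h>
#include <cuda_runtime.h>
#include <vpi/Image.h>

/* Sketch (VPI 1.x / JetPack 4.6): wrap an existing cudaMalloc'd buffer
 * as a single-plane U8 VPIImage. Error checking omitted for brevity. */
VPIImage wrapDeviceBuffer(void *devPtr, int width, int height, int pitchBytes)
{
    VPIImageData data;
    memset(&data, 0, sizeof(data));
    data.format               = VPI_IMAGE_FORMAT_U8;
    data.numPlanes            = 1;
    data.planes[0].width      = width;
    data.planes[0].height     = height;
    data.planes[0].pitchBytes = pitchBytes;
    data.planes[0].data       = devPtr;   /* existing cudaMalloc pointer */

    VPIImage img = NULL;
    /* The flags argument (here restricted to the CUDA backend rather
     * than 0 = all backends) is the knob discussed later in this thread. */
    vpiImageCreateCUDAMemWrapper(&data, VPI_BACKEND_CUDA, &img);
    return img;
}
```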
For the case of vpiImageCreateNvBufferWrapper, the performance is always around 4-7 ms, which is a lot of time considering that it should just be wrapping the memory… As for NvBufferCreate, it can take upwards of around 1 ms…
Why is the performance of these functions so poor? Based on the performance charts for the various algorithms VPI provides, it performs a lot better than OpenCV, but there’s no point in using VPI if wrapping a pointer into a VPIImage object takes this long…
Which backend do you use? Do you use CUDA?
And which algorithm do you use?
It would be good if you could share some sample code with us so we can learn more about your use case.
Here’s some sample code I produced; there are comments in the main file describing the issues I have and my questions:
vpiPerformance.tar.gz (2.5 KB)
But let’s ignore the algorithms for the moment. I’m just talking about the performance issue of wrapping existing cudaMalloc pointers, pinned CPU pointers, and DMA buffer FDs using the VPI wrapper functions. After that, we can talk about specific algorithm performance.
First, as mentioned in my opening paragraph: when I wrap a cudaMalloc pointer, VPI seems to create a new pointer under the hood instead of using the existing one. If I use pinned memory, it uses my pointer properly, but the function still takes almost 1 ms.
I’m using the ALL backend, and I have also tried the CUDA backend; both produce the same results.
Could you reread my question, address the questions and problems I listed, and tell me whether this behavior is expected?
I wanted to follow up to see if there is any update on this.
Thanks for the source.
Since VPI has newer releases (2.x), could you check whether the same issue also occurs on JetPack 5.0.2?
Please note that all the image wrappers have been integrated into a single function, vpiImageCreateWrapper, so some changes are required for your source to run with VPI 2.
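As an illustration of the migration, here is a rough sketch of the equivalent wrap under the VPI 2.x API, where the memory kind is selected through VPIImageData::bufferType instead of a per-type wrapper function (field names per the VPI 2.x headers; wrapDeviceBufferV2 is an illustrative helper, error checking omitted):

```c
#include <string.h>
#include <vpi/Image.h>

/* Sketch (VPI 2.x): the per-type wrappers are replaced by a single
 * vpiImageCreateWrapper() call; pitch-linear CUDA memory is indicated
 * via the bufferType field. */
VPIImage wrapDeviceBufferV2(void *devPtr, int width, int height, int pitchBytes)
{
    VPIImageData data;
    memset(&data, 0, sizeof(data));
    data.bufferType = VPI_IMAGE_BUFFER_CUDA_PITCH_LINEAR;
    data.buffer.pitch.format               = VPI_IMAGE_FORMAT_U8;
    data.buffer.pitch.numPlanes            = 1;
    data.buffer.pitch.planes[0].width      = width;
    data.buffer.pitch.planes[0].height     = height;
    data.buffer.pitch.planes[0].pitchBytes = pitchBytes;
    data.buffer.pitch.planes[0].data       = devPtr;

    VPIImage img = NULL;
    /* params (2nd argument) can be NULL for default wrapping behavior. */
    vpiImageCreateWrapper(&data, NULL, VPI_BACKEND_CUDA, &img);
    return img;
}
```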
@AastaLLL Unfortunately I cannot upgrade my Jetpack version on the devices at the moment. I have some hardware that needs the older version and they haven’t released drivers for the newer versions yet.
Unfortunately, I don’t have any spare Jetson devices that I can flash the newer JetPack to test. If you run this example code on JetPack, do you get the same results as I do? If it ends up being a bug, is there any way to fix or get around it?
We will first test your source on VPI 2.1 to see if this issue occurs with our latest software.
Sorry for the late update.
Have you tried setting the flags value when wrapping the buffer?
If a buffer is created by cudaMalloc, only the GPU has access to it.
So if you create the wrapper with the default all-backends flag, VPI also needs to handle the CPU backend and might need to copy the buffer.
However, pinned memory can be accessed by both the CPU and the GPU, so no memory copy is required.
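The distinction between the two allocation kinds can be sketched as follows (requires a CUDA-capable device; buffer size is arbitrary for illustration):

```c
#include <cuda_runtime.h>

int main(void)
{
    void *devPtr = NULL, *pinnedPtr = NULL;
    const size_t size = (size_t)1920 * 1080;

    /* Device memory: only the GPU can dereference this pointer, so a
     * wrapper created for all backends may have to stage a CPU-visible
     * copy to serve the CPU backend. */
    cudaMalloc(&devPtr, size);

    /* Pinned (page-locked) host memory: accessible from both CPU and
     * GPU, so the wrapper can reuse the pointer directly with no copy. */
    cudaHostAlloc(&pinnedPtr, size, cudaHostAllocDefault);

    cudaFreeHost(pinnedPtr);
    cudaFree(devPtr);
    return 0;
}
```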