VPI performance, how to link to GStreamer pipeline

Hi, I am running a simple VPI pipeline that has the following operations: image rescale, perspective warp, and finally convert image format. I want to have these operations run in real-time on a camera stream, and encode the output to H264 to send it out over the network.

I’ve attached a bare version of the code that I use for this pipeline. I use libArgus to capture the camera frames, then wrap the buffer into a VPIImage, and perform the operations, where I use VPIEvents to do some timing.

I am not getting the performance I expected. For example, the perspective warp operation typically takes >3ms, whereas based on the performance benchmarks in the documentation (VPI - Vision Programming Interface: Perspective Warp) I would expect <1ms (Jetson Nano, CUDA backend, image: 1920x1080 / NV12ER, linear interpolation).
Also the image format conversion operation seems to take too long (4-5ms instead of 1-2ms).
I am not sure where this problem comes from; could it be that the VPIEvents introduce some overhead? Interestingly, when I run the same operations in the benchmarking code from here: VPI - Vision Programming Interface: Benchmarking, the timings do match the performance tables in the documentation. The only clear difference I see is that the tests there are batched and the timings averaged, so that fewer events are recorded.

In addition, I would like to pipe the output of these operations to an H264 encoder, preferably through GStreamer. What is the most efficient way to do this? I have tried copying the data into a cv::Mat (as in the attached code, except I use imshow for debugging there) and then feeding it to a GStreamer pipeline, but this is slow, I suspect because of CPU<->GPU memory copies.
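For reference, a hedged sketch of the kind of pipeline I am aiming for: keeping frames in NVMM (GPU) memory end to end and only handing them to the hardware encoder, so no CPU<->GPU copies happen. The host address and port here are placeholders, and on older L4T releases the encoder element may be omxh264enc instead of nvv4l2h264enc:

```shell
# Hypothetical pipeline: frames stay in NVMM memory through conversion
# and hardware H264 encoding; only the encoded bitstream reaches the CPU.
gst-launch-1.0 nvarguscamerasrc ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,format=NV12,framerate=30/1' ! \
  nvvidconv ! 'video/x-raw(memory:NVMM),format=NV12' ! \
  nvv4l2h264enc insert-sps-pps=true ! h264parse ! \
  rtph264pay config-interval=1 ! udpsink host=192.168.1.100 port=5000
```

The open question is how to slot the VPI operations (rescale, perspective warp) into the middle of such a pipeline without dropping out of NVMM memory.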
main.cpp (9.5 KB)

You may try to start from this topic.

Hi,

The table is generated in a batched manner.
So the numbers are closer to a throughput result than to a latency result.

In case you haven’t already, have you first maximized the clocks as described below?
https://docs.nvidia.com/vpi/algo_performance.html#maxout_clocks

Thanks.

I did run this script to maximize the performance before running my program, and the benchmarking programs do reproduce the published timings, so I assume my Nano is running at the proper clock speeds.

I am not sure what exactly you mean by “the performance is more like throughput result rather than latency”.
Does this mean these timings are not achievable in regular applications?
I did notice that in the benchmarking code the operations are run in a batch, with the timing events recording the time taken for 50 consecutive operations, and then dividing the result by 50 to get the average time per image operation.
I was wondering whether this means some fixed overhead (the VPIEvents themselves? some memory allocation?) is amortized across the batch, but skews my results when I time just a single operation?

Hi @jwjw,

You might want to check RidgeRun’s GStreamer plugin for using VPI algorithms. This could simplify the H264 encoding, since everything would be processed within the GStreamer environment.

https://developer.ridgerun.com/wiki/index.php?title=NVIDIA_VPI_GStreamer_Plug-in

We have an evaluation version in case you want to try it.

Regards,
Jimena Salas