I have a large GStreamer/DeepStream application that uses CUDA, NVENC, and VIC, among other accelerators, through both NVIDIA and custom plugins. However, the accelerators appear to run slower when they are used from within that large application.

When I run and profile a small application while the large app is running, the small app achieves the accelerators' maximum performance.
Example: in the large app I use VPI OpticalFlowDense (in a custom plugin) and the nvv4l2h264enc plugin, and the NVENC tasks take 18 ms for both the video encoding and the optical flow. A small app that calls the same VPI OpticalFlowDense with the same parameters, directly and without a pipeline, takes 4 ms.

I’ve managed to capture both running at the same time in an Nsight Systems timeline:

The short tasks on NVENC1 are the small app’s optical flow calls, the long tasks on both NVENCs are the large app’s video encode and optical flow tasks.

I’m 99% sure it’s an issue on my part, but I’m hoping for some suggestions for things to try.

Additional information:

I have so far tried:

  • sudo renice -n -10 <pid of large app>
  • Adding queue elements to split the pipeline into threads
  • jetson_clocks

Both apps (and the custom plugins) are compiled with -O0 (no optimization).

Screenshots of the timelines of the VPI work queue threads that run OpticalFlowDense. (Note that these run simultaneously but belong to different UNIX processes.)
Large app:

Small app:

Please apply this to run the VIC engine at max clock:
Nvvideoconvert issue, nvvideoconvert in DS4 is better than Ds5? - #3 by DaneLLL

There is also a property in nvv4l2h264enc:

  maxperf-enable      : Enable or Disable Max Performance mode
                        flags: readable, writable, changeable only in NULL or READY state
                        Boolean. Default: false

Please enable the property and check if the performance gets better.
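
For illustration, here is a minimal Python sketch of where the property would go in a pipeline description. Only nvv4l2h264enc and maxperf-enable come from this thread; the source, converter, caps, and sink around them are placeholder assumptions and would need to be replaced with the elements of the real pipeline. The helper name build_encoder_pipeline is hypothetical.

```python
def build_encoder_pipeline(maxperf: bool = True) -> str:
    """Return a GStreamer pipeline description with maxperf-enable set.

    Everything except the encoder element and its maxperf-enable property
    is a placeholder assumption, not the original application's pipeline.
    """
    flag = "true" if maxperf else "false"
    elements = [
        "videotestsrc num-buffers=300",            # placeholder source (assumption)
        "nvvideoconvert",                          # placeholder converter (assumption)
        "video/x-raw(memory:NVMM),format=NV12",    # placeholder caps (assumption)
        f"nvv4l2h264enc maxperf-enable={flag}",    # property from the encoder's help text
        "fakesink",                                # placeholder sink (assumption)
    ]
    return " ! ".join(elements)


if __name__ == "__main__":
    # Print the description so it can be inspected or adapted.
    print(build_encoder_pipeline())
```

The resulting string is meant to be passed to a pipeline parser such as Gst.parse_launch; on a shell command line the caps would need quoting because of the parentheses. As the help text notes, the property can only be changed in the NULL or READY state, so it has to be set before the pipeline starts playing.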

In addition, for maximum GPU throughput, please run sudo nvpmodel -m 2 and sudo jetson_clocks. This selects 15W mode; you may also try 20W mode.
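
If these settings need to be applied from a launcher script, the two commands could be wrapped as in the sketch below. This is only an assumption about how one might script it: the function names are hypothetical, the mode number is left as a parameter because power mode IDs differ between Jetson modules, and the default is a dry run that returns the commands without executing them.

```python
import subprocess
from typing import List


def power_commands(mode: int = 2) -> List[List[str]]:
    """Build the two commands quoted in the reply above.

    The mode ID is board-specific; 2 is the value given in this thread.
    """
    return [
        ["sudo", "nvpmodel", "-m", str(mode)],
        ["sudo", "jetson_clocks"],
    ]


def apply_power_settings(mode: int = 2, dry_run: bool = True) -> List[str]:
    """Run the commands in order, or only list them when dry_run is True."""
    applied = []
    for cmd in power_commands(mode):
        if not dry_run:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
        applied.append(" ".join(cmd))
    return applied


if __name__ == "__main__":
    # Dry run by default: show what would be executed.
    for line in apply_power_settings():
        print(line)
```

Running nvpmodel before jetson_clocks matches the order given above, so the clocks are raised within the limits of the newly selected power mode.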

