Nvvidconv slow in multi-process: why does performance collapse when using multiple GStreamer processes?

Hello!

I am seeing a performance issue with nvvidconv on Jetson AGX Thor when running multiple GStreamer pipelines in parallel.

I tested two equivalent scenarios:

1) Multi-process

I launch ~30 identical pipelines, each in its own process:

gst-launch-1.0 filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! fakesink sync=false

using this script (which also adds the perf element for measurement):

#!/bin/bash

VIDEO="test_500.mp4"
N=30

PIPELINE='gst-launch-1.0 filesrc location='"$VIDEO"' ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false'

echo "Starting $N pipelines in parallel..."

for i in $(seq 1 $N); do
    echo "Launching pipeline $i"
    bash -c "$PIPELINE" &
done

wait
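
Since every gst-launch-1.0 process names its element perf0 and all 30 processes write to the same terminal, the per-pipeline numbers are easier to compare if each process logs to its own file. The following is a minimal sketch, not part of the original test: it assumes each process was redirected to a separate log file (the perf_$i.log naming is hypothetical) and that the perf lines have the form "perf: perf0; ...; mean_fps: 10.473". The function name summarize_perf is made up for illustration.

```bash
#!/bin/bash
# Summarize the last mean_fps value reported by each perf element.
# Elements are keyed by (log file, element name), so perf0 from
# different processes is counted separately.
summarize_perf() {
    awk '
        {
            # Locate "perf: <name>;" anywhere on the line.
            if (match($0, /perf: [^;]+;/)) {
                name = substr($0, RSTART + 6, RLENGTH - 7)
                # Keep only the most recent mean_fps for this element.
                if (match($0, /mean_fps: [0-9.]+/)) {
                    last[FILENAME SUBSEP name] = substr($0, RSTART + 10, RLENGTH - 10) + 0
                }
            }
        }
        END {
            n = 0; sum = 0
            for (k in last) { n++; sum += last[k] }
            if (n > 0) {
                printf "streams=%d total_mean_fps=%.3f avg_mean_fps=%.3f\n", n, sum, sum / n
            }
        }
    ' "$@"
}
```

To use it with the script above, the launch line could be changed to bash -c "$PIPELINE" > "perf_$i.log" 2>&1 & and, after wait, the logs summarized with summarize_perf perf_*.log. The same function works on a single log from the single-process test, where the elements are already named perf0, perf1, and so on.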

2) Single-process, multi-pipeline

I launch a single gst-launch-1.0 process containing ~30 identical pipelines, concatenated on the same command line:

gst-launch-1.0 filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! 
fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! 
fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv  ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false
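
For anyone reproducing this, the long command above can also be generated instead of pasted. This is only a sketch under assumptions: build_pipelines is a hypothetical helper name, and the video path is assumed to contain no spaces or shell metacharacters. It repeats the same branch used in the test N times.

```bash
#!/bin/bash
# Build N copies of the test branch as one gst-launch-1.0 argument string.
# The caps string keeps its double quotes so that the result can be passed
# to "bash -c" without the parentheses being interpreted by the shell.
build_pipelines() {
    local n="$1" video="$2" branch out="" i
    branch="filesrc location=$video ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! \"video/x-raw(memory:NVMM),format=RGBA\" ! perf ! fakesink sync=false"
    for ((i = 0; i < n; i++)); do
        out+="$branch "
    done
    # Strip the trailing space and print the result on one line.
    printf '%s\n' "${out% }"
}
```

The single-process test would then be started with bash -c "gst-launch-1.0 $(build_pipelines 30 test_500.mp4)", which should be equivalent to the concatenated command line shown above.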

In both cases, the same amount of video is decoded and converted.

However, the performance is dramatically different:

  • In single-process, performance is high: the VIC runs at ~1.1 GHz and throughput is good.

Using the perf plugin, I see around 60 fps for each pipeline.

Each branch reaches: perf: perf2; timestamp: 163:17:25.433509254; bps: 39168.000; mean_bps: 25632.000; fps: 67.787; mean_fps: 58.199

  • In multi-process, performance collapses: the VIC stays at ~225 MHz and overall throughput is much lower, even when clocks are forced (jetson_clocks, nvpmodel).

Using the perf plugin, I see around 6-9 fps for each pipeline.

perf: perf0; timestamp: 163:18:12.276524395; bps: 8064.000; mean_bps: 2419.200; fps: 9.676; mean_fps: 10.473
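
One possible way to sample the VIC clock while each test runs is sketched below. It is an assumption, not a confirmed method for Thor: it supposes the VIC is exposed as a devfreq device whose name contains "vic" under /sys/class/devfreq, as on some earlier Jetson modules, and that cur_freq is reported in Hz. The function name print_vic_freq is hypothetical.

```bash
#!/bin/bash
# Print the current frequency of every devfreq device whose name contains "vic".
# ASSUMPTION: the node location and naming are not confirmed for Jetson AGX Thor;
# pass a different base directory as the first argument if needed.
print_vic_freq() {
    local base="${1:-/sys/class/devfreq}" dev found=0
    for dev in "$base"/*vic*; do
        # Skip the unexpanded glob and unreadable nodes.
        [ -r "$dev/cur_freq" ] || continue
        printf '%s %s Hz\n' "$(basename "$dev")" "$(cat "$dev/cur_freq")"
        found=1
    done
    if [ "$found" -ne 1 ]; then
        echo "no VIC devfreq node found under $base" >&2
        return 1
    fi
}
```

Running it in a loop during both tests, for example while true; do print_vic_freq; sleep 1; done, would show whether the clock really stays low in the multi-process case for the whole run.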

My questions are:

Why does nvvidconv / NvBufSurfTransform perform so much worse when used from multiple processes, even though the workload is the same?

Is this related to CUDA contexts, NVMM buffer management, or inter-process synchronization (fences, driver locking, etc.)?

And more importantly:

Is there a supported way to use nvvidconv efficiently in multi-process, or is the NVIDIA video pipeline fundamentally designed for a single-process multi-pipeline architecture?

Thank you in advance for any insight.

Best regards

Hi,
For information, are you using JetPack 7.1 GA? We would like to confirm that you observe this on the latest JetPack 7 release.

Hi Dane,

Thanks for your reply. Yes, we can confirm that we observe this behavior on the latest JetPack 7 release as well.

Best regards