Hello,
I am seeing a performance issue with nvvidconv on Jetson AGX Thor when running multiple GStreamer pipelines in parallel.
I tested two equivalent scenarios:
1) Multi-process
I launch ~30 identical pipelines, each in its own process:
gst-launch-1.0 filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! fakesink sync=false
I start them with this script, which also adds the perf element to each pipeline:
#!/bin/bash
# Launch N identical pipelines, each in its own gst-launch-1.0 process.
VIDEO="test_500.mp4"
N=30
PIPELINE='gst-launch-1.0 filesrc location='"$VIDEO"' ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false'
echo "Starting $N pipelines in parallel..."
for i in $(seq 1 "$N"); do
    echo "Launching pipeline $i"
    bash -c "$PIPELINE" &
done
# Block until every background pipeline has exited.
wait
2) Single-process, multi-pipeline
I launch a single gst-launch-1.0 process containing ~30 identical pipelines, concatenated on the same command line:
gst-launch-1.0 filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! 
qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! 
"video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false filesrc location=test_500.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! perf ! fakesink sync=false
In both cases, the same amount of video is decoded and converted.
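For anyone reproducing this, the single-process command does not have to be concatenated by hand. The following is a minimal sketch that builds it from the same branch used in the multi-process script. The function name `build_cmd` and the `RUN` switch are hypothetical names chosen for illustration; the pipeline elements are copied from the commands above. By default it only prints the command.

```shell
#!/bin/bash
# Sketch: build ONE gst-launch-1.0 command line containing N independent
# branches. build_cmd and RUN are illustrative names, not part of GStreamer.
VIDEO="test_500.mp4"
N=30

build_cmd() {
    local n="$1" video="$2" cmd="gst-launch-1.0" i
    # Same branch as in the multi-process script; the caps stay quoted so
    # the shell does not try to interpret the parentheses.
    local branch="filesrc location=$video ! qtdemux ! h264parse ! nvv4l2decoder ! nvvidconv ! \"video/x-raw(memory:NVMM),format=RGBA\" ! perf ! fakesink sync=false"
    for ((i = 0; i < n; i++)); do
        cmd+=" $branch"
    done
    printf '%s\n' "$cmd"
}

CMD="$(build_cmd "$N" "$VIDEO")"
echo "$CMD"

# Dry run by default; set RUN=1 to actually start the pipelines.
if [ "${RUN:-0}" = "1" ]; then
    bash -c "$CMD"
fi
```

With the same N and VIDEO, this runs exactly the branches that the multi-process script launches one per process.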
However, the performance is dramatically different:
- In the single-process case, performance is high: the VIC runs at ~1.1 GHz and throughput is good.
The perf plugin reports around 60 fps for each pipeline.
Each branch reaches: perf: perf2; timestamp: 163:17:25.433509254; bps: 39168.000; mean_bps: 25632.000; fps: 67.787; mean_fps: 58.199
- In the multi-process case, performance collapses: the VIC stays at ~225 MHz and overall throughput is much lower, even when clocks are forced (jetson_clocks, nvpmodel).
The perf plugin reports around 6-9 fps for each pipeline, for example:
perf: perf0; timestamp: 163:18:12.276524395; bps: 8064.000; mean_bps: 2419.200; fps: 9.676; mean_fps: 10.473
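To compare the two cases over a whole run rather than from single log lines, the perf output can be averaged. This is a rough sketch that assumes the output of each run was saved to a file (the name `multi_process.log` below is hypothetical) and that every perf line has the `perf: ...; fps: ...; mean_fps: ...` layout shown above. The helper name `summarize_fps` is also just illustrative.

```shell
#!/bin/bash
# Sketch: average the instantaneous "fps" field over all perf lines read
# from stdin. Assumes the "key: value; key: value" layout shown in the post.
summarize_fps() {
    awk '
        /perf: / && /; fps: / {
            # Fields are separated by "; ". Only the field that starts with
            # "fps: " is used, so "mean_fps: " is not counted.
            n = split($0, fields, "; ")
            for (i = 1; i <= n; i++) {
                if (fields[i] ~ /^fps: /) {
                    sub(/^fps: /, "", fields[i])
                    sum += fields[i]
                    count++
                }
            }
        }
        END {
            if (count > 0) printf "samples=%d avg_fps=%.3f\n", count, sum / count
            else print "samples=0"
        }
    '
}

# Optional: pass a log file as the first argument to summarise it.
if [ -n "${1:-}" ]; then
    summarize_fps < "$1"
fi
```

For example, after capturing a run with something like `./run_multi.sh > multi_process.log 2>&1` (the script name is a placeholder), `summarize_fps < multi_process.log` prints the number of samples and their average fps across all pipelines in that log.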
My questions are:
Why does nvvidconv / NvBufSurfTransform perform so much worse when used from multiple processes, even though the workload is the same?
Is this related to CUDA contexts, NVMM buffer management, or inter-process synchronization (fences, driver locking, etc.)?
And more importantly:
Is there a supported way to use nvvidconv efficiently from multiple processes, or is the NVIDIA video pipeline fundamentally designed for a single-process, multi-pipeline architecture?
Thank you in advance for any insight.
Best regards