Improving GStreamer and OpenCV performance on the NVIDIA Jetson TX2

Hey! This is my second post about this. I decided to move on from my previous thread since my program is already working fine and now I'm only trying to improve its performance.

I've been trying to develop an app that receives an RTSP stream from a camera, processes it in OpenCV, and then streams it over UDP to a client. I need the output stream to run at a stable 25 FPS or better, which I have achieved with a 720p stream from the camera, but I'd like to try increasing the quality to 1080p.

The pipelines I’m using right now are:

 camera.open("rtspsrc location=rtsp://192.168.0.72:8554/video latency=0 ! application/x-rtp, media=video, encoding-name=H264, clock-rate=90000, payload=96 ! queue max-size-buffers=0 max-size-bytes=0 max-size-time=10 ! queue max-size-time=1 min-threshold-time=5 ! rtph264depay ! video/x-h264, stream-format=byte-stream, framerate=30/1 ! h264parse ! video/x-h264, stream-format=byte-stream, framerate=30/1 ! omxh264dec output-buffers=16 ! video/x-raw(memory:NVMM), format=NV12, framerate=30/1 ! nvvidconv output-buffers=30 ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! queue ! appsink ");
 pipeline = "appsrc ! queue ! videoconvert ! video/x-raw,width=" + to_string(img.cols) + ",height=" + to_string(img.rows) + ",framerate=" + to_string(num_fps) + "/1 ! nvvidconv ! video/x-raw(memory:NVMM) ! nvv4l2vp8enc iframeinterval=30 control-rate=0  bitrate=30000000 preset-level=0 maxperf-enable=true ! rtpvp8pay pt=100 ! udpsink host=224.1.1.1 port=5000 auto-multicast=true sync=false async=false";
 writer.open(pipeline, cv::CAP_GSTREAMER, 0, num_fps, cv::Size(img.cols, img.rows));
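
For context, the surrounding code is roughly the sketch below. It's simplified: the pipelines are shortened versions of the ones above, and the per-frame processing is just a placeholder.

 #include <opencv2/opencv.hpp>
 #include <string>

 int main() {
     const int num_fps = 25;  // target output frame rate

     cv::VideoCapture camera(
         "rtspsrc location=rtsp://192.168.0.72:8554/video latency=0"
         " ! rtph264depay ! h264parse ! omxh264dec ! nvvidconv"
         " ! video/x-raw, format=BGRx ! videoconvert"
         " ! video/x-raw, format=BGR ! appsink",
         cv::CAP_GSTREAMER);
     if (!camera.isOpened()) return 1;

     cv::Mat img;
     if (!camera.read(img)) return 1;  // first frame gives the dimensions

     const std::string pipeline =
         "appsrc ! queue ! videoconvert ! video/x-raw,width=" + std::to_string(img.cols) +
         ",height=" + std::to_string(img.rows) +
         ",framerate=" + std::to_string(num_fps) + "/1"
         " ! nvvidconv ! video/x-raw(memory:NVMM)"
         " ! nvv4l2vp8enc bitrate=30000000 maxperf-enable=true"
         " ! rtpvp8pay pt=100"
         " ! udpsink host=224.1.1.1 port=5000 auto-multicast=true sync=false async=false";
     cv::VideoWriter writer(pipeline, cv::CAP_GSTREAMER, 0, num_fps,
                            cv::Size(img.cols, img.rows));
     if (!writer.isOpened()) return 1;

     while (camera.read(img)) {
         // ... per-frame OpenCV processing on img goes here ...
         writer.write(img);
     }
     return 0;
 }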

This works fine for streaming the 720p video; I get a stable 25 FPS. But if I try to stream 1080p, it drops to 12 FPS and I haven't been able to raise it. I thought maybe I had reached the board's limit and it just wasn't possible to get more out of it using OpenCV (I know that without OpenCV it can stream even 4K), but the CPU/GPU usage seems to be quite low during the program's execution according to tegrastats:

RAM 4196/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [64%@1985,100%@2034,87%@2035,69%@1986,66%@1990,64%@1986] EMC_FREQ 17%@1866 GR3D_FREQ 32%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 1% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35.5C BCPU@37C thermal@36C Tdiode@33.75C VDD_SYS_GPU 1678/1678 VDD_SYS_SOC 1983/1983 VDD_4V0_WIFI 0/0 VDD_IN 10718/10718 VDD_SYS_CPU 3050/3050 VDD_SYS_DDR 2242/2242
RAM 4196/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [46%@1952,100%@2033,81%@2029,48%@1952,45%@1955,39%@1956] EMC_FREQ 17%@1866 GR3D_FREQ 24%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35C BCPU@37C thermal@36.2C Tdiode@34.25C VDD_SYS_GPU 1678/1678 VDD_SYS_SOC 1907/1945 VDD_4V0_WIFI 0/0 VDD_IN 10226/10472 VDD_SYS_CPU 2821/2935 VDD_SYS_DDR 2147/2194
RAM 4197/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [29%@1991,100%@2034,53%@2034,32%@1989,28%@1993,33%@1993] EMC_FREQ 15%@1866 GR3D_FREQ 40%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@36.5C MCPU@36.5C PMIC@100C Tboard@32C GPU@34.5C BCPU@36.5C thermal@35.7C Tdiode@33.25C VDD_SYS_GPU 1221/1525 VDD_SYS_SOC 1755/1881 VDD_4V0_WIFI 0/0 VDD_IN 8589/9844 VDD_SYS_CPU 2136/2669 VDD_SYS_DDR 1862/2083
RAM 4196/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [58%@2017,100%@2035,74%@2034,56%@2018,62%@2021,62%@2019] EMC_FREQ 16%@1866 GR3D_FREQ 18%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35C BCPU@37C thermal@36.2C Tdiode@34.25C VDD_SYS_GPU 1525/1525 VDD_SYS_SOC 1907/1888 VDD_4V0_WIFI 0/0 VDD_IN 10226/9939 VDD_SYS_CPU 2897/2726 VDD_SYS_DDR 2185/2109
RAM 4196/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [70%@2019,100%@2035,83%@2034,75%@2017,72%@2015,73%@2018] EMC_FREQ 19%@1866 GR3D_FREQ 24%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35C BCPU@37C thermal@36.7C Tdiode@34.25C VDD_SYS_GPU 1602/1540 VDD_SYS_SOC 1983/1907 VDD_4V0_WIFI 0/0 VDD_IN 10718/10095 VDD_SYS_CPU 3050/2790 VDD_SYS_DDR 2299/2147
RAM 4196/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [49%@2035,100%@2035,77%@2035,48%@2035,50%@2034,51%@2036] EMC_FREQ 18%@1866 GR3D_FREQ 9%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35C BCPU@37C thermal@36.2C Tdiode@34.25C VDD_SYS_GPU 1449/1525 VDD_SYS_SOC 1907/1907 VDD_4V0_WIFI 0/0 VDD_IN 10150/10104 VDD_SYS_CPU 2897/2808 VDD_SYS_DDR 2147/2147
RAM 4197/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [41%@2020,100%@2035,67%@2034,48%@2023,47%@2023,51%@2020] EMC_FREQ 17%@1866 GR3D_FREQ 29%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35.5C BCPU@37C thermal@36.7C Tdiode@34.25C VDD_SYS_GPU 1373/1503 VDD_SYS_SOC 1907/1907 VDD_4V0_WIFI 0/0 VDD_IN 9654/10040 VDD_SYS_CPU 2670/2788 VDD_SYS_DDR 2071/2136
RAM 4197/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [37%@2037,100%@2035,64%@2035,40%@2035,42%@2034,47%@2033] EMC_FREQ 15%@1866 GR3D_FREQ 21%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35C BCPU@37C thermal@36.2C Tdiode@34.25C VDD_SYS_GPU 1526/1506 VDD_SYS_SOC 1907/1907 VDD_4V0_WIFI 0/0 VDD_IN 9654/9991 VDD_SYS_CPU 2670/2773 VDD_SYS_DDR 2052/2125
RAM 4198/7852MB (lfb 406x4MB) SWAP 0/3926MB (cached 0MB) CPU [42%@2018,100%@2035,57%@2035,41%@2021,39%@2025,55%@2025] EMC_FREQ 15%@1866 GR3D_FREQ 7%@1300 NVENC 1164 NVDEC 1203 APE 150 MTS fg 0% bg 0% PLL@37C MCPU@37C PMIC@100C Tboard@32C GPU@35.5C BCPU@37C thermal@36.4C Tdiode@33.75C VDD_SYS_GPU 1373/1491 VDD_SYS_SOC 1908/1907 VDD_4V0_WIFI 0/0 VDD_IN 9463/9933 VDD_SYS_CPU 2517/2745 VDD_SYS_DDR 2052/2117

Is there a way to improve performance further? In the future, if this app works out, it will be the only thing running on the board, so I'd like to devote all of the board's resources to the program.

I was told in my last post that replacing appsrc and appsink by working with GstBuffers directly and converting them to OpenCV might be faster, but I have no clue how to do that (I've been reading about it but I'm still pretty confused).
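
From what I've pieced together so far, the idea seems to be to drop cv::VideoCapture, build the pipeline with gst_parse_launch, hang a new-sample callback on the appsink, and wrap each mapped GstBuffer in a cv::Mat header without copying. Here's my rough sketch of that; the callback wiring and caps are my guesses, so corrections are welcome:

 #include <gst/gst.h>
 #include <gst/app/gstappsink.h>
 #include <opencv2/opencv.hpp>

 // Called by the appsink on every decoded frame.
 static GstFlowReturn on_new_sample(GstAppSink *sink, gpointer /*user_data*/) {
     GstSample *sample = gst_app_sink_pull_sample(sink);
     if (!sample) return GST_FLOW_ERROR;

     // Read the frame dimensions from the negotiated caps.
     GstCaps *caps = gst_sample_get_caps(sample);
     GstStructure *s = gst_caps_get_structure(caps, 0);
     int width = 0, height = 0;
     gst_structure_get_int(s, "width", &width);
     gst_structure_get_int(s, "height", &height);

     GstBuffer *buffer = gst_sample_get_buffer(sample);
     GstMapInfo map;
     if (gst_buffer_map(buffer, &map, GST_MAP_READ)) {
         // Wrap the mapped bytes in a cv::Mat header -- no copy is made.
         cv::Mat frame(height, width, CV_8UC3, map.data);
         // ... per-frame OpenCV processing here (clone() if kept past unmap) ...
         gst_buffer_unmap(buffer, &map);
     }
     gst_sample_unref(sample);
     return GST_FLOW_OK;
 }

 int main(int argc, char **argv) {
     gst_init(&argc, &argv);
     GstElement *pipeline = gst_parse_launch(
         "rtspsrc location=rtsp://192.168.0.72:8554/video latency=0"
         " ! rtph264depay ! h264parse ! omxh264dec ! nvvidconv"
         " ! video/x-raw, format=BGRx ! videoconvert"
         " ! video/x-raw, format=BGR ! appsink name=sink", nullptr);
     GstElement *sink = gst_bin_get_by_name(GST_BIN(pipeline), "sink");

     GstAppSinkCallbacks callbacks = {};
     callbacks.new_sample = on_new_sample;
     gst_app_sink_set_callbacks(GST_APP_SINK(sink), &callbacks, nullptr, nullptr);

     gst_element_set_state(pipeline, GST_STATE_PLAYING);
     GMainLoop *loop = g_main_loop_new(nullptr, FALSE);
     g_main_loop_run(loop);
     return 0;
 }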

Also, if it helps, the bottleneck seems to be in appsink: a pure transcoding pipeline doing all of this runs fine at 35+ FPS with a 1080p stream, and the performance issues only appear when OpenCV is involved.

I’m using:

  • NVIDIA Jetson TX2
  • L4T 32.3.1 [ JetPack 4.3 ]
  • Ubuntu 18.04.4 LTS
  • Kernel Version: 4.9.140-tegra
  • CUDA 10.0.326
  • OpenCV 4.1.1
  • GStreamer 1.14.5-0ubuntu1~18.04.1

and I've run the jetson_clocks script and nvpmodel -m 0, but neither seems to change performance much.

Hi,
The suggestion is to run a GStreamer pipeline and call the NvBuffer APIs to get a cv::cuda::GpuMat.

You can also get a cv::Mat. Please refer to
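
In outline, that path looks like the sketch below. This is only an illustration: the RGBA/NVMM appsink caps, the EGL display setup, and the surrounding appsink code are assumed, and the calls should be checked against the jetson_multimedia_api headers (nvbuf_utils.h, cudaEGL.h).

 #include <gst/gst.h>
 #include <EGL/egl.h>
 #include <cuda.h>
 #include <cudaEGL.h>
 #include <nvbuf_utils.h>          // ExtractFdFromNvBuffer, NvEGLImageFromFd
 #include <opencv2/core/cuda.hpp>

 // Processes one appsink buffer whose caps are video/x-raw(memory:NVMM),
 // format=RGBA. Assumes a CUDA driver context is current and egl_display is
 // an initialized EGLDisplay; error handling is omitted for brevity.
 static void process_nvmm_buffer(GstBuffer *buffer, EGLDisplay egl_display) {
     GstMapInfo map;
     gst_buffer_map(buffer, &map, GST_MAP_READ);

     int dmabuf_fd = -1;
     ExtractFdFromNvBuffer(map.data, &dmabuf_fd);   // NVMM buffer -> dmabuf fd

     EGLImageKHR image = NvEGLImageFromFd(egl_display, dmabuf_fd);

     CUgraphicsResource resource = nullptr;
     cuGraphicsEGLRegisterImage(&resource, image,
                                CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
     CUeglFrame egl_frame;
     cuGraphicsResourceGetMappedEglFrame(&egl_frame, resource, 0, 0);
     cuCtxSynchronize();

     // Wrap the mapped device pointer in a GpuMat header -- no host copy.
     cv::cuda::GpuMat d_img(egl_frame.height, egl_frame.width, CV_8UC4,
                            egl_frame.frame.pPitch[0], egl_frame.pitch);
     // ... cv::cuda processing on d_img here; d_img.download(h_img) would
     // give a cv::Mat on the CPU if needed ...

     cuCtxSynchronize();
     cuGraphicsUnregisterResource(resource);
     NvDestroyEGLImage(egl_display, image);
     gst_buffer_unmap(buffer, &map);
 }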

This was really helpful, thank you!