How can I customize nvv4l2h264enc to use non-blocking mode?

I need to encode video from 8 cameras (1920x1080 @ 30 Hz) in parallel, and found that the per-frame time cost grows substantially as cameras are added: 10 ms, 13 ms, 14 ms, 23 ms, …, 38 ms for 1, 2, 3, 4, …, 8 cameras. The gst pipeline is as follows:

const char *gst_cmd_format =
        "appsrc name=appsrc"
        " ! video/x-raw,format=%s,width=%d,height=%d,framerate=(fraction)%d/%d"
        " ! nvvidconv"
        " ! video/x-raw(memory:NVMM),format=NV12,width=%d,height=%d,"
        " framerate=(fraction)%d/%d"
        " ! nvv4l2h264enc control-rate=constant_bitrate bitrate=%d"
        /**
         * All IDR frame configuration
         *   all-iframe, self-implemented param
         *   insert-sps-pps is needed
         * maxperf-enable: time-cost configuration
         */
        " all-iframe=true insert-sps-pps=true maxperf-enable=true"
        " ! appsink name=appsink";

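For reference, a minimal sketch of how a template like this could be expanded into a launch string with snprintf. The helper name build_pipeline and the example values (BGRx input, 8 Mbit/s) are assumptions for illustration, not part of our actual code, and all-iframe remains our own self-implemented property rather than a stock nvv4l2h264enc one.

```c
#include <stdio.h>

/* Pipeline template as posted above; all-iframe is a self-implemented
 * property and is not available in the stock plugin. */
static const char *gst_cmd_format =
    "appsrc name=appsrc"
    " ! video/x-raw,format=%s,width=%d,height=%d,framerate=(fraction)%d/%d"
    " ! nvvidconv"
    " ! video/x-raw(memory:NVMM),format=NV12,width=%d,height=%d,"
    " framerate=(fraction)%d/%d"
    " ! nvv4l2h264enc control-rate=constant_bitrate bitrate=%d"
    " all-iframe=true insert-sps-pps=true maxperf-enable=true"
    " ! appsink name=appsink";

/* Hypothetical helper: expands the template into buf.
 * Returns 0 on success, -1 if formatting fails or buf is too small. */
int build_pipeline(char *buf, size_t len, const char *format,
                   int width, int height, int fps_n, int fps_d, int bitrate)
{
    int n = snprintf(buf, len, gst_cmd_format,
                     format, width, height, fps_n, fps_d,   /* appsrc caps */
                     width, height, fps_n, fps_d,           /* NVMM caps   */
                     bitrate);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A call such as build_pipeline(buf, sizeof buf, "BGRx", 1920, 1080, 30, 1, 8000000) fills buf with a string that can be handed to gst_parse_launch(); checking the return value guards against a silently truncated pipeline description.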
I think it is related to the “Opening in BLOCKING MODE” log message. As mentioned in “Disable blocking mode in encoding with opencv-gstreamer and nvv4l2h264enc - #3 by DaneLLL”, I would like to continue customizing gst-v4l2, following our all-iframe implementation. Could you point me to some references?

Hi,
In this pipeline, copying frame data from the CPU buffer to the NVMM buffer requires high CPU usage, so CPU capability may cap the performance. Please execute sudo nvpmodel -m 2 and sudo jetson_clocks to run all CPU cores at max clock. You can execute sudo tegrastats to check the system loading.

The capability of hardware encoder is listed in
https://developer.nvidia.com/jetson-xavier-nx-data-sheet
It can achieve 10x1080p30, so the performance should not be capped by the hardware encoder. It is more likely that CPU capability caps the performance.

Thanks for your quick reply.

Actually, we have already applied this configuration on the Jetson Xavier NX platform:

nvidia@nvidia-desktop:~$ jetson_release 
 - NVIDIA Jetson Xavier NX (Developer Kit Version)
   * Jetpack 4.4 [L4T 32.4.3]
   * NV Power Mode: MODE_15W_6CORE - Type: 2
   * jetson_stats.service: active
 - Libraries:
   * CUDA: 10.2.89
   * cuDNN: 8.0.0.180
   * TensorRT: 7.1.3.0
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
   * VPI: 0.3.7
   * Vulkan: 1.2.70
nvidia@nvidia-desktop:~$ tegrastats
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [16%@1420,11%@1420,15%@1420,11%@1420,9%@1420,12%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50C VDD_IN 7829/5505 VDD_CPU_GPU_CV 1349/820 VDD_SOC 4001/2619
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [18%@1420,9%@1420,16%@1420,15%@1420,12%@1420,13%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49C GPU@49.5C PMIC@100C AUX@50C CPU@50C thermal@50C VDD_IN 7829/5515 VDD_CPU_GPU_CV 1349/822 VDD_SOC 4001/2625
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [19%@1420,6%@1420,15%@1420,11%@1420,11%@1420,12%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50C VDD_IN 7829/5524 VDD_CPU_GPU_CV 1349/824 VDD_SOC 4001/2630
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [16%@1420,10%@1420,17%@1420,14%@1420,8%@1420,14%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@50C PMIC@100C AUX@50C CPU@50C thermal@50C VDD_IN 7829/5533 VDD_CPU_GPU_CV 1349/826 VDD_SOC 4001/2636
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [14%@1420,9%@1420,12%@1420,13%@1420,10%@1420,12%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49C GPU@50C PMIC@100C AUX@50C CPU@50.5C thermal@50C VDD_IN 7829/5543 VDD_CPU_GPU_CV 1349/828 VDD_SOC 4001/2641
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [15%@1420,12%@1420,14%@1420,12%@1420,10%@1420,13%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50.15C VDD_IN 7829/5552 VDD_CPU_GPU_CV 1349/830 VDD_SOC 4001/2647
RAM 1738/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [14%@1420,9%@1420,12%@1420,16%@1420,12%@1420,13%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@50C PMIC@100C AUX@50C CPU@50.5C thermal@50C VDD_IN 7869/5561 VDD_CPU_GPU_CV 1349/832 VDD_SOC 4001/2652
RAM 1739/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [15%@1420,12%@1420,15%@1420,15%@1420,15%@1420,9%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@50C PMIC@100C AUX@50C CPU@50.5C thermal@50.15C VDD_IN 7869/5570 VDD_CPU_GPU_CV 1388/835 VDD_SOC 4001/2658
RAM 1739/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [16%@1420,12%@1420,16%@1420,13%@1420,13%@1420,11%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50C VDD_IN 7829/5579 VDD_CPU_GPU_CV 1349/837 VDD_SOC 4001/2663
RAM 1739/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [18%@1420,11%@1420,15%@1420,11%@1420,11%@1420,13%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50.15C VDD_IN 7829/5588 VDD_CPU_GPU_CV 1349/839 VDD_SOC 4001/2668
RAM 1739/7770MB (lfb 166x4MB) SWAP 0/3885MB (cached 0MB) CPU [16%@1420,8%@1420,15%@1420,14%@1420,11%@1420,13%@1420] EMC_FREQ 0% GR3D_FREQ 0% AO@49.5C GPU@49.5C PMIC@100C AUX@50C CPU@50.5C thermal@50.15C VDD_IN 7829/5597 VDD_CPU_GPU_CV 1349/841 VDD_SOC 4001/2674
...
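To compare runs more easily, the per-core loads in one tegrastats line can be averaged. The following is only a sketch: the function name tegrastats_mean_cpu is made up for illustration, and it assumes the "CPU [load%@freq,...]" layout seen in the output above.

```c
#include <stdlib.h>
#include <string.h>

/* Parses the "CPU [16%@1420,11%@1420,...]" field of one tegrastats line.
 * Returns the mean load in percent, or -1.0 if no core load was found. */
double tegrastats_mean_cpu(const char *line)
{
    const char *p = strstr(line, "CPU [");
    if (!p)
        return -1.0;
    p += strlen("CPU [");

    long sum = 0;
    int cores = 0;
    while (*p && *p != ']') {
        char *end;
        long load = strtol(p, &end, 10);
        if (end != p && *end == '%') {  /* non-numeric entries are skipped */
            sum += load;
            cores++;
        }
        p = strpbrk(p, ",]");           /* jump past "@<freq>" */
        if (!p || *p == ']')
            break;
        p++;
    }
    return cores ? (double)sum / cores : -1.0;
}
```

Applied to the first line above, the six loads 16, 11, 15, 11, 9 and 12 average to about 12.3%.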

jtop shows the following (screenshot omitted):

Besides, I have read through gst-nvvideo4linux2_src.tbz2 in public_sources.tbz2, but I still cannot find anything about the non-blocking mode, which appears as O_NONBLOCK in the following sample.

// /usr/src/jetson_multimedia_api/samples/01_video_encode/video_encode_main.cpp
...
    /* Create NvVideoEncoder object for blocking or non-blocking I/O mode. */
    if (ctx.blocking_mode)
    {
        cout << "Creating Encoder in blocking mode \n";
        ctx.enc = NvVideoEncoder::createVideoEncoder("enc0");
    }
    else
    {
        cout << "Creating Encoder in non-blocking mode \n";
        ctx.enc = NvVideoEncoder::createVideoEncoder("enc0", O_NONBLOCK);
    }
    TEST_ERROR(!ctx.enc, "Could not create encoder", cleanup);
...

Sorry, one more question: I recently found two device nodes that appear to be encoder-related. Could I use both of them at the same time to improve performance?

nvidia@nvidia-desktop:~$ ll /dev/nvhost-* | grep enc
crw-rw---- 1 root video 488, 25 Apr  8 17:47 /dev/nvhost-msenc
crw-rw---- 1 root video 488, 29 Apr  8 17:47 /dev/nvhost-nvenc1

Hi,
The v4l2 plugin is implemented in blocking mode, and changing it to non-blocking mode would need significant customization. You would have to add a polling thread, as in the 01_video_encode sample.
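As a rough illustration of what such a polling loop does in general: it sleeps in poll() until the descriptor is ready, instead of retrying a non-blocking call that keeps returning EAGAIN. The sketch below is plain POSIX and is not taken from gst-v4l2 or 01_video_encode; the helper names set_nonblocking and poll_then_read are hypothetical, and an ordinary file descriptor stands in for the encoder device.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Switches an already open descriptor to non-blocking mode. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Sleeps in poll() until fd is readable, then reads up to len bytes.
 * Returns the byte count (0 on EOF), or -1 with errno set; a timeout is
 * reported as -1 with errno == ETIMEDOUT.
 * With a V4L2 device, the read() would be a dequeue ioctl such as
 * VIDIOC_DQBUF instead (assumption: generic V4L2, not NVIDIA-specific). */
ssize_t poll_then_read(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: wait again */
            return -1;
        }
        if (rc == 0) {
            errno = ETIMEDOUT;      /* nothing became ready in time */
            return -1;
        }
        ssize_t n = read(fd, buf, len);
        if (n < 0 && (errno == EAGAIN || errno == EINTR))
            continue;               /* not ready after all: back to poll() */
        return n;
    }
}
```

The point of the design is that the thread is descheduled while waiting. Setting O_NONBLOCK alone, without such a wait, makes every call return immediately, and a caller that simply retries will keep a CPU core busy.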

Please try this command and share the prints:

gst-launch-1.0 -v \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 \
  videotestsrc is-live=1 ! video/x-raw,width=640,height=360,format=NV12 ! nvvidconv ! 'video/x-raw(memory:NVMM),width=1920,height=1080' ! nvv4l2h264enc maxperf-enable=1 ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0

We are able to see 8 threads achieving 30 fps. Please give it a try.

It seems unstable: sometimes >35 Hz, sometimes <25 Hz.

...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink4: last-message = rendered: 16, dropped: 0, current: 31.13, average: 31.13
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 15, dropped: 0, current: 28.21, average: 28.21
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink1: last-message = rendered: 14, dropped: 0, current: 26.95, average: 26.95
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink5: last-message = rendered: 12, dropped: 0, current: 23.64, average: 23.64
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink7: last-message = rendered: 14, dropped: 0, current: 27.17, average: 27.17
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink2: last-message = rendered: 18, dropped: 0, current: 35.69, average: 35.69
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink3: last-message = rendered: 18, dropped: 0, current: 34.37, average: 34.37
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink6: last-message = rendered: 17, dropped: 0, current: 32.26, average: 32.26
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink4: last-message = rendered: 38, dropped: 0, current: 41.65, average: 36.46
...
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink2: last-message = rendered: 1343, dropped: 0, current: 27.09, average: 30.45
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink7: last-message = rendered: 1345, dropped: 0, current: 27.79, average: 30.48
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 1346, dropped: 0, current: 29.74, average: 30.38
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink1: last-message = rendered: 1348, dropped: 0, current: 29.41, average: 30.36
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink6: last-message = rendered: 1351, dropped: 0, current: 27.17, average: 30.45
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink4: last-message = rendered: 1350, dropped: 0, current: 28.90, average: 30.36
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink3: last-message = rendered: 1348, dropped: 0, current: 27.88, average: 30.31
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink5: last-message = rendered: 1355, dropped: 0, current: 29.19, average: 30.40
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink7: last-message = rendered: 1361, dropped: 0, current: 30.98, average: 30.48
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink2: last-message = rendered: 1361, dropped: 0, current: 33.33, average: 30.48
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 1362, dropped: 0, current: 31.92, average: 30.40
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink1: last-message = rendered: 1365, dropped: 0, current: 33.62, average: 30.39
/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink6: last-message = rendered: 1369, dropped: 0, current: 35.01, average: 30.51
^C/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink4: last-message = rendered: 1368, dropped: 0, current: 33.92, average: 30.40
handling interrupt.
Interrupt: Stopping pipeline ...
Execution ended after 0:00:46.074013274
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
^C

Actually, we encode our camera images frame by frame, so we want each pass through the pipeline to be very fast, meaning a cost of <30 ms per frame even with 8 threads running in parallel.

The following change seems to work, and the log message changes to “Opening in O_NONBLOCKING MODE”:

--- a/gst-v4l2/v4l2_calls.c
+++ b/gst-v4l2/v4l2_calls.c
@@ -560,7 +560,7 @@ gst_v4l2_open (GstV4l2Object * v4l2object)
 
   /* open the device */
   v4l2object->video_fd =
-      open (v4l2object->videodev, O_RDWR /* | O_NONBLOCK */ );
+      open (v4l2object->videodev, O_RDWR | O_NONBLOCK );
 #endif
 
   if (!GST_V4L2_IS_OPEN (v4l2object))

I have also tried the following change, but the gst pipeline crashed:

--- a/gst-v4l2/gstv4l2object.h
+++ b/gst-v4l2/gstv4l2object.h
@@ -55,9 +55,9 @@ typedef struct _GstV4l2ObjectClassHelper GstV4l2ObjectClassHelper;
 
 #ifdef USE_V4L2_TARGET_NV
 #define  V4L2_DEVICE_BASENAME_NVDEC  "nvdec"
-#define  V4L2_DEVICE_BASENAME_NVENC  "msenc"
+#define  V4L2_DEVICE_BASENAME_NVENC  "nvenc1"
 #define  V4L2_DEVICE_PATH_NVDEC      "/dev/nvhost-nvdec"
-#define  V4L2_DEVICE_PATH_NVENC      "/dev/nvhost-msenc"
+#define  V4L2_DEVICE_PATH_NVENC      "/dev/nvhost-nvenc1"
 #endif
 
 /* max frame width/height */

We can switch to “O_NONBLOCK mode” this way.

However, encoding then takes longer: 47 ms to 72 ms per frame in our 8-camera setup.
What is worse, all 6 CPU cores reach ≈100%, whereas in “BLOCKING mode” they all stay at ≈30%.

Hi,
Since the log shows that all sources can achieve 30 fps, the issue seems to be that multithreading does not work efficiently in OpenCV. The CPU loading in the log seems low, which is not expected if you have multiple encoding threads.
