Video encoder performance when using appsink (omxh264enc)

Hello,

Update to below: I’m pretty sure that this is caused by using an appsink. The question still remains: why?

I have a GStreamer pipeline in my app to perform video encoding. I have the exact same app running on my TX2 and my Xavier.

Strangely, I’ve observed (and confirmed with profiling) that the encoding is not able to keep up with the frames on my Xavier, even though it is fine on my TX2.

I am still debugging to isolate exactly where the slowdown is, but I would like to know whether there are any “gotchas” on the Xavier: is it perhaps slower in some respect, or does something need to be done to exploit its performance properly?

My pipeline looks something like this:

RGBA frames ->
appsrc ->
nvvidconv -> I420 frames ->
omxh264enc ->
h264parse ->
matroskamux ->
appsink

Hi logidelic,

Please try to flash with below config file:

sudo ./flash jetson-xavier-maxn mmcblk0p1

Run below setting:

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

Please check the performance again, Thanks!

Thank you for the response. Unfortunately I haven’t seen any improvement. Here are some relevant details:

The app I am running is identical on the TX2 and the Xavier. I performed some detailed GStreamer profiling and have isolated the omxh264enc encoding portion of the pipeline to be the bottleneck. Here’s an interesting comparison between the speeds observed for this encoding step (just omxh264enc) on Xavier vs TX2:

TX2 (Frame res: 1280 x 720, Source framerate: 16fps):

Enc-frame-beg
Enc-frame-end: 5ms
Enc-frame-beg
Enc-frame-end: 5ms
Enc-frame-beg
Enc-frame-end: 4ms
...
Encoding frame avg - 1ms
Encoding frame avg - 2ms
...

Xavier (Frame res: 1280 x 720, Source framerate: 16fps):

Enc-frame-beg
Enc-frame-beg
Enc-frame-beg
Enc-frame-end: 62ms
Enc-frame-beg
Enc-frame-end: 18ms
Enc-frame-end: 27ms
Enc-frame-end: 29ms
Enc-frame-beg
Enc-frame-beg
Enc-frame-beg
Enc-frame-end: 95ms
Enc-frame-beg
Enc-frame-end: 65ms
Enc-frame-end: 71ms
Enc-frame-end: 712ms
...
Encoding frame avg - 160ms
Encoding frame avg - 187ms
...
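
For context, the per-frame time budget at the stated 16 fps source rate can be worked out directly. This is only arithmetic on the figures quoted above, and the helper name frame_budget_ms is made up for illustration. A per-frame latency above the budget does not by itself prove a throughput shortfall if the encoder keeps several frames in flight, which the back-to-back Enc-frame-beg lines on Xavier seem to suggest.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


budget = frame_budget_ms(16)  # source frame rate quoted in the post

# Averages copied from the profiling output above.
for board, avg_ms in [("TX2", 2), ("Xavier", 160), ("Xavier", 187)]:
    verdict = "within" if avg_ms <= budget else "over"
    print(f"{board}: {avg_ms} ms average is {verdict} the {budget} ms budget")
```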

As you suggested, I tried re-flashing with jetson-xavier-maxn.conf. However, just to be clear, I did not use your command line (it didn’t work). Instead I used:

sudo ./flash.sh  -r -k kernel-dtb jetson-xavier-maxn mmcblk0p1

I assume that this is ok? Finally, after executing nvpmodel and jetson_clocks, I see the following stats. Is this to be expected?

SOC family:tegra194  Machine:jetson-xavier
Online CPUs: 0-7
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu1: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu2: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu3: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu4: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu5: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu6: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
cpu7: Gonvernor=schedutil MinFreq=2265600 MaxFreq=2265600 CurrentFreq=2265600
GPU MinFreq=1377000000 MaxFreq=1377000000 CurrentFreq=1377000000
EMC MinFreq=204000000 MaxFreq=2133000000 CurrentFreq=2133000000 FreqOverride=1
Fan: speed=255

Any help would be greatly appreciated.

Here’s another performance comparison between the TX2 and the Xavier. The times in ms are the average times to get through the named pipeline element.

I also notice that matroskamux lags on Xavier… Then again my understanding of how that element works is limited, so maybe that’s normal given the lag in the encoder…?

TX2

gst:queue       - 0ms  - 11/s
gst:nvvidconv   - 15ms - 15/s
gst:capsfilter  - 0ms  - 15/s
gst:omxh264enc  - 12ms - 15/s
gst:capsfilter  - 0ms  - 15/s
gst:h264parse   - 0ms  - 15/s
gst:matroskamux - 0ms  - 30/s

Xavier

gst:queue       - 334ms - 11/s
gst:nvvidconv   - 2ms   - 11/s
gst:capsfilter  - 0ms   - 11/s
gst:omxh264enc  - 236ms - 10/s
gst:capsfilter  - 0ms   - 10/s
gst:h264parse   - 0ms   - 10/s
gst:matroskamux - 45ms  - 22/s
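
One rough way to read the Xavier table is through Little's law (average items in a stage = throughput x average time in the stage). The sketch below assumes a steady state, that the millisecond column is time spent inside each element, and that the "/s" column is buffers per second; those are my reading of the profiler output, not something the tool documents here. The helper name buffers_in_flight is made up for illustration.

```python
def buffers_in_flight(avg_latency_ms: float, rate_per_s: float) -> float:
    """Little's law: average buffers held by a stage at steady state."""
    return rate_per_s * (avg_latency_ms / 1000.0)


# (average ms, buffers per second) copied from the Xavier table above.
xavier = {
    "queue": (334, 11),
    "omxh264enc": (236, 10),
    "matroskamux": (45, 22),
}

for name, (ms, rate) in xavier.items():
    print(f"{name}: ~{buffers_in_flight(ms, rate):.1f} buffers in flight")
```

Under those assumptions, several buffers are waiting in the queue and in the encoder at any moment, which looks more like a backlog than a single slow step.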

Thanks again.

Hi,
We have a property for profiling performance:

MeasureEncoderLatency: Enable Measure Encoder latency Per Frame
                        flags: readable, writable, changeable only in NULL or READY state

Is the slowdown also observed when you set ‘MeasureEncoderLatency=true’?

Please also try enabling the max encoder clock:
https://devtalk.nvidia.com/default/topic/1032771/jetson-tx2/no-encoder-perfomance-improvement-before-after-jetson_clocks-sh/post/5255605/#5255605

Thank you DaneLLL. I will try these today.

Can you explain the implication of rebuilding the encoder with max encoder clock enabled? I guess there’s a reason why it’s not enabled by default or configurable via normal GStreamer element settings…

Thanks again.

Also, is there a new gstomx source package for R31? I don’t see one listed at https://developer.nvidia.com/embedded/downloads .

I tried downloading the most recent one, but I get the following error when building (even though I followed the instructions in the README and libgstnvegl-1.0.so is available). Any ideas about this?

checking for GST_EGL... configure: error: Package requirements (gstreamer-egl-1.0) were not met:
No package 'gstreamer-egl-1.0' found

$ ~/dev/gstomx1_src/gst-omx1$ env | grep GST
GST_EGL_LIBS=-lgstnvegl-1.0 -lEGL -lX11 -lgstreamer-1.0 -lgobject-2.0 -lglib-2.0


$ ~/dev/gstomx1_src/gst-omx1$ ls -l /usr/lib/aarch64-linux-gnu/libgstnvegl*
lrwxrwxrwx 1 root root    47 Mar  4 16:47 /usr/lib/aarch64-linux-gnu/libgstnvegl-1.0.so -> /usr/lib/aarch64-linux-gnu/libgstnvegl-1.0.so.0
-rwxr-xr-x 1 root root 12768 Feb 20 16:35 /usr/lib/aarch64-linux-gnu/libgstnvegl-1.0.so.0

I attempted to isolate the problem in a small self-contained app. Attached. (I’m actually not confident that it is exactly the same problem, but it is at least a slowdown in omxh264enc that I don’t understand.)

To build and run the test app:

mkdir gsttest/build
cd gsttest/build
cmake ..
make
./gsttest 'rtsp://some-valid-rtsp-url'

If I run the test as-is (uses an appsink), I get the following output from MeasureEncoderLatency:

KPI: omx: frameNumber= 183 encoder= 198 ms pts= 15450512433
KPI: omx: frameNumber= 184 encoder= 267 ms pts= 15519212028
KPI: omx: frameNumber= 185 encoder= 336 ms pts= 15588344960
KPI: omx: frameNumber= 186 encoder= 405 ms pts= 15646600118

If I set the DO_FILESINK constant at the top of gsttest.cpp to true, I get the following output from MeasureEncoderLatency:

KPI: omx: frameNumber= 882 encoder= 9 ms pts= 61082729897
KPI: omx: frameNumber= 883 encoder= 9 ms pts= 61150323615
KPI: omx: frameNumber= 884 encoder= 9 ms pts= 61219083872
KPI: omx: frameNumber= 885 encoder= 10 ms pts= 61281032891

How could having an appsink instead of a filesink affect the encoder performance? (If I comment out everything in onAppsinkNewSample I get the same bad result, so it is not due to file writing blocking or similar.)
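
The two MeasureEncoderLatency runs above can be compared numerically. The following is a small sketch that parses the quoted KPI lines and reports the mean latency, the frame-to-frame change in latency, and the PTS spacing. The function names are made up for illustration, and the PTS spacing assumes the pts field is in nanoseconds (GStreamer's usual unit), which the log itself does not state.

```python
import re

KPI_RE = re.compile(
    r"KPI: omx: frameNumber=\s*(\d+)\s+encoder=\s*(\d+)\s*ms\s+pts=\s*(\d+)"
)


def parse_kpi(lines):
    """Return (frame_number, encoder_ms, pts) tuples for matching lines."""
    samples = []
    for line in lines:
        match = KPI_RE.search(line)
        if match:
            samples.append(tuple(int(g) for g in match.groups()))
    return samples


def mean_latency_ms(samples):
    return sum(s[1] for s in samples) / len(samples)


def latency_growth_ms(samples):
    """Change in reported latency between consecutive frames."""
    return [b[1] - a[1] for a, b in zip(samples, samples[1:])]


def pts_spacing_ms(samples):
    """Spacing between consecutive PTS values, assuming nanoseconds."""
    return [(b[2] - a[2]) / 1e6 for a, b in zip(samples, samples[1:])]


APPSINK_LOG = """\
KPI: omx: frameNumber= 183 encoder= 198 ms pts= 15450512433
KPI: omx: frameNumber= 184 encoder= 267 ms pts= 15519212028
KPI: omx: frameNumber= 185 encoder= 336 ms pts= 15588344960
KPI: omx: frameNumber= 186 encoder= 405 ms pts= 15646600118
"""

FILESINK_LOG = """\
KPI: omx: frameNumber= 882 encoder= 9 ms pts= 61082729897
KPI: omx: frameNumber= 883 encoder= 9 ms pts= 61150323615
KPI: omx: frameNumber= 884 encoder= 9 ms pts= 61219083872
KPI: omx: frameNumber= 885 encoder= 10 ms pts= 61281032891
"""

for label, log in [("appsink", APPSINK_LOG), ("filesink", FILESINK_LOG)]:
    samples = parse_kpi(log.splitlines())
    print(label, mean_latency_ms(samples), latency_growth_ms(samples))
```

On the quoted appsink lines the mean is 301.5 ms and the latency grows by 69 ms on every frame, against a mean of 9.25 ms for the filesink run. That growth is close to the PTS spacing (roughly 58 to 69 ms under the nanosecond assumption), which may point to frames piling up somewhere rather than each encode taking longer. This is one reading of four samples, not a confirmed diagnosis.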

gsttest.tar.gz (2.22 KB)

Hi,
We ran the test below on r28.2.1/TX2 and r31.1/Xavier and see a much better result on Xavier.

[file source]
http://www.dvdloc8.com/clip.php?movieid=12954&clipid=1
[setting]

$ sudo nvpmodel -m 0
$ sudo ./jetson_clocks.sh

[command]

$ gst-launch-1.0 filesrc location= trailer.mp4 ! qtdemux ! h264parse ! omxh264dec ! omxh264enc ! fpsdisplaysink video-sink=fakesink sync=false text-overlay=false -v

[fps of TX2]

/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 1681, dropped: 0, current: 99.50, average: 158.82

[fps of Xavier]

/GstPipeline:pipeline0/GstFPSDisplaySink:fpsdisplaysink0: last-message = rendered: 1678, dropped: 0, current: 408.84, average: 417.99

It looks like the bottleneck is not in the encoder. Have you checked the operation below?
RGBA frames ->
appsrc ->
nvvidconv -> I420 frames ->
Maybe copying RGBA frames to gstbuffer is slower?

Hi Dane,

Very much appreciate the response. There is clearly something different with our scenarios. A few notes/questions:

  • Did you look at my test code that exhibits a slowdown? (Shows the same slowdown on TX2, but I'm not convinced it's the main issue.)
  • My test code doesn't do any RGB conversion.
  • As I mentioned, even commenting out all of onAppsinkNewSample produces the same problem, so it's not related to disk activity or copying gstreamer buffers (at least from the app's perspective).
  • In any case, even if it were something like that, would it really show up in ms numbers output by MeasureEncoderLatency? If so can you explain how? I thought that was simply measuring the encoder performance?

Thanks again.

Hi,
We checked your code, which runs with rtspsrc. We tried eliminating rtspsrc and using filesrc to run video playback instead. With local video playback, the HW decoder and encoder run fine and give good performance.

On r31.1, the GStreamer version is upgraded to 1.14.1, and it seems the sync mechanism has changed. We also see a certain issue when running DS3.0 with rtspsrc:
https://devtalk.nvidia.com/default/topic/1046315/deepstream-sdk-on-jetson/template-config-file-for-camera-and-rtsp/post/5314962/#5314962
So when using rtspsrc, could you try setting ‘sync=false’ on the sink and check again?
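
As a rough illustration of where the sync setting goes, the sketch below builds a pipeline description string of the kind accepted by gst_parse_launch, with the sink's sync property made explicit. It reuses the element chain from the first post rather than the rtspsrc test app, whose exact pipeline is not shown here. The element names "src" and "sink", the NVMM caps string, and the helper name are assumptions for illustration only.

```python
def encode_pipeline_description(sync: bool) -> str:
    """Build a gst_parse_launch-style description ending in an appsink.

    Only the string is built here; nothing is launched.
    """
    elements = [
        "appsrc name=src",                        # hypothetical element name
        "nvvidconv",
        "video/x-raw(memory:NVMM),format=I420",   # assumed caps for the encoder input
        "omxh264enc MeasureEncoderLatency=true",  # property quoted earlier in the thread
        "h264parse",
        "matroskamux",
        # sync=false asks the sink not to wait on the pipeline clock.
        f"appsink name=sink sync={'true' if sync else 'false'}",
    ]
    return " ! ".join(elements)


print(encode_pipeline_description(sync=False))
```

In an application that constructs the elements by hand, the equivalent would be setting the "sync" property on the appsink element before the pipeline starts.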

Besides, the source code is not at download center but at
https://developer.nvidia.com/embedded/dlc/l4t-sources-31-1-0
You may check MeasureEncoderLatency in gstomxvideoenc.c, inside gstomx1_src.tbz2.
It profiles the time between each input and output frame.
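
To make "time between each input and output frame" concrete, here is a toy model of such a probe. It is not the code in gstomxvideoenc.c. It assumes the output timestamp is taken at a point that downstream back-pressure can delay, and the 69 ms interval, 9 ms encode time, and 138 ms sink hold are illustrative values, not measurements. Under those assumptions, a sink that holds each buffer inflates the reported figure even though the encode time never changes.

```python
from collections import deque


class LatencyProbe:
    """Toy per-frame probe: exit time minus entry time, in FIFO order."""

    def __init__(self):
        self._pending = deque()

    def on_input(self, now_ms):
        self._pending.append(now_ms)

    def on_output(self, now_ms):
        return now_ms - self._pending.popleft()


def simulate(frames, interval_ms, encode_ms, sink_hold_ms):
    """Reported latency per frame when downstream holds each buffer.

    Assumes encode_ms <= interval_ms, so the encoder itself never falls behind.
    """
    probe = LatencyProbe()
    latencies = []
    sink_free_at = 0
    for i in range(frames):
        entered = i * interval_ms
        probe.on_input(entered)
        encoded = entered + encode_ms        # encoding always takes encode_ms
        left = max(encoded, sink_free_at)    # wait until downstream accepts it
        sink_free_at = left + sink_hold_ms
        latencies.append(probe.on_output(left))
    return latencies


print("sink never blocks:", simulate(4, 69, 9, 0))
print("sink holds 138 ms:", simulate(4, 69, 9, 138))
```

In this model the non-blocking case reports a flat 9 ms per frame, while the blocking case reports a figure that grows by one frame interval on every frame. Whether the real probe behaves this way depends on where the output timestamp is taken in the source.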

Thank you Dane. I had actually come to the same conclusion just before your post, but I very much appreciate the details.

I still do not understand how this can possibly affect the MeasureEncoderLatency numbers, but I will examine the source and try to wrap my head around it.

Thank you again!