High CPU usage in omxh264dec and low performance as compared to nv_omx_h264dec

Hi,

I am getting high CPU usage when using omxh264dec as compared to nv_omx_h264dec.

The following post by another user describes something similar:
https://devtalk.nvidia.com/default/topic/965272/jetson-tk1/omxh264dec-vs-nv_omx_h264dec/

In that post, Shane clarified that the CPU usage difference must be due to the different versions of the gstreamer framework, but ideally the newer version should perform better than the older one.

When using omxh264dec, tegrastats shows 0% against GR3D.
[PS: I have enabled performance mode for the CPU and set the GPU to its highest clock rate. I am also running tegrastats as root.]

When using gstreamer-1.0, I am getting 50 FPS @ 96% CPU [h264 (Main) --> NV12]
Command : gst-launch-1.0 filesrc location= ! avidemux ! queue ! h264parse ! omxh264dec ! fakesink -e

But when I was using gstreamer-0.10, CPU usage was much lower. So what is the difference between gstreamer-0.10 and gstreamer-1.0? Ideally, the new version should improve or optimize things.

It is also possible that my interpretation of results/stats is wrong. Please correct me if this is the case.

My last question is whether the GPU does the H264 video decoding, or whether there is other dedicated hardware for video decoding. In tegrastats, GR3D shows 0% in both cases, but the stats against EMC, AVP and VDE change.

Hi techbull,
Please check the execution time difference between gstreamer-0.10 and 1.0. With 1.0 linked to fakesink (without rendering), it decodes at maximum performance, so you will see a short execution time and high CPU usage.

The comment above is wrong. The difference comes from the gstreamer frameworks, as explained in
https://devtalk.nvidia.com/default/topic/965272/jetson-tk1/omxh264dec-vs-nv_omx_h264dec/post/4987085/#4987085

The difference is not much in video rendering:

Hi DaneLL,

I will try both the commands and let you know about the outcome.

In the meantime, can you please clarify the following:

Does the GPU do the H264 video decoding, or is there other dedicated hardware for video decoding? tegrastats continuously shows 0% against GR3D (apart from some random spikes) in both cases (omxh264dec and nv_omx_h264dec), but the stats against EMC, AVP and VDE change.

Hi techbull,
The HW engine is VDE(Video DEcode). It is an independent engine.

Sorry DaneLLL for late reply.

I tried both commands and I am also not seeing much difference.

But this is probably because processing is constrained by the display FPS.
Did you try it with fakesink? This will show unconstrained stats.

Also, gstreamer-0.10 takes much less time than gstreamer-1.0 (4 sec vs 17 sec for a 1MP video containing 2000 frames).
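
To put the two run times in the same unit as the FPS numbers quoted elsewhere in this thread, here is a minimal Python sketch. It only uses the frame count and wall-clock times stated above; the function name `decode_fps` is made up for illustration.

```python
def decode_fps(num_frames: int, seconds: float) -> float:
    """Average decode throughput in frames per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_frames / seconds


FRAMES = 2000                      # frame count quoted in the post
fps_010 = decode_fps(FRAMES, 4.0)  # gstreamer-0.10 run (4 sec)
fps_10 = decode_fps(FRAMES, 17.0)  # gstreamer-1.0 run (17 sec)

print(f"gstreamer-0.10: {fps_010:.1f} fps")
print(f"gstreamer-1.0:  {fps_10:.1f} fps")
print(f"ratio: {fps_010 / fps_10:.2f}x")
```

These are averages over the whole run, so they include pipeline start-up and teardown time.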

Following is the output of tegrastats:
gstreamer-0.10 :
[When nothing happening]
RAM 571/1925MB (lfb 199x4MB) cpu [4%,0%,0%,1%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0
RAM 571/1925MB (lfb 199x4MB) cpu [2%,3%,1%,2%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0
RAM 571/1925MB (lfb 199x4MB) cpu [5%,0%,0%,0%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0

gst-launch-0.10 -e filesrc location=/tmp/H264.avi ! avidemux ! queue ! h264parse ! nv_omx_h264dec ! fakesink -e
[When decoding video]
RAM 576/1925MB (lfb 199x4MB) cpu [11%,5%,6%,4%]@1956 EMC 21%@924 AVP 6%@300 VDE 480 GR3D 0%@852 EDP limit 0
RAM 577/1925MB (lfb 199x4MB) cpu [7%,6%,6%,5%]@1956 EMC 20%@924 AVP 8%@300 VDE 480 GR3D 0%@852 EDP limit 0
RAM 577/1925MB (lfb 199x4MB) cpu [7%,3%,3%,9%]@1956 EMC 20%@924 AVP 8%@300 VDE 480 GR3D 0%@852 EDP limit 0

gstreamer-1.0 :
[When nothing happening]
RAM 573/1925MB (lfb 198x4MB) cpu [3%,1%,0%,0%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0
RAM 573/1925MB (lfb 198x4MB) cpu [4%,3%,0%,2%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0
RAM 573/1925MB (lfb 198x4MB) cpu [4%,1%,0%,0%]@1956 EMC 5%@924 AVP 0%@204 VDE 120 GR3D 0%@852 EDP limit 0

gst-launch-1.0 -e filesrc location=/tmp/H264.avi ! avidemux ! queue ! h264parse ! omxh264dec ! fakesink -e
[When decoding video]
RAM 583/1925MB (lfb 198x4MB) cpu [13%,31%,6%,48%]@1956 EMC 14%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0
RAM 584/1925MB (lfb 198x4MB) cpu [4%,12%,4%,74%]@1956 EMC 14%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0
RAM 583/1925MB (lfb 198x4MB) cpu [28%,7%,35%,29%]@1956 EMC 15%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0
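
The per-core CPU figures in the tegrastats samples above can be averaged to compare the two decode runs with a single number. The following is a rough Python sketch; it assumes the `cpu [a%,b%,c%,d%]@freq` layout shown in this post (other tegrastats versions print the field differently), and the helper names are made up for illustration.

```python
import re
from typing import List

# Matches the "cpu [11%,5%,6%,4%]" field as printed in this thread.
CPU_FIELD = re.compile(r"cpu \[([^\]]*)\]")


def cpu_loads(line: str) -> List[int]:
    """Return the per-core load percentages from one tegrastats line."""
    match = CPU_FIELD.search(line)
    if match is None:
        raise ValueError("no cpu field found in: " + line)
    return [int(value.rstrip("%")) for value in match.group(1).split(",")]


def mean_cpu(lines: List[str]) -> float:
    """Average load over all cores and all samples."""
    loads = [load for line in lines for load in cpu_loads(line)]
    return sum(loads) / len(loads)


# "When decoding video" samples copied from the post.
DECODE_010 = [
    "RAM 576/1925MB (lfb 199x4MB) cpu [11%,5%,6%,4%]@1956 EMC 21%@924 AVP 6%@300 VDE 480 GR3D 0%@852 EDP limit 0",
    "RAM 577/1925MB (lfb 199x4MB) cpu [7%,6%,6%,5%]@1956 EMC 20%@924 AVP 8%@300 VDE 480 GR3D 0%@852 EDP limit 0",
    "RAM 577/1925MB (lfb 199x4MB) cpu [7%,3%,3%,9%]@1956 EMC 20%@924 AVP 8%@300 VDE 480 GR3D 0%@852 EDP limit 0",
]
DECODE_10 = [
    "RAM 583/1925MB (lfb 198x4MB) cpu [13%,31%,6%,48%]@1956 EMC 14%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0",
    "RAM 584/1925MB (lfb 198x4MB) cpu [4%,12%,4%,74%]@1956 EMC 14%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0",
    "RAM 583/1925MB (lfb 198x4MB) cpu [28%,7%,35%,29%]@1956 EMC 15%@924 AVP 2%@300 VDE 480 GR3D 0%@852 EDP limit 0",
]

print("gstreamer-0.10 mean CPU %:", mean_cpu(DECODE_010))
print("gstreamer-1.0  mean CPU %:", mean_cpu(DECODE_10))
```

Three samples per run is a very small window, so the averages are only indicative.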

Hi techbull,
How does it go if you run
gst-launch-1.0 filesrc location= videoplayback1.mp4 ! qtdemux ! h264parse ! omxh264dec ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420' ! fakesink & sudo ./tegrastats

Hi DaneLLL,

It doesn’t run and fails with the following error message:

GStreamer-CRITICAL **: gst_query_set_nth_allocation_pool: assertion 'index < array->len' failed
**
ERROR:/dvs/git/dirty/git-master_linux/multimedia/nvgstreamer/gst-nvvidconv-1.0/gstnvvconv.c:172:gst_nv_filter_memory_allocator_alloc_dummy: code should not be reached
Aborted

omxh264dec gives NV12 as output, so by using nvvidconv with (memory:NVMM), we are just telling the pipeline to convert NV12 to I420 within hardware memory rather than RAM.

Isn’t this some extra work which will be done in addition to decoding (and will be done after we get NV12 output by omxh264dec)? If it is some additional work, why should it improve performance?

Hi techbull,
I can run it on r21.5. Are you on this revision?

omxh264dec gives output in video/x-raw(memory:NVMM). When linking omxh264dec ! fakesink, there is a CPU buffer copy doing video/x-raw(memory:NVMM) -> video/x-raw. When linking omxh264dec ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420' ! fakesink, there is a HW buffer copy doing video/x-raw(memory:NVMM) -> video/x-raw(memory:NVMM).
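
The two linkings described above differ only in the nvvidconv plus NVMM caps stage. As a sketch, the Python helper below builds both `gst-launch-1.0` argument lists so they can be timed from a script. Passing the caps as a single argv element avoids shell quoting problems with the parentheses. The function names and the file name are placeholders made up for illustration; the elements and caps string are the ones used in this thread.

```python
import shlex
import subprocess
from typing import List

# Caps string from the thread; keeps the converted frames in NVMM memory.
NVMM_I420_CAPS = "video/x-raw(memory:NVMM),format=I420"


def decode_pipeline(location: str, keep_in_nvmm: bool) -> List[str]:
    """Build the gst-launch-1.0 argv for a decode-only test run."""
    argv = ["gst-launch-1.0", "filesrc", f"location={location}", "!",
            "qtdemux", "!", "h264parse", "!", "omxh264dec", "!"]
    if keep_in_nvmm:
        # HW copy: NVMM -> NVMM, as in the second linking above.
        argv += ["nvvidconv", "!", NVMM_I420_CAPS, "!"]
    # Without nvvidconv, the decoder output is copied to a CPU buffer.
    argv.append("fakesink")
    return argv


def run_pipeline(argv: List[str]) -> int:
    """Run the pipeline and return its exit code (needs a Jetson board)."""
    return subprocess.run(argv, check=False).returncode


# Hypothetical file name, used only to print the resulting commands.
for nvmm in (False, True):
    print(shlex.join(decode_pipeline("videoplayback1.mp4", nvmm)))
```

`run_pipeline` is intentionally not called here, since it only makes sense on a board where these elements are installed.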

How to find that?

gst-launch-1.0 --version gives me :

gst-launch-1.0 version 1.2.4
GStreamer 1.2.4

Hi,

I am running r21.3.
Let me check the upgrade method.

ubuntu@tegra-ubuntu:~$ head -1 /etc/nv_tegra_release

R21 (release), REVISION: 5.0, GCID: 7273100, BOARD: ardbeg, EABI: hard, DATE: Wed Jun 8 04:19:09 UTC 2016
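
For scripts that need to check the release before choosing a pipeline, the line above can be parsed as in this small Python sketch. It assumes the `R<major> (release), REVISION: <x.y>` layout shown here and the naming used in this thread (R21 with REVISION 5.0 is called r21.5). The function name is made up for illustration.

```python
import re

# Layout of the first line of /etc/nv_tegra_release as shown in this thread.
RELEASE_RE = re.compile(r"R(\d+) \(release\), REVISION: (\d+(?:\.\d+)*)")


def l4t_version(first_line: str) -> str:
    """Turn 'R21 (release), REVISION: 5.0, ...' into 'r21.5'."""
    match = RELEASE_RE.search(first_line)
    if match is None:
        raise ValueError("unrecognised nv_tegra_release line")
    major, revision = match.group(1), match.group(2)
    # Assumption: a trailing '.0' is dropped, so REVISION 5.0 reads as r21.5.
    if revision.endswith(".0"):
        revision = revision[:-2]
    return f"r{major}.{revision}"


SAMPLE = ("R21 (release), REVISION: 5.0, GCID: 7273100, BOARD: ardbeg, "
          "EABI: hard, DATE: Wed Jun 8 04:19:09 UTC 2016")
print(l4t_version(SAMPLE))
```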

I am updating to r21.5. Then I will get back to you and share the results.

Hi DaneLLL,

Yes, the command you suggested later takes roughly the same time as nv_omx_h264dec.

gst-launch-1.0 filesrc location= videoplayback1.mp4 ! qtdemux ! h264parse ! omxh264dec ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420' ! fakesink

But I have a few questions:

  1. In both cases, the output goes to fakesink. Then why do omxh264dec and nv_omx_h264dec behave differently, i.e. one copies to a HW buffer whereas the other copies to a CPU buffer?
  2. How can I find out that the decoding happens in NVMM and the result is then copied to a CPU buffer? Checking `gst-inspect-1.0 omxh264dec` only shows video/x-raw as its src capability. It doesn't tell anything about the memory (HW vs CPU).
  3. As a user, one may want to get this decoded and converted data (currently in a HW buffer) into one's own output buffer (probably a user-allocated uchar buffer). This should be possible via appsink using the following command:
    gst-launch-1.0 -e filesrc location=/tmp/H264.avi ! avidemux ! queue ! h264parse ! omxh264dec ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420' ! appsink

    But this command is not working.

  4. What is the most efficient way of decoding an H264 stream and getting the decoded output in user memory?

Hi techbull,
For post-processing on TK1, you can get a CPU buffer:
https://devtalk.nvidia.com/default/topic/1011376/jetson-tx1/gstreamer-decode-live-video-stream-with-the-delay-difference-between-gst-launch-1-0-command-and-appsink-callback/post/5160929/#5160929

or EGLImages:
https://devtalk.nvidia.com/default/topic/1006870/jetson-tk1/processing-video-with-eglimages-/post/5140825/#5140825

The CPU buffer case requires one NVMM -> CPU memcpy for each buffer. The EGLImage case needs no buffer copy.
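
To get a feel for what one memcpy per buffer costs, the Python sketch below estimates the amount of data copied per second for I420 frames. It assumes tightly packed planes (no stride padding) and an illustrative 1920x1080 stream at 30 fps; neither the frame rate nor the function names come from this thread.

```python
def i420_frame_bytes(width: int, height: int) -> int:
    """Size of one tightly packed I420 (YUV 4:2:0, 8-bit) frame in bytes."""
    # One full-resolution luma plane plus two chroma planes that are
    # subsampled by 2 in each direction (rounded up for odd sizes).
    chroma_w = (width + 1) // 2
    chroma_h = (height + 1) // 2
    return width * height + 2 * chroma_w * chroma_h


def copy_bandwidth_mb_s(width: int, height: int, fps: float) -> float:
    """Data copied per second (in MB) with one memcpy per decoded frame."""
    return i420_frame_bytes(width, height) * fps / 1e6


# Illustrative values only: a 1080p stream at an assumed 30 fps.
print(i420_frame_bytes(1920, 1080), "bytes per frame")
print(copy_bandwidth_mb_s(1920, 1080, 30.0), "MB/s copied")
```

Real buffers may be larger because of row alignment, so this is a lower bound on the copy traffic.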

Hi DaneLLL,

I tested this video (http://www.dvdloc8.com/clip.php?movieid=12954&clipid=1), which is approx. 1.5 MP (1920x816). I am also getting 23 FPS, but on a 2 MP video I am getting only 7 FPS.
Can you please validate the results for a 2 MP video?

Output is as follows (the result is the same on both TK1 and TX1):

Inside NvxLiteH264DecoderLowLatencyInitNvxLiteH264DecoderLowLatencyInit set DPB and MjstreamingInside NvxLiteH265DecoderLowLatencyInitNvxLiteH265DecoderLowLatencyInit set DPB and MjstreamingNvMMLiteOpen : Block : BlockType = 261 
TVMR: NvMMLiteTVMRDecBlockOpen: 7580: NvMMLiteBlockOpen 
NvMMLiteBlockCreate : Block : BlockType = 261 
TVMR: cbBeginSequence: 1166: BeginSequence  1920x1088, bVPR = 0, fFrameRate = 15.000000
TVMR: LowCorner Frequency = 180000 
TVMR: cbBeginSequence: 1545: DecodeBuffers = 5, pnvsi->eCodec = 4, codec = 0 
TVMR: cbBeginSequence: 1606: Display Resolution : (1920x1080) 
TVMR: cbBeginSequence: 1607: Display Aspect Ratio : (1920x1080) 
TVMR: cbBeginSequence: 1649: ColorFormat : 5 
TVMR: cbBeginSequence:1660 ColorSpace = NvColorSpace_YCbCr709
TVMR: cbBeginSequence: 1790: SurfaceLayout = 3
TVMR: cbBeginSequence: 1868: NumOfSurfaces = 9, InteraceStream = 0, InterlaceEnabled = 0, bSecure = 0, MVC = 0 Semiplanar = 1, bReinit = 1, BitDepthForSurface = 8 LumaBitDepth = 8, ChromaBitDepth = 8, ChromaFormat = 5
Allocating new output: 1920x1088 (x 11), ThumbnailMode = 0
TVMR: FrameRate = 7 
TVMR: NVDEC LowCorner Freq = (42000 * 1024) 
TVMR: FrameRate = 7.500002 
TVMR: FrameRate = 7.500002 
TVMR: FrameRate = 7.500002

It looks like the 2MP video itself is 7 fps. What frame rate does MediaInfo (https://mediaarea.net/en/MediaInfo) show?

Why should it get constrained by the FPS? Shouldn’t it decode as many frames as it can, since the source is a video file?

The decoder needs free output buffers to decode the next frames. Output buffers are returned to the decoder only after they have been rendered. If the video file is 7 fps, output buffers are rendered at 7 fps, which makes decoding run at 7 fps rather than as fast as possible.
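
The buffer recycling described above can be illustrated with a toy Python model. This is only a sketch of the idea, not how the NVIDIA decoder is implemented: the pool size, the decode time per frame and all names are assumptions chosen for illustration. With `sync=True` the sink holds each buffer until its presentation time; with `sync=False` it releases the buffer as soon as it arrives.

```python
from typing import List


def total_decode_time(num_frames: int, decode_time: float,
                      frame_interval: float, pool_size: int,
                      sync: bool) -> float:
    """Time at which the last frame finishes decoding (toy model)."""
    release_times: List[float] = []  # when each frame's buffer is returned
    now = 0.0                        # decoder clock
    for i in range(num_frames):
        if i >= pool_size:
            # All buffers are in flight: wait for the oldest one to return.
            now = max(now, release_times[i - pool_size])
        now += decode_time
        if sync:
            # The sink keeps the buffer until the frame is due on screen.
            release_times.append(max(now, i * frame_interval))
        else:
            # The sink drops the buffer right away.
            release_times.append(now)
    return now


# Assumed numbers: 10 ms per frame, 7.5 fps file, pool of 5 buffers.
for sync in (True, False):
    t = total_decode_time(100, 0.01, 1 / 7.5, 5, sync)
    print(f"sync={sync}: {t:.2f} s, {100 / t:.1f} fps")
```

In this model the synced run settles at roughly the file's frame rate once the pool is exhausted, while the unsynced run is limited only by the decode time.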

Thanks Dane.

I checked another 2MP video and I am able to get 20 FPS (= FPS of the video).

But I am not rendering the video right now.
What can we do if decoding is the only goal?

I am running the following command:

gst-launch-1.0 filesrc location=~/Videos/2MP_h264.avi ! avidemux ! queue ! h264parse ! omxh264dec ! videoconvert ! capsfilter caps=video/x-raw,format=RGB ! appsink name=sink

Please refer to https://devtalk.nvidia.com/default/topic/1011376/jetson-tx1/gstreamer-decode-live-video-stream-with-the-delay-difference-between-gst-launch-1-0-command-and-appsink-callback/post/5160929/#5160929

And run with "appsink name=mysink sync=false "