Crash in libnvmmlite_video.so

Running TX2, 28.1.
Our system encodes H.264 video and sometimes crashes after about 20-60 minutes with the following error:

unhandled level 2 permission fault (11) at 0xffffc078327c1e, esr 0x9200000e
[ 5377.522165] pgd = ffffffc1cbbcf000
[ 5377.525604] [ffffc078327c1e] *pgd=0000000000000000, *pud=0000000000000000

[ 5377.533984] CPU: 3 PID: 2277 Comm: hawk Tainted: P        W  O    4.4.38-aeryon #1
[ 5377.541580] Hardware name: aeryon-tx2-flyer (DT)
[ 5377.546223] task: ffffffc1ea8d4b00 ti: ffffffc1b5a68000 task.ti: ffffffc1b5a68000
[ 5377.553737] PC is at 0x7f61f8ccd4
[ 5377.557075] LR is at 0x7f61f8d358
[ 5377.560424] pc : [<0000007f61f8ccd4>] lr : [<0000007f61f8d358>] pstate: 80000000
[ 5377.567817] sp : 0000007ee8dba490
[ 5377.571134] x29: 0000007ee8dba980 x28: 0000000000000001 
[ 5377.576464] x27: 0000000000000001 x26: 0000000000000001 
[ 5377.581801] x25: 0000000000000001 x24: 0000007f24014570 
[ 5377.587146] x23: 0000000000000000 x22: 0000000000000008 
[ 5377.592479] x21: 0000000000000002 x20: 0000000000000004 
[ 5377.597808] x19: 0000000000000008 x18: 0000000000000002 
[ 5377.603158] x17: 0000000000000004 x16: 0000000000000008 
[ 5377.608490] x15: 0000000000000000 x14: 0000000000000008 
[ 5377.613866] x13: 0000000000000002 x12: 0000000000000000 
[ 5377.619257] x11: 0000000000000003 x10: 0000007e98004300 
[ 5377.624621] x9 : 0000000000000001 x8 : 0000000000000003 
[ 5377.629975] x7 : 000000000000009f x6 : 0000000000000000 
[ 5377.635339] x5 : 0000000000000000 x4 : 0000000000000000 
[ 5377.640697] x3 : ffffffc078327c1e x2 : 000000000000000f 
[ 5377.646054] x1 : 0000007e98004000 x0 : 0000000000000000 

[ 5377.652932] Library at 0x7f61f8ccd4: 0x7f61f5f000 /usr/lib/tegra/libnvmmlite_video.so
[ 5377.660813] Library at 0x7f61f8d358: 0x7f61f5f000 /usr/lib/tegra/libnvmmlite_video.so
[ 5377.668656] vdso base = 0x7f9cc1c000
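
As a side note, the esr value in the first line of the dump can be decoded by hand. Below is a minimal sketch in Python, assuming the ARMv8-A ESR_ELx layout (exception class in bits 31:26, WnR in bit 6, fault status code in bits 5:0). The function name and the lookup tables are made up for illustration and are deliberately partial.

```python
# Partial decoder for an AArch64 ESR_ELx value reported on an abort.
# Assumption: ARMv8-A field layout; only a few encodings are listed.

EC_NAMES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort from the same exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort from the same exception level",
}

# Upper four bits of the fault status code; the low two bits are the level.
FAULT_KINDS = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    ec = (esr >> 26) & 0x3F       # exception class, bits 31:26
    write = bool((esr >> 6) & 1)  # WnR, bit 6 (meaningful for data aborts)
    fsc = esr & 0x3F              # fault status code, bits 5:0
    kind = FAULT_KINDS.get(fsc >> 2)
    return {
        "class": EC_NAMES.get(ec, "other (0x%02x)" % ec),
        "fault": kind if kind else "other (0x%02x)" % fsc,
        "level": (fsc & 0x3) if kind else None,
        "access": "write" if write else "read",
    }


if __name__ == "__main__":
    print(decode_esr(0x9200000E))
```

For 0x9200000e this gives a data abort from a lower exception level, a level 2 permission fault, on a read, which agrees with the kernel's own "level 2 permission fault" text. If that reading is right, this is an MMU page-permission fault rather than a file or device permission error. The value in x3 is also in the same range as the kernel pointers printed in the dump (task, ti), which may point to a stale or corrupted pointer, but that is only a guess.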

Backtrace in GDB gives the following:

[Current thread is 1 (Thread 0x7ee8dbb190 (LWP 2277))]
(gdb) bt
#0  0x0000007f61f8ccd4 in ?? () from /usr/lib/tegra/libnvmmlite_video.so
#1  0x0000007f61f88368 in ?? () from /usr/lib/tegra/libnvmmlite_video.so
#2  0x0000007f61f7f59c in ?? () from /usr/lib/tegra/libnvmmlite_video.so
#3  0x0000007f8b7ba6dc in ?? () from /usr/lib/tegra/libnvos.so
#4  0x0000007f8010de78 in start_thread (arg=0x7f80131000 <__pthread_keys+15712>) at pthread_create.c:331
#5  0x0000007f80095cd0 in ?? () from /lib/libc.so.6
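
Since GDB shows no symbols, it may help to report the frames as offsets into the library instead of absolute addresses, which change from run to run. A minimal sketch in Python, assuming the base 0x7f61f5f000 from the "Library at" lines above is where the start of libnvmmlite_video.so was mapped; the helper name is made up for illustration.

```python
# Convert absolute addresses from the crash into offsets relative to the
# library mapping. Assumption: LIB_BASE is the mapping start of
# libnvmmlite_video.so, taken from the "Library at" lines of the kernel log.

LIB_BASE = 0x7F61F5F000

ADDRESSES = {
    "pc / frame #0": 0x7F61F8CCD4,
    "lr": 0x7F61F8D358,
    "frame #1": 0x7F61F88368,
    "frame #2": 0x7F61F7F59C,
}


def library_offset(address, base=LIB_BASE):
    """Return the offset of address from the library base."""
    if address < base:
        raise ValueError("address is below the library base")
    return address - base


if __name__ == "__main__":
    for name, address in ADDRESSES.items():
        print("%-14s +0x%x" % (name, library_offset(address)))
```

The crashing instruction comes out at +0x2dcd4. Offsets like this can be looked up with tools such as addr2line or objdump against the same library build, although without symbols in the shipped library they are probably most useful to NVIDIA.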

Any thoughts as to what this is indicating?

Thanks

Another error I saw on a different run was the following.

VENC: VideoEncInputProcessing: 4376:  VideoEncFeedImage failed. Input buffer 3 sent
VENC: NvMMLiteVideoEncDoWork: 5001: BlockSide error 0x4
Event_BlockError from 3BlockAvcEnc : Error code - 4
Sending error event from 3BlockAvcEnc
1:11:50.982698272 22572   0x7ef0655070 ERROR                    omx gstomx.c:496:EventHandler:<e14_omxh264enc> encoder got error: Bad parameter (0x80001005)
1:11:50.982779456 22572   0x7f3c001e30 ERROR                    omx gstomx.c:268:gst_omx_component_handle_messages:<e14_omxh264enc> encoder got error: Bad parameter (0x80001005)
1:11:50.982921792 22572   0x7f3c001e30 ERROR                    omx gstomx.c:1285:gst_omx_port_acquire_buffer:<e14_omxh264enc> Component encoder is in error state: Bad parameter
1:11:50.984580416 22572   0x7f3c001e30 WARN             omxvideoenc gstomxvideoenc.c:1331:gst_omx_video_enc_loop:<e14_omxh264enc> error: OpenMAX component in error state Bad parameter (0x80001005)
Message error 
Error source: e14_omxh264enc 
Error Msg: GStreamer encountered a general supporting library error. 
Debug Msg: /dvs/git/dirty/git-master_linux/external/gstreamer/gst-omx/omx/gstomxvideoenc.c(1331): gst_omx_video_enc_loop (): /GstPipeline:pipeline0/GstOMXH264Enc-omxh264enc:e14_omxh264enc:
OpenMAX component in error state Bad parameter (0x80001005) 
Quiting main loop, trying to restart...
1:11:51.080284416 22572   0x7f88613d90 ERROR       aeryonoverlaytee src/gstaeryonoverlaytee.c:1254:gst_aeryon_overlay_tee_handle_data:<e12_e5_aeryonoverlaytee> received error error
1:11:51.213150080 22572   0x7f88613e80 WARN                 basesrc gstbasesrc.c:2948:gst_base_src_loop:<e0_ptpsrc> error: Internal data flow error.
1:11:51.213186848 22572   0x7f88613e80 WARN                 basesrc gstbasesrc.c:2948:gst_base_src_loop:<e0_ptpsrc> error: streaming task paused, reason error (-5)
1:11:51.213440768 22572   0x7f88613e80 WARN                   queue gstqueue.c:986:gst_queue_handle_sink_event:<queue0> error: Internal data flow error.
Message error 
1:11:51.213478048 22572   0x7f88613e80 WARN                   queue gstqueue.c:986:gst_queue_handle_sink_event:<queue0> error: streaming task paused, reason error (-5)
Error source: e0_ptpsrc 
Error Msg: Internal data flow error. 
Debug Msg: gstbasesrc.c(2948): gst_base_src_loop (): /GstPipeline:pipeline0/GstPtpSrc:e0_ptpsrc:
streaming task paused, reason error (-5) 
Quiting main loop, trying to restart...
Message error 
Error source: queue0 
Error Msg: Internal data flow error. 
Debug Msg: gstqueue.c(986): gst_queue_handle_sink_event (): /GstPipeline:pipeline0/GstQueue:queue0:
streaming task paused, reason error (-5) 
Quiting main loop, trying to restart...
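
In a log like this, most of the later messages look like follow-on errors, so it can help to pick out the earliest timestamped ERROR or WARN entry. A minimal sketch in Python, assuming the usual GStreamer debug line layout (timestamp, PID, thread, level, category, location, message); the regex and function name are made up for illustration and are not part of GStreamer.

```python
import re

# Matches the timestamped part of a GStreamer debug line. search() is used
# so that unrelated text printed earlier on the same line is tolerated.
GST_LINE = re.compile(
    r"(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})\.(?P<ns>\d{9})\s+"
    r"(?P<pid>\d+)\s+(?P<thread>0x[0-9a-fA-F]+)\s+"
    r"(?P<level>ERROR|WARN)\s+"
    r"(?P<category>\S+)\s+"
    r"(?P<location>\S+)\s+(?P<message>.*)"
)


def earliest_entry(lines):
    """Return the ERROR/WARN entry with the smallest timestamp, or None."""
    best = None
    for line in lines:
        match = GST_LINE.search(line)
        if match is None:
            continue
        seconds = (int(match["h"]) * 60 + int(match["m"])) * 60 + int(match["s"])
        entry = {
            "time_ns": seconds * 10**9 + int(match["ns"]),
            "level": match["level"],
            "category": match["category"],
            "location": match["location"],
            "message": match["message"].strip(),
        }
        if best is None or entry["time_ns"] < best["time_ns"]:
            best = entry
    return best
```

Run over the output above, the earliest entry is the omx "encoder got error: Bad parameter (0x80001005)" line, which comes after the VENC "VideoEncFeedImage failed" message. 0x80001005 is OMX_ErrorBadParameter in the OpenMAX IL headers. The "Internal data flow error" messages from e0_ptpsrc and queue0 appear about 0.2 s later and are probably a consequence rather than the cause.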

Please share steps to reproduce the issue. Also have you tried r28.2?

Hi phabsch,

Have you identified the cause and resolved the problem?
Any update? Could you try R28.2 and share the steps if the issue can still be reproduced?

Thanks

Hi kayccc,
No, I haven’t resolved the issue yet. I’ve tried simplifying the flow to strip out any custom code, but I can’t reproduce it reliably.

At the moment there are other reasons we can’t move to 28.2, but we will once we get a chance to address them.

Are you logged in as a user who is in the “video” group? I see a permission fault. Also, are you logged in directly to the Jetson, or are you using ssh?
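
A quick way to check the group question from inside the process itself is sketched below in Python. It is only a sketch: the function name is made up, and it assumes the group is literally named "video" on the target system.

```python
import grp
import os


def process_in_group(group_name):
    """Return True if the current process has group_name as its effective
    or supplementary group. Returns False if the group does not exist."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    print("uid:", os.geteuid())
    print("in group video:", process_in_group("video"))
```

Note that a process running as root normally bypasses ordinary file permission checks, so group membership matters mainly for non-root users.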

This is run under root context.
Also, these errors can happen at any time. I mostly see them around 50 minutes after I start the video encoder.

Is your file system filling up? You can check with “df -H”.
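
If it is easier to log this from the application while the encoder runs, something like the following Python sketch reports the same kind of numbers in powers of 1000, as "df -H" does. The helper names and the choice of "/" as the path are assumptions for illustration.

```python
import shutil


def format_si(num_bytes):
    """Format a byte count using powers of 1000."""
    units = ["B", "kB", "MB", "GB", "TB", "PB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return "%.1f %s" % (value, unit)
        value /= 1000


def usage_report(path="/"):
    """Return a one-line summary of disk usage for the given path."""
    usage = shutil.disk_usage(path)
    return "%s: %s used of %s, %s free" % (
        path,
        format_si(usage.used),
        format_si(usage.total),
        format_si(usage.free),
    )


if __name__ == "__main__":
    print(usage_report("/"))
```

Calling usage_report() every few minutes and writing the result next to the encoder log would show whether free space is shrinking in the run-up to a failure.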

The following was brought to my attention: http://nvidia.custhelp.com/app/answers/detail/a_id/4632/~/security-bulletin%3A-nvidia-shield-tablet-security-update-for-a-media-server
It mentions specifically a use-after-free in libnvmmlite_video.so being fixed in a software update. The update came out around the end of March 2018.
SHIELD tablet uses TK1, but it’s possible that the codebase is shared.

Could this be a contributing factor?
Is the issue addressable from user space? Is the freed memory that is causing the problem allocated from user space?

Thanks.

Hi phabsch,
The code bases of L4T and Android are different, so it shouldn’t be the same issue. We still need steps from you so that we can reproduce it. Do you use GStreamer or MMAPIs?

Understood that the code base is different. I was just asking whether there is anything in common that may cause problems. I’ll be using GStreamer. I haven’t had time to look into recreating this lately, though. I will post more details once I make some progress.