[TX2] program crashes with multimedia API during stability test

During stability testing of our video analysis product on the TX2, we hit several crashes that originate in the decoder.
Please confirm whether this is a bug in JetPack or a misuse of the multimedia API.

[Reproduce]
Input: 12x 1080p@25FPS H264 RTSP

  1. Start the video analyzer program.
  2. Run it for 6~7 days (no reconnects or errors appear in the log during this time).
  3. The program crashes in libnvmmlite_video.so with the backtrace below.

[Test Software Environment]
JetPack 3.3
Decoder: multimedia API (instead of gstreamer)
NVPMode: Max-N, Fixed at highest clock rate
No DeepStream SDK

We have already applied the fix described in the following post:
https://devtalk.nvidia.com/default/topic/1036111/jetson-tx2/-mmapi-decode-more-than-one-h264-file-get-problem-when-trying-to-modify-00_video_decode/post/5265055/#5265055

[Backtrace]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000007f5e352bd8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so
[Current thread is 1 (Thread 0x7d1111d050 (LWP 17596))]
(gdb) bt
#0  0x0000007f5e352bd8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so
#1  0x0000007f4b0a7688 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvparser.so
#2  0x0000007f4b0a8c60 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvparser.so
#3  0x0000007f4b0a5f6c in video_parser_parse () from /usr/lib/aarch64-linux-gnu/tegra/libnvparser.so
#4  0x0000007f5e34ac90 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so
#5  0x0000007f5e3e36dc in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvos.so
#6  0x0000007f5e803fc4 in start_thread (arg=0x7f5e3e36a0) at pthread_create.c:335
#7  0x0000007f5e5082e0 in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:89

[dmesg log]

[Tue Jan 15 16:01:29 2019] track_console_f[17596]: unhandled level 2 translation fault (11) at 0x00000070, esr 0x92000046
[Tue Jan 15 16:01:29 2019] pgd = ffffffc1a4ba8000
[Tue Jan 15 16:01:29 2019] [00000070] *pgd=000000022ff2d003, *pud=000000022ff2d003, *pmd=0000000000000000
 
[Tue Jan 15 16:01:29 2019] CPU: 4 PID: 17596 Comm: track_console_f Not tainted 4.4.38-tegra #1
[Tue Jan 15 16:01:29 2019] Hardware name: quill (DT)
[Tue Jan 15 16:01:29 2019] task: ffffffc1a6d50c80 ti: ffffffc137140000 task.ti: ffffffc137140000
[Tue Jan 15 16:01:30 2019] PC is at 0x7f5e352bd8
[Tue Jan 15 16:01:30 2019] LR is at 0x7f5e352ff8
[Tue Jan 15 16:01:30 2019] pc : [<0000007f5e352bd8>] lr : [<0000007f5e352ff8>] pstate: 60000000
[Tue Jan 15 16:01:30 2019] sp : 0000007d1111a8c0
[Tue Jan 15 16:01:30 2019] x29: 0000007d1111c840 x28: 00000000ffffffff
[Tue Jan 15 16:01:30 2019] x27: 000000007b2296b0 x26: 0000000000000000
[Tue Jan 15 16:01:30 2019] x25: 0000000000007c9a x24: 0000000000000001
[Tue Jan 15 16:01:30 2019] x23: 0000007f5e3d9000 x22: 0000000000000000
[Tue Jan 15 16:01:30 2019] x21: 00000000ad389000 x20: 000000008772a000
[Tue Jan 15 16:01:30 2019] x19: 00000000ad388000 x18: 0000000000000003
[Tue Jan 15 16:01:30 2019] x17: 0000007f5e4baa40 x16: 0000007f5e3d92f0
[Tue Jan 15 16:01:30 2019] x15: 000000004774d000 x14: 1010101010101010
[Tue Jan 15 16:01:30 2019] x13: 1010101010101010 x12: 1010101010101010
[Tue Jan 15 16:01:30 2019] x11: 1010101010101010 x10: 1010101010101010
[Tue Jan 15 16:01:30 2019] x9 : 1010101010101010 x8 : 1010101010101010
[Tue Jan 15 16:01:30 2019] x7 : 1010101010101010 x6 : 0000000000000001
[Tue Jan 15 16:01:30 2019] x5 : 000000008772a420 x4 : 0000000000000000
[Tue Jan 15 16:01:30 2019] x3 : 0000000000000000 x2 : 0000000000001060
[Tue Jan 15 16:01:30 2019] x1 : 0000000000000002 x0 : 000000008b126800
 
[Tue Jan 15 16:01:30 2019] Library at 0x7f5e352bd8: 0x7f5e339000 /usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so
[Tue Jan 15 16:01:30 2019] Library at 0x7f5e352ff8: 0x7f5e339000 /usr/lib/aarch64-linux-gnu/tegra/libnvmmlite_video.so
[Tue Jan 15 16:01:30 2019] vdso base = 0x7f8150b000
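
Since the library is stripped, the frames above only show raw addresses. The sketch below converts the PC and LR values from the dmesg output into offsets relative to the reported load address of libnvmmlite_video.so, and splits the ESR value into its fields. It assumes the "Library at" address is the base of the mapping, so the offsets can be looked up in a disassembly of the library. The constant and function names are only illustrative.

```python
# Values copied from the dmesg output above.
LIB_BASE = 0x7f5e339000    # reported load address of libnvmmlite_video.so
PC = 0x7f5e352bd8          # faulting instruction
LR = 0x7f5e352ff8          # return address of the current call
ESR = 0x92000046           # exception syndrome reported by the kernel
FAULT_ADDR = 0x70          # address the program tried to access


def lib_offset(addr, base=LIB_BASE):
    """Return addr relative to the library load address."""
    return addr - base


def decode_esr(esr):
    """Split an AArch64 ESR value into the fields relevant to a data abort."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class (0x24 = data abort from a lower EL)
        "wnr": (esr >> 6) & 0x1,    # 1 = write access, 0 = read access
        "dfsc": esr & 0x3F,         # data fault status code (0x06 = translation fault, level 2)
    }


print("PC offset:", hex(lib_offset(PC)))
print("LR offset:", hex(lib_offset(LR)))
print("ESR fields:", decode_esr(ESR))
```

Under that assumption the fault is at offset 0x19bd8 in the library, and the ESR decodes to a write access. A write to address 0x70 would be consistent with a store through a NULL structure pointer at field offset 0x70, but this is only a guess that someone with the library symbols would have to confirm.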

Hi hiprince,
Can you give more details about the reproduction steps?

  1. What are the exact steps to set up the RTSP input to the TX2?
  2. Which “video analyzer program” exactly do you start? Is it the one from comment 1 in https://devtalk.nvidia.com/default/topic/1036111/jetson-tx2/-mmapi-decode-more-than-one-h264-file-get-problem-when-trying-to-modify-00_video_decode/post/5265055/#5265055 ?
  3. Does the issue only happen in Max-N NVPMode?

Thanks.

  1. What are the exact steps to set up the RTSP input to the TX2?
    It’s an RTSP server simulating IP cameras (IPC). Any IP camera with RTSP support should work.
  2. Which “video analyzer program” exactly do you start? Is it the one from comment 1 in https://devtalk.nvidia.com/default/topic/1036111/jetson-tx2/-mmapi-decode-more-than-one-h264-file-get-problem-when-trying-to-modify-00_video_decode/post/5265055/#5265055 ?
    No, it’s a program from our own product.
  3. Does the issue only happen in Max-N NVPMode?
    We didn’t test in other modes.

The crash is very hard to reproduce. We have been running 10 TX2 units for 6 days and only 1 has crashed. We plan to add another test: decoding 16x 1080p H264 RTSP inputs using code based on the multimedia API samples.
If that reproduces the crash, we will post the code and the reproduction steps.
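
A long-running test like this needs something that notices the first crash and records how the process died. The following is a minimal sketch of such a watchdog, not the harness actually used. It assumes one decoder process per stream, and the command in the usage comment (`./rtsp_decode_sample`) is a hypothetical placeholder for whatever sample-based binary is built.

```python
import signal
import subprocess
import time


def run_soak(cmd, instances, poll_seconds=5.0, max_seconds=None):
    """Start `instances` copies of cmd and wait for the first one to exit.

    Returns (index, returncode) of the first process that exits, or None if
    max_seconds elapses while all processes are still running.
    """
    procs = [subprocess.Popen(cmd) for _ in range(instances)]
    start = time.monotonic()
    try:
        while True:
            for index, proc in enumerate(procs):
                rc = proc.poll()
                if rc is not None:
                    return index, rc
            if max_seconds is not None and time.monotonic() - start > max_seconds:
                return None
            time.sleep(poll_seconds)
    finally:
        # Stop whatever is still running so no decoder is left behind.
        for proc in procs:
            if proc.poll() is None:
                proc.kill()
            proc.wait()


def describe_exit(rc):
    """Describe a Popen return code; negative values mean death by signal."""
    if rc < 0:
        try:
            return "killed by " + signal.Signals(-rc).name
        except ValueError:
            return "killed by signal %d" % -rc
    return "exited with status %d" % rc


# Hypothetical usage (binary name and argument are placeholders):
#   result = run_soak(["./rtsp_decode_sample", "rtsp://camera/stream"], instances=16)
#   if result is not None:
#       print("instance %d %s" % (result[0], describe_exit(result[1])))
```

A SIGSEGV like the one in the backtrace would show up as return code -11, which `describe_exit` reports as "killed by SIGSEGV". Enabling core dumps before starting the run would keep the state of the crashed process for gdb.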

Hi hiprince,
Thanks for the info. But step 2 of your report says “start video analyzer program”.
How can I set up this step? What is the code?

The “video analyzer program” is a binary from our product, so we can’t post the code.
This is why I propose using the multimedia API sample code to reproduce the bug.

The backtrace shows that the crash happens under video_parser_parse(), so we guess that a video header was corrupted for some unknown reason. Could a corrupted video header trigger a corner case or an unhandled error condition?
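
One way to test the corrupted-header guess on the application side is to sanity-check each NAL unit header before the data is queued to the decoder, and to log anything suspicious. The sketch below only illustrates that idea. It assumes the application passes an H.264 Annex B byte stream (start-code delimited) to the decoder, and the function names are made up for this example. It cannot show what the parser itself does with bad input.

```python
START_CODE = b"\x00\x00\x01"


def iter_nal_units(data):
    """Yield NAL units (without start codes) from an Annex B byte stream."""
    starts = []
    pos = data.find(START_CODE)
    while pos != -1:
        starts.append(pos + len(START_CODE))
        pos = data.find(START_CODE, pos + len(START_CODE))
    for n, begin in enumerate(starts):
        if n + 1 < len(starts):
            end = starts[n + 1] - len(START_CODE)
        else:
            end = len(data)
        # Trailing zero bytes belong to the next (4-byte) start code or to
        # padding; a NAL unit itself never ends in 0x00.
        yield data[begin:end].rstrip(b"\x00")


def nal_header(nal):
    """Decode the one-byte H.264 NAL unit header."""
    first = nal[0]
    return {
        "forbidden_zero_bit": first >> 7,
        "nal_ref_idc": (first >> 5) & 0x3,
        "nal_unit_type": first & 0x1F,   # e.g. 5 = IDR slice, 7 = SPS, 8 = PPS
    }


def suspicious_nals(data):
    """Return (index, reason) for NAL units whose header looks invalid."""
    problems = []
    for index, nal in enumerate(iter_nal_units(data)):
        if not nal:
            problems.append((index, "empty NAL unit"))
            continue
        header = nal_header(nal)
        if header["forbidden_zero_bit"]:
            problems.append((index, "forbidden_zero_bit set"))
        elif header["nal_unit_type"] == 0:
            problems.append((index, "unspecified nal_unit_type 0"))
    return problems
```

Logging the result of `suspicious_nals()` together with a timestamp would show whether a damaged SPS/PPS or slice header arrived shortly before a crash. A clean log would not rule out corruption deeper inside the payload.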

Hi hiprince,
Thanks for the info. I agree that we should first reproduce with the multimedia API sample code to narrow this down.
Please go ahead with your plan to add the test: decode 16x 1080p H264 RTSP inputs based on the multimedia API sample code.
If possible, please share your modified version of the code (e.g. the 00 sample in MM-API) here.
I can then run the same code in parallel on my side.
Thanks.