Hi,
When running the following (relatively simple) pipeline multiple times (but never concurrently) in a single process on a TX2 on JP4.4.1 (R32.4.4), with Deepstream 5.0.1, we eventually get segfaults/pipeline stalls due to memory corruption inside the nvv4l2decoder element.
Here, the appsrc is supplying userspace (heap allocated - no alignment guarantees) mjpeg buffers which came from an upstream USB camera (v4l2src). For context, other parts of the application are also decoding mjpeg frames directly from the v4l2src using nvv4l2decoder mjpeg=1.
appsrc num-buffers=200 ! image/jpeg,framerate=15/1,width=1920,height=1080 ! tee name=t
t. ! queue !
videorate drop-only=1 average-period=1000000000 !
image/jpeg,framerate=1/5 ! queue !
nvv4l2decoder num-extra-surfaces=3 mjpeg=1 !
nvvideoconvert nvbuf-memory-type=4 interpolation-method=3 !
video/x-raw(memory:NVMM),format=NV12,width=192,height=108 !
nvvideoconvert ! videoconvert ! jpegenc quality=50 ! fakesink sync=0
t. ! queue !
nvv4l2decoder num-extra-surfaces=3 mjpeg=1 !
nvvideoconvert nvbuf-memory-type=4 src-crop='0:0:1920:1080' dest-crop='0:0:1920:1080' !
video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080 !
nvvideoconvert nvbuf-memory-type=4 interpolation-method=5 !
video/x-raw(memory:NVMM),format=NV12,height=270,width=480,framerate=15/1 !
nvv4l2h264enc bitrate=50000 control-rate=0 iframeinterval=500 profile=4 maxperf-enable=1 insert-sps-pps=1 !
h264parse ! qtmux ! filesink location=/tmp/test.mp4
The gdb stack trace where the fault occurs shows it happens in gst_buffer_copy_into in gstbuffer.c:L627:
(gdb) bt
#0 0x0000007f9921ac6c in gst_buffer_copy_into (dest=0x7e1868ce30, src=<optimised out>, flags=<optimised out>, offset=548030926848, size=<optimised out>) at gstbuffer.c:627
#1 0x0000007f308ec0f8 in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvvideo4linux2.so
#2 0x0000007f99292eb4 in gst_task_func (task=0x7e417ce830) at gsttask.c:332
#3 0x0000007f99144440 in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
#4 0x0000007f991d8e80 in __glib_assert_msg () from /tegra_root/usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
while the other thread of that element (nvv4l2decoder) stalls in a pthread_cond_wait inside libtegrav4l2.so (same as this post: Jetson h264 decoder flush deadlock - #12 by khizbulin):
(gdb) bt
#0 0x0000007f9943422c in futex_wake (private=<optimised out>, processes_to_wake=1, futex_word=0x7ed8086a48) at ../sysdeps/unix/sysv/linux/futex-internal.h:235
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x7ed80869f0, cond=0x7ed8086a20) at pthread_cond_wait.c:628
#2 __pthread_cond_wait (cond=0x7ed8086a20, mutex=0x7ed80869f0) at pthread_cond_wait.c:655
#3 0x0000007f986e6fdc in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/tegra/libnvos.so
#4 0x0000007f10011bf0 in TegraV4L2_Poll_CPlane () from /tegra_root/usr/lib/aarch64-linux-gnu/tegra/libtegrav4l2.so
#5 0x0000007f1004f274 in plugin_ioctl () from /tegra_root/usr/lib/aarch64-linux-gnu/libv4l/plugins/nv/libv4l2_nvvideocodec.so
#6 0x0000007f32b8a5c0 in v4l2_ioctl () from /tegra_root/usr/lib/aarch64-linux-gnu/libv4l2.so.0
#7 0x0000007f308d38cc in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvvideo4linux2.so
#8 0x0000007f308d7d1c in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvvideo4linux2.so
#9 0x0000007f99220600 in gst_buffer_pool_acquire_buffer (pool=0x7f3c054e90, buffer=0x7de935bc58, params=0x0) at gstbufferpool.c:1265
#10 0x0000007f308ebd80 in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvvideo4linux2.so
#11 0x0000007f99292eb4 in gst_task_func (task=0x7ec81a73b0) at gsttask.c:332
#12 0x0000007f99144440 in ?? () from /tegra_root/usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
#13 0x0000007f991d8e80 in __glib_assert_msg () from /tegra_root/usr/lib/aarch64-linux-gnu/libglib-2.0.so.0
In gstbuffer.c, the segfault is here:
...
for (walk = GST_BUFFER_META (src); walk; walk = walk->next) {
GstMeta *meta = &walk->meta;
const GstMetaInfo *info = meta->info;
/* Don't copy memory metas if we only copied part of the buffer, didn't
* copy memories or merged memories. In all these cases the memory
* structure has changed and the memory meta becomes meaningless.
*/
if ((region || !(flags & GST_BUFFER_COPY_MEMORY)
|| (flags & GST_BUFFER_COPY_MERGE))
L627 ->>>> && gst_meta_api_type_has_tag (info->api, _gst_meta_tag_memory)) {
GST_CAT_DEBUG (GST_CAT_BUFFER,
"don't copy memory meta %p of API type %s", meta,
g_type_name (info->api));
} else if (info->transform_func) {
GstMetaTransformCopy copy_data;
...
Running p meta and p *meta shows:
$4 = (GstMeta *) 0x7e41776498
$5 = {flags = GST_META_FLAG_NONE, info = 0x20}
Clearly info should be a pointer to a _GstMetaInfo, located either in heap memory allocated by gstreamer or in the .data segment, not 0x20, so something is likely clobbering a GstBufferImpl.
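One way to catch this kind of corruption before the crash site is to validate each meta's info pointer while walking the list. The following is only a minimal sketch of that idea: MockMetaInfo, MockMeta, MockMetaItem, info_pointer_is_plausible and first_corrupt_meta are hypothetical names that merely mimic the fields used by the loop quoted above, not GStreamer API, and the 4 KiB lower bound is an assumption about the platform's address space layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for GstMetaInfo / GstMeta / the meta list item.
 * They only mimic the fields touched by the loop in gstbuffer.c. */
typedef struct MockMetaInfo {
    unsigned long api;               /* stands in for the GType `api` field */
} MockMetaInfo;

typedef struct MockMeta {
    int flags;
    const MockMetaInfo *info;
} MockMeta;

typedef struct MockMetaItem {
    struct MockMetaItem *next;
    MockMeta meta;
} MockMetaItem;

/* Assumption: nothing valid is ever mapped in the first 4 KiB page. */
#define LOWEST_PLAUSIBLE_ADDR ((uintptr_t)0x1000)

/* Returns 1 if `p` could plausibly point at a MockMetaInfo, else 0.
 * This is a heuristic: it rejects tiny values such as 0x20 and
 * misaligned addresses, but cannot prove a pointer is valid. */
static int info_pointer_is_plausible(const MockMetaInfo *p)
{
    uintptr_t addr = (uintptr_t)p;
    if (addr < LOWEST_PLAUSIBLE_ADDR)
        return 0;
    if (addr % _Alignof(MockMetaInfo) != 0)
        return 0;
    return 1;
}

/* Walks the list and returns the index of the first item whose info
 * pointer looks corrupt, or -1 if every item passes the heuristic. */
static int first_corrupt_meta(const MockMetaItem *head)
{
    int index = 0;
    for (const MockMetaItem *walk = head; walk != NULL; walk = walk->next) {
        if (!info_pointer_is_plausible(walk->meta.info))
            return index;
        index++;
    }
    return -1;
}
```

With the value from the gdb session, an item whose info is 0x20 is rejected by the first check, which matches the observation that dereferencing info->api faults immediately.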
The relevant code in gst-nvvideo4linux2_src/gst-v4l2/gstv4l2videodec.c is (L1197/L1560):
// L1194
#if USE_V4L2_TARGET_NV
if (!gst_buffer_copy_into (frame->output_buffer, frame->input_buffer,
(GstBufferCopyFlags)GST_BUFFER_COPY_METADATA, 0, -1)) {
GST_DEBUG_OBJECT (decoder, "Buffer metadata copy failed \n");
}
...
// L1556
/* No need to keep input arround */
tmp = frame->input_buffer;
frame->input_buffer = gst_buffer_new ();
gst_buffer_copy_into (frame->input_buffer, tmp,
GST_BUFFER_COPY_FLAGS | GST_BUFFER_COPY_TIMESTAMPS |
GST_BUFFER_COPY_META, 0, 0);
gst_buffer_unref (tmp);
So something is corrupting memory. The fact that this only seems to happen when running the same pipeline repeatedly, as opposed to when running for a large number of frames, suggests that the corruption is likely in the initialisation code of the elements, and not in the steady-state flow of buffers.
The problem looks to be very similar to this (closed but unsolved) post: Segmentation faults and memory corruption with nvv4l2decoder
To try to isolate the problem further, I ran valgrind on a minimal nvv4l2decoder pipeline:
valgrind --sim-hints=lax-ioctls gst-launch-1.0 -e videotestsrc num-buffers=1 ! video/x-raw,format=NV12,width=1920,height=1080 ! nvvidconv ! nvjpegenc ! nvv4l2decoder ! fakesink
And this showed a number of uninitialised memory bugs in the closed source dependencies of nvjpegenc:
==4613== Conditional jump or move depends on uninitialised value(s)
==4613== at 0x5B4B3D4: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvrm_graphics.so)
==4613== by 0x5B89B97: NvDdkVicExecute (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_vic.so)
==4613== by 0x87C3707: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87B6ECB: NvDdk2dBlitExt (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x8796C6F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x87975E3: jpegTegraEncoderCompress (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x8761353: jpeg_write_raw_data (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x5ABCB07: ??? (in /usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvjpeg.so)
==4613== by 0x597C753: ??? (in /usr/lib/aarch64-linux-gnu/libgstvideo-1.0.so.0.1405.0)
==4613==
==4613== Conditional jump or move depends on uninitialised value(s)
==4613== at 0x87BCCA4: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87BCE83: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87C37E3: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87B6ECB: NvDdk2dBlitExt (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x8796C6F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x87975E3: jpegTegraEncoderCompress (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x8761353: jpeg_write_raw_data (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x5ABCB07: ??? (in /usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvjpeg.so)
==4613== by 0x597C753: ??? (in /usr/lib/aarch64-linux-gnu/libgstvideo-1.0.so.0.1405.0)
==4613==
==4613== Conditional jump or move depends on uninitialised value(s)
==4613== at 0x87BC97C: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87BCB3F: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x87BCC3F: NvDdk2dSurfaceLock (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_2d_v2.so)
==4613== by 0x8796C07: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x87975E3: jpegTegraEncoderCompress (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x8761353: jpeg_write_raw_data (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x5ABCB07: ??? (in /usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvjpeg.so)
==4613== by 0x597C753: ??? (in /usr/lib/aarch64-linux-gnu/libgstvideo-1.0.so.0.1405.0)
==4613==
==4613== Conditional jump or move depends on uninitialised value(s)
==4613== at 0x885B7BC: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)
==4613== by 0x887B3D3: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)
==4613== by 0x887BDF7: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvtvmr.so)
==4613== by 0x8797207: jpegTegraEncoderCompress (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x8761353: jpeg_write_raw_data (in /usr/lib/aarch64-linux-gnu/tegra/libnvjpeg.so)
==4613== by 0x5ABCB07: ??? (in /usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvjpeg.so)
==4613== by 0x597C753: ??? (in /usr/lib/aarch64-linux-gnu/libgstvideo-1.0.so.0.1405.0)
and nvvidconv:
==4613== Conditional jump or move depends on uninitialised value(s)
==4613== at 0x5B4B3D4: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvrm_graphics.so)
==4613== by 0x5B89B97: NvDdkVicExecute (in /usr/lib/aarch64-linux-gnu/tegra/libnvddk_vic.so)
==4613== by 0x5AEAE57: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbuf_utils.so.1.0.0)
==4613== by 0x5AEC457: NvBufferTransform (in /usr/lib/aarch64-linux-gnu/tegra/libnvbuf_utils.so.1.0.0)
==4613== by 0x5904FBB: ??? (in /usr/lib/aarch64-linux-gnu/gstreamer-1.0/libgstnvvidconv.so)
==4613== by 0x5A0711F: ??? (in /usr/lib/aarch64-linux-gnu/libgstbase-1.0.so.0.1405.0)
==4613==
and possibly a memcpy bug in nvv4l2decoder:
==4613== Source and destination overlap in memcpy(0x8960180, 0x8960180, 208)
==4613== at 0x484B080: __GI_memcpy (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==4613==
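For context on that last report: memcpy has undefined behaviour when source and destination overlap, and memmove is the defined alternative. Here both addresses are identical (0x8960180), so this is a self-copy, which is often harmless in practice but still formally undefined. A minimal sketch of the check follows; ranges_overlap and overlap_safe_copy are hypothetical helper names and do not appear in the NVIDIA sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if [dst, dst+n) and [src, src+n) share at least one byte. */
static int ranges_overlap(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    if (n == 0)
        return 0;
    /* The ranges overlap when the distance between the start
     * addresses is smaller than the length being copied. */
    return (d < s) ? (s - d < n) : (d - s < n);
}

/* Hypothetical helper: a copy that stays well defined for any inputs.
 * A self-copy is skipped, overlapping ranges go through memmove,
 * and only disjoint ranges use memcpy. */
static void *overlap_safe_copy(void *dst, const void *src, size_t n)
{
    if (dst == src)
        return dst;
    if (ranges_overlap(dst, src, n))
        return memmove(dst, src, n);
    return memcpy(dst, src, n);
}
```

Calling ranges_overlap with the two addresses and the length from the valgrind line (208 bytes) reports an overlap, since the distance between the start addresses is zero.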
When trying to valgrind the same thing on an Orin Nano on R35.4.1, I hit the same problem as this post (also closed but unsolved): Unhandled instruction by Valgrind in libcuda, so I can’t comment on whether the problem is fixed in the latest software versions:
valgrind --sim-hints=lax-ioctls gst-launch-1.0 videotestsrc ! video/x-raw,format=NV12,width=1920,height=1080 ! nvvidconv ! nvjpegenc ! nvv4l2decoder num-extra-surfaces=3 ! fakesink
==468852== Memcheck, a memory error detector
==468852== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==468852== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==468852== Command: gst-launch-1.0 videotestsrc ! video/x-raw,format=NV12,width=1920,height=1080 ! nvvidconv ! nvjpegenc ! nvv4l2decoder num-extra-surfaces=3 ! fakesink
==468852==
ARM64 front end: load_store
disInstr(arm64): unhandled instruction 0xB8A18002
disInstr(arm64): 1011'1000 1010'0001 1000'0000 0000'0010
==468852== valgrind: Unrecognised instruction at address 0x7859958.
==468852== at 0x7859958: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x77D7A7B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x79B356B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x7807013: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x6589ADF: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x4B9F3B7: __pthread_once_slow (pthread_once.c:116)
==468852== by 0x65CED93: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x65801F7: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x65A4C53: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x6394953: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x400E8B3: call_init.part.0 (dl-init.c:72)
==468852== by 0x400E9B3: call_init (dl-init.c:30)
==468852== by 0x400E9B3: _dl_init (dl-init.c:119)
==468852== Your program just tried to execute an instruction that Valgrind
==468852== did not recognise. There are two possible reasons for this.
==468852== 1. Your program has a bug and erroneously jumped to a non-code
==468852== location. If you are running Memcheck and you just saw a
==468852== warning about a bad jump, it's probably your program's fault.
==468852== 2. The instruction is legitimate but Valgrind doesn't handle it,
==468852== i.e. it's Valgrind's fault. If you think this is the case or
==468852== you are not sure, please let us know and we'll try to fix it.
==468852== Either way, Valgrind will now raise a SIGILL signal which will
==468852== probably kill your program.
==468852==
==468852== Process terminating with default action of signal 4 (SIGILL)
==468852== Illegal opcode at address 0x7859958
==468852== at 0x7859958: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x77D7A7B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x79B356B: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x7807013: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1)
==468852== by 0x6589ADF: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x4B9F3B7: __pthread_once_slow (pthread_once.c:116)
==468852== by 0x65CED93: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x65801F7: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x65A4C53: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x6394953: ??? (in /usr/lib/aarch64-linux-gnu/tegra/libnvbufsurftransform.so.1.0.0)
==468852== by 0x400E8B3: call_init.part.0 (dl-init.c:72)
==468852== by 0x400E9B3: call_init (dl-init.c:30)
==468852== by 0x400E9B3: _dl_init (dl-init.c:119)
==468852==
==468852== HEAP SUMMARY:
==468852== in use at exit: 2,138,073 bytes in 24,733 blocks
==468852== total heap usage: 46,982 allocs, 22,249 frees, 6,999,468 bytes allocated
==468852==
==468852== LEAK SUMMARY:
==468852== definitely lost: 16,384 bytes in 1 blocks
==468852== indirectly lost: 0 bytes in 0 blocks
==468852== possibly lost: 4,308 bytes in 62 blocks
==468852== still reachable: 2,022,029 bytes in 24,343 blocks
==468852== of which reachable via heuristic:
==468852== length64 : 80 bytes in 2 blocks
==468852== newarray : 1,552 bytes in 17 blocks
==468852== suppressed: 0 bytes in 0 blocks
==468852== Rerun with --leak-check=full to see details of leaked memory
==468852==
==468852== For lists of detected and suppressed errors, rerun with: -s
==468852== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction (core dumped)
It might be a good idea for the software teams to run all Nvidia gstreamer elements through valgrind before each release, since most of the guilty components are closed source.
Since the target platform for this problem is TX2 (Orin on R35.4.1 was only for comparison), I’d ideally like a patch that can be applied to the guilty library without requiring a full Jetpack update.
If you have any suggestions of how to progress this investigation, I’d be very grateful.