Random double free or corruption in libnvscf.so

MarkusHess · July 6, 2021, 8:58am

Hi,

I am getting a double free or corruption (out) error when capturing images from three cameras on the Jetson TX2. The backtrace shows that this occurs inside libnvscf.so:

(gdb) where
#0  __GI_abort () at abort.c:107
#1  0x0000007f7f5f868c in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f7f6b96f8 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#2  0x0000007f7f5fea04 in malloc_printerr (str=str@entry=0x7f7f6b53d8 "double free or corruption (out)") at malloc.c:5342
#3  0x0000007f7f600664 in _int_free (av=0x7f7f6dfa70 <main_arena>, p=0x7f7b507920, have_lock=<optimized out>) at malloc.c:4308
#4  0x0000007f7aee9288 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#5  0x0000007f7af422d4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#6  0x0000007f7af476dc in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#7  0x0000007f7af4b6c8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#8  0x0000007f7af1de58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
#9  0x0000007f7ad44628 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvos.so
#10 0x0000007f7f95d088 in start_thread (arg=0x7fc306a30f) at pthread_create.c:463
#11 0x0000007f7f65bffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) bt full
#0  __GI_abort () at abort.c:107
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {18446744073709551615 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#1  0x0000007f7f5f868c in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f7f6b96f8 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
        ap = {__stack = 0x7f6df36f50, __gr_top = 0x7f6df36f50, __vr_top = 0x7f6df36f20, __gr_offs = -40, __vr_offs = 0}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#2  0x0000007f7f5fea04 in malloc_printerr (str=str@entry=0x7f7f6b53d8 "double free or corruption (out)") at malloc.c:5342
No locals.
#3  0x0000007f7f600664 in _int_free (av=0x7f7f6dfa70 <main_arena>, p=0x7f7b507920, have_lock=<optimized out>) at malloc.c:4308
        size = 367832723872
        fb = <optimized out>
        nextchunk = 0xd51fda6ac0
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        __PRETTY_FUNCTION__ = "_int_free"
#4  0x0000007f7aee9288 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
No symbol table info available.
#5  0x0000007f7af422d4 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
No symbol table info available.
#6  0x0000007f7af476dc in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
No symbol table info available.
#7  0x0000007f7af4b6c8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
No symbol table info available.
#8  0x0000007f7af1de58 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvscf.so
No symbol table info available.
#9  0x0000007f7ad44628 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvos.so
No symbol table info available.
#10 0x0000007f7f95d088 in start_thread (arg=0x7fc306a30f) at pthread_create.c:463
        pd = 0x7fc306a30f
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {547305521424, 548732838672, 548732838670, 548732838671, 0, 4096, 548732838672, 547601502208, 547305521424, 1, 547305519376, 4506411868204586633, 0, 4506411867896741829, 0, 0, 0, 0, 0, 
                0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#11 0x0000007f7f65bffc in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
No locals.
(gdb)
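
A note on reading frame #3, offered as a sketch rather than a diagnosis: in glibc's `_int_free`, the "(out)" variant of this message is raised when the chunk that would follow the one being freed lies beyond the end of the arena. The locals in the trace are consistent with that. The snippet below only redoes the pointer arithmetic with the values printed by gdb; the variable names `computed_next` and `size_gib` are introduced here for illustration.

```python
# Values copied from frame #3 (_int_free) of the backtrace.
p = 0x7F7B507920          # chunk pointer handed to _int_free
size = 367832723872       # chunk size as read from the chunk header
nextchunk = 0xD51FDA6AC0  # gdb's value for the local `nextchunk`

# glibc locates the next chunk `size` bytes past p.
computed_next = p + size
size_gib = size / 2**30

print(hex(computed_next))          # 0xd51fda6ac0
print(computed_next == nextchunk)  # True
print(f"{size_gib:.1f} GiB")       # 342.6 GiB
```

A chunk size of several hundred GiB is not plausible on this device, and the resulting next-chunk address lies far above every other address in the trace. That suggests the chunk header (or the pointer itself) was already invalid when free was called. It does not tell which component wrote the bad value.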

The crash is random. Sometimes it happens after a few minutes, sometimes it takes multiple hours. This is happening with L4T 32.5.1. On L4T 32.3.1, we did not see this behavior.

I would appreciate it if someone could help me solve this issue.

Thanks!

ShaneCCC · July 6, 2021, 9:18am

Which app can be used to reproduce the issue?

MarkusHess · July 6, 2021, 9:52am

Hi ShaneCCC,

Unfortunately, I cannot share any code. We have two containers running. One of them (the crashing one) captures the images using libargus and provides them over a Unix socket. The second container reads the images from this socket and runs some neural networks on them. It seems that the crash does not occur if the second container is not running, or at least it takes longer to occur. I will ask our customer (who provides the second container) whether we can share it with you.
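
For readers who want to approximate this setup without the second container, a stand-in consumer on the socket may be enough to test whether merely having a reader attached changes the crash rate. The sketch below is an assumption throughout: the thread does not describe the actual wire format, so the 4-byte length prefix and the helper names `send_frame`, `recv_exact`, and `recv_frame` are hypothetical.

```python
import socket
import struct

# Assumed framing: 4-byte big-endian payload length, then the payload.
HEADER = struct.Struct("!I")

def send_frame(sock, payload):
    """Send one length-prefixed frame."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_frame(sock):
    """Receive one length-prefixed frame and return its payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

if __name__ == "__main__":
    # A connected pair of Unix stream sockets stands in for the two containers.
    producer, consumer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    fake_image = bytes(range(256)) * 4  # placeholder for captured pixel data
    send_frame(producer, fake_image)
    assert recv_frame(consumer) == fake_image
    producer.close()
    consumer.close()
```

In a real test the consumer would loop on `recv_frame` and discard the data, which keeps the socket drained without involving any neural-network code.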

ShaneCCC · July 7, 2021, 2:32am

Please check whether you can reproduce the issue with any of the sample code; then we can help with it.

zschutschke · August 3, 2021, 9:08am

Hi Shane,

I created a “minimal” example for the issue, stripping away a lot of our code. It still has the basic class structure and a comparable setup procedure. On our Jetson, it crashes sporadically (sometimes after a few images, sometimes after a few hundred thousand). The stack trace is the same as the one Markus reported.

minimal.tbz2 (7.1 KB)

Please let me know if this is sufficient for reproducing the issue.

Best,

Axel

ShaneCCC · August 3, 2021, 12:31pm

@zschutschke
It would be better if you could reproduce this with the multimedia API sample code. Otherwise, you can narrow it down to the specific code that causes the problem; then I can report it to the developers to get help.

sudo apt list -a nvidia-l4t-jetson-multimedia-api
sudo apt install nvidia-l4t-jetson-multimedia-api=32.5.xxxxxxx

zschutschke · August 11, 2021, 6:59am

Hi Shane,

Just a little heads-up (also to keep this thread alive): I am still working on the issue. I tried the Argus syncSensor example, as it also uses the CUeglStream* interfaces and multiple devices, but did not observe any issues. I have now reduced the number of active cameras in our minimal example from 3 to 2, which also makes the problem go away (at least it has not crashed for several days now). The next step is to increase the number of cameras in the Argus sample and see what happens. I will report back once I have new observations and code based on the NVIDIA sample.

Best,

Axel
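
Because the crash can take anywhere from a few images to several days to appear, comparing configurations (for example 2 versus 3 cameras) by hand is slow. One possible approach is a small wrapper that runs a reproducer, records how long it survived, and flags whether it died from SIGABRT, which is what glibc's abort produces in the backtrace. This is a sketch under assumptions: `run_once` and `aborted` are names made up here, and the demo command is a child process that aborts itself, not the real reproducer.

```python
import signal
import subprocess
import sys
import time

def run_once(cmd, timeout_s):
    """Run cmd once; return (elapsed_seconds, returncode or None on timeout)."""
    start = time.monotonic()
    try:
        returncode = subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        returncode = None
    return time.monotonic() - start, returncode

def aborted(returncode):
    """True if the child was terminated by SIGABRT (POSIX: negative signal number)."""
    return returncode == -signal.SIGABRT

if __name__ == "__main__":
    # Demo only: a child that calls abort() stands in for the crashing capture app.
    demo_cmd = [sys.executable, "-c", "import os; os.abort()"]
    elapsed, rc = run_once(demo_cmd, timeout_s=30)
    print(f"survived {elapsed:.1f}s, returncode={rc}, abort={aborted(rc)}")
```

To use it on the device, `demo_cmd` would be replaced by the reproducer's command line and the timeout raised to match the longest run worth waiting for. Repeating the call in a loop and logging each result gives a rough time-to-crash distribution per configuration.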

system · October 10, 2021, 7:00am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.