The device experienced a kernel panic during operation, leading to a reboot

Hi,
We have identified some devices that experience unexplained reboots during operation. We have captured the crash log files from the /sys/fs/pstore directory. Please assist in analyzing the cause of these reboots. Thank you.
Module: NVIDIA Jetson Xavier AGX, JetPack 4.6
console-ramoops-0.txt (9.2 KB)
dmesg-ramoops-0.txt (106.6 KB)
dmesg-ramoops-1.txt (106.6 KB)
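
For reference, a minimal sketch of how the pstore files above can be copied off the device so they are not lost (Python; the archive directory is only an example, and reading /sys/fs/pstore typically requires root):

#!/usr/bin/env python3
# Minimal sketch: copy the pstore crash logs (console-ramoops, dmesg-ramoops)
# into an archive directory so they survive later boots. Assumes the standard
# /sys/fs/pstore mount; DEST is only an example path.
import shutil
from pathlib import Path

PSTORE = Path("/sys/fs/pstore")
DEST = Path("/var/log/pstore-archive")   # example destination, adjust as needed

DEST.mkdir(parents=True, exist_ok=True)
for entry in sorted(PSTORE.glob("*")):
    if entry.is_file():
        shutil.copy2(entry, DEST / entry.name)
        print(f"saved {entry.name} ({entry.stat().st_size} bytes)")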

Could you test JetPack 4.6.4, as the kernel version has been updated?

Could you please analyze the logs to identify the cause of the reboots? We will also upgrade JetPack for validation.

No, we cannot identify the cause of the reboot.

I see that the Ethernet driver has printed some log messages, but I don’t know whether the Ethernet driver is really the cause or whether something else is leading to this issue.

The best approach here is for you to try to reproduce this on an NVIDIA devkit so that we can check it further.

I do see this in one of the logs:

<3>[10320.507534] Memory cgroup out of memory: Kill process 11456 (yurthub) score 0 or sacrifice child
<3>[10320.507814] Killed process 11456 (yurthub) total-vm:1026144kB, anon-rss:220132kB, file-rss:0kB, shmem-rss:0kB
<6>[10320.531062] oom_reaper: reaped process 11456 (yurthub), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
<4>[10374.168900] yurthub invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=-998
<6>[10374.169116] yurthub cpuset=docker-1c1c98065b3e3033b0f97bcedff74f52fdb61a4bdcb29dbff8fd7635a9755598.scope mems_allowed=0
<6>[10374.169140] CPU: 6 PID: 58893 Comm: yurthub Not tainted 4.9.253-tegra #5
<6>[10374.169143] Hardware name: Jetson-AGX (DT)
<6>[10374.169147] Call trace:
<6>[10374.169159] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
<6>[10374.169164] [<ffffff800808c004>] show_stack+0x24/0x30
<6>[10374.169171] [<ffffff8008f87814>] dump_stack+0xa0/0xc4
<6>[10374.169176] [<ffffff8008f8564c>] dump_header+0x6c/0x1b8
<6>[10374.169181] [<ffffff80081c8fc4>] oom_kill_process+0x29c/0x4c8
<6>[10374.169188] [<ffffff80081c969c>] out_of_memory+0x1e4/0x308
<6>[10374.169194] [<ffffff80082478a0>] mem_cgroup_out_of_memory+0x50/0x70
<6>[10374.169199] [<ffffff800824dc0c>] mem_cgroup_oom_synchronize+0x36c/0x3b8
<6>[10374.169203] [<ffffff80081c97e8>] pagefault_out_of_memory+0x28/0x78
<6>[10374.169208] [<ffffff80080a2688>] do_page_fault+0x470/0x480
<6>[10374.169212] [<ffffff80080a2704>] do_translation_fault+0x6c/0x80
<6>[10374.169217] [<ffffff8008080954>] do_mem_abort+0x54/0xb0
<6>[10374.169220] [<ffffff8008083408>] el0_da+0x20/0x24
<6>[10374.169243] Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod206d390a4672e8fd71e8d4ed7d97e36c.slice/docker-7291c8ae34289ec948b0e622b3966c3d58551617d7f71eb1e45894afdd442d0b.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod206d390a4672e8fd71e8d4ed7d97e36c.slice
<6>[10374.169262] memory: usage 307200kB, limit 307200kB, failcnt 102733
<6>[10374.169266] memory+swap: usage 307200kB, limit 9007199254740988kB, failcnt 0
<6>[10374.169269] kmem: usage 4896kB, limit 9007199254740988kB, failcnt 0
<6>[10374.169272] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod206d390a4672e8fd71e8d4ed7d97e36c.slice: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB

Any time you see oom_kill_process (which is invoked to protect the system when out_of_memory is reached), it means the available memory has been exhausted (and on a Jetson the GPU shares that same RAM) and the kernel is killing a user space process in order to survive. In this log the limit being hit is the memory cgroup’s 307200 kB limit: yurthub is killed at 10320, and another yurthub process invokes the OOM killer again at 10374, so the kill is not resolving the problem. I don’t know what yurthub is, but it is the process repeatedly involved in the OOM kills. The fact that the CPU was in exception level 0 (el0) implies it was user space code and not a kernel driver.
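
If you want to see which processes the kernel would pick first under memory pressure, the per-process badness values can be read straight from /proc; a minimal sketch (the oom_score_adj of -998 seen above for yurthub is what makes it a low-priority kill target):

#!/usr/bin/env python3
# Minimal sketch: list processes by their current OOM badness score so you can
# see what the kernel would pick first under memory pressure. Reads only the
# standard /proc/<pid>/comm, oom_score, and oom_score_adj files.
from pathlib import Path

rows = []
for proc in Path("/proc").iterdir():
    if not proc.name.isdigit():
        continue
    try:
        comm = (proc / "comm").read_text().strip()
        score = int((proc / "oom_score").read_text())
        adj = int((proc / "oom_score_adj").read_text())
    except (FileNotFoundError, ProcessLookupError, PermissionError):
        continue  # process exited or is inaccessible
    rows.append((score, adj, proc.name, comm))

for score, adj, pid, comm in sorted(rows, reverse=True)[:15]:
    print(f"pid={pid:>7} oom_score={score:>5} oom_score_adj={adj:>5} {comm}")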

This might or might not be related, but most user space applications have no problem using virtual memory (which implies swap). On the other hand, many kernel space programs (drivers, which run in exception level 1, kernel space) can only use physical memory and cannot be swapped out. The GPU is the most important example of that. Swap can help in some cases, but not in others.
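
To rule swap in or out, it is worth checking whether any swap is configured on the device and how much of it is free; a minimal sketch using only the standard /proc/swaps and /proc/meminfo interfaces:

#!/usr/bin/env python3
# Minimal sketch: report configured swap devices and current memory/swap usage
# from the standard /proc interfaces.
from pathlib import Path

print(Path("/proc/swaps").read_text().rstrip())

for line in Path("/proc/meminfo").read_text().splitlines():
    if line.startswith(("MemTotal", "MemFree", "SwapTotal", "SwapFree")):
        print(line)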

Maybe something failed with swap, but it appears that at least the yurthub process could not be killed off cleanly, due to something unusual in how it uses memory. Docker (and, judging by the kubepods cgroup path, Kubernetes) was involved, but I can’t really say what that relationship is.
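
Since the limit being hit in the log is the container’s memory cgroup limit (307200 kB) rather than total system RAM, it is also worth reading that limit and its failure count directly on the device; a minimal sketch assuming the cgroup v1 layout used by this 4.9 kernel (the scope path below is only an example and should be replaced with the full kubepods/docker path from the OOM log):

#!/usr/bin/env python3
# Minimal sketch: read usage, limit, and failure count for a memory cgroup
# (cgroup v1, as used by the 4.9 Tegra kernel). CGROUP_PATH is an example;
# substitute the pod/container scope path reported in the OOM log.
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup/memory")
CGROUP_PATH = "kubepods.slice"   # example; use the full pod/container scope path

base = CGROUP_ROOT / CGROUP_PATH
for name in ("memory.usage_in_bytes", "memory.limit_in_bytes", "memory.failcnt"):
    value = (base / name).read_text().strip()
    print(f"{name}: {value}")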

