I’ve had two kernel Oopses in the last month or so on one of my devices. I have about 20 other devices running the same code that haven’t experienced this. In both cases there seems to be an error when trying to access a virtual memory address, and the call trace is only that address. This doesn’t happen during boot or anything; it just occurs after some time, and it seems to be rather rare/infrequent, considering these systems are on and running 24/7.
These are Jetson TX2s running Ubuntu 18.04.6 LTS (GNU/Linux 4.9.201-tegra aarch64)
Output from dmesg -T:
Incident #1
[Mon Aug 7 23:15:44 2023] Unable to handle kernel paging request at virtual address ffffff80f3143d48
[Mon Aug 7 23:15:44 2023] Mem abort info:
[Mon Aug 7 23:15:44 2023] ESR = 0x86000005
[Mon Aug 7 23:15:44 2023] Exception class = IABT (current EL), IL = 32 bits
[Mon Aug 7 23:15:44 2023] SET = 0, FnV = 0
[Mon Aug 7 23:15:44 2023] EA = 0, S1PTW = 0
[Mon Aug 7 23:15:44 2023] swapper pgtable: 4k pages, 39-bit VAs, pgd = ffffff800a1fb000
[Mon Aug 7 23:15:44 2023] [ffffff80f3143d48] *pgd=0000000000000000, *pud=0000000000000000
[Mon Aug 7 23:15:44 2023] Internal error: Oops: 86000005 [#1] PREEMPT SMP
[Mon Aug 7 23:15:44 2023] Modules linked in: macvlan veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_mark ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables xt_nat xt_tcpudp xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter zram smsc95xx ftdi_sio overlay cdc_acm bcmdhd cfg80211 userspace_alert nvgpu bluedroid_pm ip_tables x_tables
[Mon Aug 7 23:15:44 2023] CPU: 4 PID: 11080 Comm: activity_detect Not tainted 4.9.201-tegra #1
[Mon Aug 7 23:15:44 2023] Hardware name: quill (DT)
[Mon Aug 7 23:15:44 2023] task: ffffffc12598b800 task.stack: ffffffc0f3b30000
[Mon Aug 7 23:15:44 2023] PC is at 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] LR is at 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] pc : [<ffffff80f3143d48>] lr : [<ffffff80f3143d48>] pstate: 40400045
[Mon Aug 7 23:15:44 2023] sp : ffffffc0f3b33d60
[Mon Aug 7 23:15:44 2023] x29: 3dff00c000b30060 x28: ffffffc12598b800
[Mon Aug 7 23:15:44 2023] x27: ffffff8008f52000 x26: 0000000000000062
[Mon Aug 7 23:15:44 2023] x25: ffffffc0f3b33ea8 x24: 0000000000000009
[Mon Aug 7 23:15:44 2023] x23: 0000000000000000 x22: 0000000000000000
[Mon Aug 7 23:15:44 2023] x21: 0000000000000000 x20: 0000000080002102
[Mon Aug 7 23:15:44 2023] x19: ff00ff5508b4e278 x18: 0000000000000000
[Mon Aug 7 23:15:44 2023] x17: 0000000000000000 x16: 0000000000000000
[Mon Aug 7 23:15:44 2023] x15: 0000000000000000 x14: 0000000000620a38
[Mon Aug 7 23:15:44 2023] x13: 000000000000075c x12: 071c71c71c71c71c
[Mon Aug 7 23:15:44 2023] x11: 000000000000000b x10: 0000000000000a10
[Mon Aug 7 23:15:44 2023] x9 : ffffffc0f3b33b30 x8 : ffffffc12598c270
[Mon Aug 7 23:15:44 2023] x7 : fefefeff646c606d x6 : 0000000000000000
[Mon Aug 7 23:15:44 2023] x5 : 0000000000000000 x4 : ffffffc0f3b33cf8
[Mon Aug 7 23:15:44 2023] x3 : ffffffc1ebade188 x2 : ffffffc1f66c8140
[Mon Aug 7 23:15:44 2023] x1 : ffffffc1f66c81c0 x0 : 00000000ffffff92
[Mon Aug 7 23:15:44 2023] Process activity_detect (pid: 11080, stack limit = 0xffffffc0f3b30000)
[Mon Aug 7 23:15:44 2023] Call trace:
[Mon Aug 7 23:15:44 2023] [<ffffff80f3143d48>] 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] ---[ end trace 100c92822440d0e8 ]---
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:57 2023] net_ratelimit: 1299160 callbacks suppressed
These kevent 2 messages have been appearing for some time, both before and after the kernel Oops. Could be related or not.
Any tips or suggestions on how to debug this further are appreciated. Thank you everyone in advance.
We have investigated further and we are fairly confident that the two issues aren’t related. Not only do they happen independently, but we also see the kevent 2 issue on many other systems that do not experience the kernel Oops.
Do you have any clues on what could be causing it, or anything we can do to investigate further? We will be investigating the kevent 2 issue in parallel, so here I would like to focus on the memory problem.
The address is virtual, and likely user space (at least the trigger is likely user space). Unfortunately, the backtrace/stack frame looks stripped, but a PID is given. By the time of failure that PID has probably been killed, but if this is repeatable you might get lucky: keep a log of PIDs and then match the killed PID to what was running. This would be your best clue. Example to log PIDs:
ps aux 2>&1 | tee log_pid.txt
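As a rough sketch of that idea (the log path and interval below are just placeholders, adjust to whatever your disks can tolerate), a background loop could snapshot the process list periodically with a timestamp, so that after an Oops you can look up which process owned the PID mentioned in the log:

    while true; do
        # timestamp each snapshot so it can be matched against dmesg -T output
        date >> /var/log/pid_snapshot.log
        ps aux >> /var/log/pid_snapshot.log
        sleep 60
    done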
Regarding the first comment, unfortunately I checked and in neither case are there error messages above the Oops. I know the processes that triggered this message (record or activity) and they are just ROS nodes that need memory like many others we have. They are fairly simple and run without issues on many other systems, so it seems more like a system issue and less of a problem with those nodes.
Also, @linuxdev, correct me if I’m wrong, but I believe this is kernel space given the address, no? Unfortunately it’s not easily repeatable, and the PIDs belong to different processes that shouldn’t be the issue, as I mentioned above in this comment.
This is occurring in the kernel, but notice this excerpt: “Unable to handle kernel paging request at virtual address”.
That means the address is not physical. Typically, when a driver works with a device, it uses a physical address. There is conversion between physical and virtual in many cases, but because it is talking about virtual, I would expect this originated (triggered) due to something in user space. This makes me think it is a data driven error, and not directly related to the driver.
I will investigate further (I was on vacation for the last 2 weeks).
In the meantime, another error occurred:
Oct 01 01:00:03 hive41 kernel: Unable to handle kernel paging request at virtual address eed5200000090404
Oct 01 01:00:03 hive41 kernel: Mem abort info:
Oct 01 01:00:03 hive41 kernel: ESR = 0x96000004
Oct 01 01:00:03 hive41 kernel: Exception class = DABT (current EL), IL = 32 bits
Oct 01 01:00:03 hive41 kernel: SET = 0, FnV = 0
Oct 01 01:00:03 hive41 kernel: EA = 0, S1PTW = 0
Oct 01 01:00:03 hive41 kernel: Data abort info:
Oct 01 01:00:03 hive41 kernel: ISV = 0, ISS = 0x00000004
Oct 01 01:00:03 hive41 kernel: CM = 0, WnR = 0
Oct 01 01:00:03 hive41 kernel: [eed5200000090404] address between user and kernel address ranges
Oct 01 01:00:03 hive41 kernel: Internal error: Oops: 96000004 [#1] PREEMPT SMP
Oct 01 01:00:03 hive41 kernel: Modules linked in: macvlan veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_mark ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip
Oct 01 01:00:03 hive41 kernel: CPU: 0 PID: 27109 Comm: prometheus Not tainted 4.9.201-tegra #1
Oct 01 01:00:03 hive41 kernel: Hardware name: quill (DT)
Oct 01 01:00:03 hive41 kernel: task: ffffffc1be853800 task.stack: ffffffc1cf01c000
Oct 01 01:00:03 hive41 kernel: PC is at futex_wake+0xcc/0x170
Oct 01 01:00:03 hive41 kernel: LR is at futex_wake+0xac/0x170
Oct 01 01:00:03 hive41 kernel: pc : [<ffffff800814c29c>] lr : [<ffffff800814c27c>] pstate: a0400145
Oct 01 01:00:03 hive41 kernel: sp : ffffffc1cf01fce0
Oct 01 01:00:03 hive41 kernel: x29: ffffffc1cf01fce0 x28: ffffffc1be853800
Oct 01 01:00:03 hive41 kernel: x27: ffffff8008f52000 x26: 0000000000000062
Oct 01 01:00:03 hive41 kernel: x25: ffffffc1ebaa7c84 x24: ffffffc1ebaa7c88
Oct 01 01:00:03 hive41 kernel: x23: 0000000000000001 x22: 00000000ffffffff
Oct 01 01:00:03 hive41 kernel: x21: 0000000000000000 x20: ffffffc1cf01fd38
Oct 01 01:00:03 hive41 kernel: x19: eed52000000903ec x18: 0000000000000000
Oct 01 01:00:03 hive41 kernel: x17: 0000004000591670 x16: ffffff800814e7f8
Oct 01 01:00:03 hive41 kernel: x15: 00001acfca174326 x14: 00110bf9e17968b8
Oct 01 01:00:03 hive41 kernel: x13: 000000006518c493 x12: 0000000000000017
Oct 01 01:00:03 hive41 kernel: x11: 00000000275a5e37 x10: 0000000000162a9b
Oct 01 01:00:03 hive41 kernel: x9 : 0000000000000004 x8 : 0000000000000062
Oct 01 01:00:03 hive41 kernel: x7 : ffffff8009e87f50 x6 : ffffff8009e87000
Oct 01 01:00:03 hive41 kernel: x5 : 00000000a8b01bdb x4 : 00000000000007ff
Oct 01 01:00:03 hive41 kernel: x3 : ffffffc0cc4ffc28 x2 : eed52000000903ec
Oct 01 01:00:03 hive41 kernel: x1 : f0f0000034270000 x0 : eed52000000903ec
Oct 01 01:00:03 hive41 kernel:
Oct 01 01:00:03 hive41 kernel: Process prometheus (pid: 27109, stack limit = 0xffffffc1cf01c000)
Oct 01 01:00:03 hive41 kernel: Call trace:
Oct 01 01:00:03 hive41 kernel: [<ffffff800814c29c>] futex_wake+0xcc/0x170
Oct 01 01:00:03 hive41 kernel: [<ffffff800814e1a0>] do_futex+0x260/0x8b8
Oct 01 01:00:03 hive41 kernel: [<ffffff800814e8c0>] SyS_futex+0xc8/0x168
Oct 01 01:00:03 hive41 kernel: [<ffffff800808395c>] __sys_trace_return+0x0/0x4
Oct 01 01:00:03 hive41 kernel: ---[ end trace 079a1223eb1fa482 ]---
Oct 01 01:00:03 hive41 kernel: note: prometheus[27109] exited with preempt_count 1
prometheus is one of the Docker containers that we have running on this system. At least now we have a call trace, but I suspect it won’t reveal much, considering we have the same software running on dozens of other systems without this problem.
We will have someone with physical access to this system later this week. Is there anything we can do to either investigate this issue further or possibly solve it? On another device I would reseat the memory sticks, but obviously that doesn’t apply here.
There isn’t a lot that can be said about this. I’ll add some notes, but they’re probably not too useful…
A futex tends to be something user space works with for efficiency when dealing with threads or processes where something has to be atomic. In the uncontended case, when only the futex word in user space is being manipulated, no system call is made at all. However, we see that this is a wake-up, which is the case where the futex does make a system call from user space into kernel space. Absolutely nothing is said about what is doing the wake-up, nor about what it was in the futex code within kernel space that failed to wake.
I don’t really do much with containers, but I could see the possibility from the above that the container is failing to grant access to the particular system call that wakes something up. That could be something inside the container, or it could be between container and host kernel. Short of getting a PID or at least some hint at the program triggering the failure (note that the trigger isn’t necessarily the cause, especially if it is a container issue), about the only thing you could do is to randomly run everything in user space via strace. You’d fill up a petabyte hard drive with system call messages in very short order, so I wouldn’t consider this practical unless you can narrow it down a lot.
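If you do manage to narrow it down to one container or process, one way to keep the strace output from exploding (a sketch only; the PID and output file name are placeholders) is to attach to the suspect and trace just the futex-related calls:

    strace -f -e trace=futex -o futex_trace.log -p <PID>

The -f matters here because futex wake-ups almost always involve multiple threads, and it follows threads and forked children of the attached process.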
I suppose if someone had proper JTAG or DSTREAM debugging you could go right to the problem. However, those are not something most people have access to. For what reason are you using a container? The container itself might not be configured correctly to allow some system call, or the container might have a bug. It’s really difficult to say more than “it’s broken” (which you already know).