I’ve had two kernel Oopses in the last month or so on one of my devices. I have about 20 other devices running the same code that haven’t experienced this. In both cases there seems to be an error when trying to access a virtual memory address, and the call trace is only that address. This doesn’t happen during boot or anything; it just occurs after some time, and it seems to be rather rare/infrequent, considering these systems are on and running 24/7.
These are Jetson TX2s running Ubuntu 18.04.6 LTS (GNU/Linux 4.9.201-tegra aarch64)
Output from dmesg -T:
Incident #1
[Mon Aug 7 23:15:44 2023] Unable to handle kernel paging request at virtual address ffffff80f3143d48
[Mon Aug 7 23:15:44 2023] Mem abort info:
[Mon Aug 7 23:15:44 2023] ESR = 0x86000005
[Mon Aug 7 23:15:44 2023] Exception class = IABT (current EL), IL = 32 bits
[Mon Aug 7 23:15:44 2023] SET = 0, FnV = 0
[Mon Aug 7 23:15:44 2023] EA = 0, S1PTW = 0
[Mon Aug 7 23:15:44 2023] swapper pgtable: 4k pages, 39-bit VAs, pgd = ffffff800a1fb000
[Mon Aug 7 23:15:44 2023] [ffffff80f3143d48] *pgd=0000000000000000, *pud=0000000000000000
[Mon Aug 7 23:15:44 2023] Internal error: Oops: 86000005 [#1] PREEMPT SMP
[Mon Aug 7 23:15:44 2023] Modules linked in: macvlan veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_mark ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables xt_nat xt_tcpudp xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter zram smsc95xx ftdi_sio overlay cdc_acm bcmdhd cfg80211 userspace_alert nvgpu bluedroid_pm ip_tables x_tables
[Mon Aug 7 23:15:44 2023] CPU: 4 PID: 11080 Comm: activity_detect Not tainted 4.9.201-tegra #1
[Mon Aug 7 23:15:44 2023] Hardware name: quill (DT)
[Mon Aug 7 23:15:44 2023] task: ffffffc12598b800 task.stack: ffffffc0f3b30000
[Mon Aug 7 23:15:44 2023] PC is at 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] LR is at 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] pc : [<ffffff80f3143d48>] lr : [<ffffff80f3143d48>] pstate: 40400045
[Mon Aug 7 23:15:44 2023] sp : ffffffc0f3b33d60
[Mon Aug 7 23:15:44 2023] x29: 3dff00c000b30060 x28: ffffffc12598b800
[Mon Aug 7 23:15:44 2023] x27: ffffff8008f52000 x26: 0000000000000062
[Mon Aug 7 23:15:44 2023] x25: ffffffc0f3b33ea8 x24: 0000000000000009
[Mon Aug 7 23:15:44 2023] x23: 0000000000000000 x22: 0000000000000000
[Mon Aug 7 23:15:44 2023] x21: 0000000000000000 x20: 0000000080002102
[Mon Aug 7 23:15:44 2023] x19: ff00ff5508b4e278 x18: 0000000000000000
[Mon Aug 7 23:15:44 2023] x17: 0000000000000000 x16: 0000000000000000
[Mon Aug 7 23:15:44 2023] x15: 0000000000000000 x14: 0000000000620a38
[Mon Aug 7 23:15:44 2023] x13: 000000000000075c x12: 071c71c71c71c71c
[Mon Aug 7 23:15:44 2023] x11: 000000000000000b x10: 0000000000000a10
[Mon Aug 7 23:15:44 2023] x9 : ffffffc0f3b33b30 x8 : ffffffc12598c270
[Mon Aug 7 23:15:44 2023] x7 : fefefeff646c606d x6 : 0000000000000000
[Mon Aug 7 23:15:44 2023] x5 : 0000000000000000 x4 : ffffffc0f3b33cf8
[Mon Aug 7 23:15:44 2023] x3 : ffffffc1ebade188 x2 : ffffffc1f66c8140
[Mon Aug 7 23:15:44 2023] x1 : ffffffc1f66c81c0 x0 : 00000000ffffff92
[Mon Aug 7 23:15:44 2023] Process activity_detect (pid: 11080, stack limit = 0xffffffc0f3b30000)
[Mon Aug 7 23:15:44 2023] Call trace:
[Mon Aug 7 23:15:44 2023] [<ffffff80f3143d48>] 0xffffff80f3143d48
[Mon Aug 7 23:15:44 2023] ---[ end trace 100c92822440d0e8 ]---
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:52 2023] smsc95xx 1-2.2:1.0 eth1: kevent 2 may have been dropped
[Mon Sep 11 09:13:57 2023] net_ratelimit: 1299160 callbacks suppressed
These kevent 2 messages have been appearing for some time, both before and after the kernel Oops. Could be related or not.
Any tips or suggestions on how to debug this further are appreciated. Thank you everyone in advance.
We have investigated further and we are fairly confident that the two issues aren’t related. Not only do they happen independently, but we also see the kevent 2 issue on many other systems that do not experience the kernel Oops.
Do you have any clues on what could be causing it, or anything we can do to investigate further? We will be investigating the kevent 2 issue in parallel, so here I would like to focus on the memory problem.
The address is virtual, and likely user space (at least the trigger is likely user space). Unfortunately, the backtrace/stack frame looks stripped, but a PID is given. By the time of failure that PID has probably been killed, but if this is repeatable you might get lucky: keep a log of PIDs and then match the killed PID to what was running. This would be your best clue. Example to log PIDs:
ps aux 2>&1 | tee log_pid.txt
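As a rough sketch of that idea (the log path and interval below are just placeholders, adjust to whatever your disks can tolerate), a background loop could snapshot the process list periodically with a timestamp, so that after an Oops you can look up which process owned the PID mentioned in the log:

    while true; do
        # timestamp each snapshot so it can be matched against dmesg -T output
        date >> /var/log/pid_snapshot.log
        ps aux >> /var/log/pid_snapshot.log
        sleep 60
    done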
Regarding the first comment, unfortunately I checked and in neither case are there error messages above the Oops. I know the processes that triggered this message (record or activity) and they are just ROS nodes that need memory like many others we have. They are fairly simple and run without issues on many other systems, so it seems more like a system issue and less of a problem with those nodes.
Also, @linuxdev, correct me if I’m wrong, but I believe this is kernel space given the address, no? Unfortunately it’s not easily repeatable, and the PIDs belong to different processes that shouldn’t be the issue, as I mentioned above in this comment.
This is occurring in the kernel, but notice this excerpt: “Unable to handle kernel paging request at virtual address”.
That means the address is not physical. Typically, when a driver works with a device, it uses a physical address. There is conversion between physical and virtual in many cases, but because it is talking about virtual, I would expect this originated (triggered) due to something in user space. This makes me think it is a data driven error, and not directly related to the driver.
I will investigate further (I was on vacation for the last 2 weeks).
In the meantime, another error occurred:
Oct 01 01:00:03 hive41 kernel: Unable to handle kernel paging request at virtual address eed5200000090404
Oct 01 01:00:03 hive41 kernel: Mem abort info:
Oct 01 01:00:03 hive41 kernel: ESR = 0x96000004
Oct 01 01:00:03 hive41 kernel: Exception class = DABT (current EL), IL = 32 bits
Oct 01 01:00:03 hive41 kernel: SET = 0, FnV = 0
Oct 01 01:00:03 hive41 kernel: EA = 0, S1PTW = 0
Oct 01 01:00:03 hive41 kernel: Data abort info:
Oct 01 01:00:03 hive41 kernel: ISV = 0, ISS = 0x00000004
Oct 01 01:00:03 hive41 kernel: CM = 0, WnR = 0
Oct 01 01:00:03 hive41 kernel: [eed5200000090404] address between user and kernel address ranges
Oct 01 01:00:03 hive41 kernel: Internal error: Oops: 96000004 [#1] PREEMPT SMP
Oct 01 01:00:03 hive41 kernel: Modules linked in: macvlan veth ip6t_MASQUERADE nf_nat_masquerade_ipv6 xt_mark ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip
Oct 01 01:00:03 hive41 kernel: CPU: 0 PID: 27109 Comm: prometheus Not tainted 4.9.201-tegra #1
Oct 01 01:00:03 hive41 kernel: Hardware name: quill (DT)
Oct 01 01:00:03 hive41 kernel: task: ffffffc1be853800 task.stack: ffffffc1cf01c000
Oct 01 01:00:03 hive41 kernel: PC is at futex_wake+0xcc/0x170
Oct 01 01:00:03 hive41 kernel: LR is at futex_wake+0xac/0x170
Oct 01 01:00:03 hive41 kernel: pc : [<ffffff800814c29c>] lr : [<ffffff800814c27c>] pstate: a0400145
Oct 01 01:00:03 hive41 kernel: sp : ffffffc1cf01fce0
Oct 01 01:00:03 hive41 kernel: x29: ffffffc1cf01fce0 x28: ffffffc1be853800
Oct 01 01:00:03 hive41 kernel: x27: ffffff8008f52000 x26: 0000000000000062
Oct 01 01:00:03 hive41 kernel: x25: ffffffc1ebaa7c84 x24: ffffffc1ebaa7c88
Oct 01 01:00:03 hive41 kernel: x23: 0000000000000001 x22: 00000000ffffffff
Oct 01 01:00:03 hive41 kernel: x21: 0000000000000000 x20: ffffffc1cf01fd38
Oct 01 01:00:03 hive41 kernel: x19: eed52000000903ec x18: 0000000000000000
Oct 01 01:00:03 hive41 kernel: x17: 0000004000591670 x16: ffffff800814e7f8
Oct 01 01:00:03 hive41 kernel: x15: 00001acfca174326 x14: 00110bf9e17968b8
Oct 01 01:00:03 hive41 kernel: x13: 000000006518c493 x12: 0000000000000017
Oct 01 01:00:03 hive41 kernel: x11: 00000000275a5e37 x10: 0000000000162a9b
Oct 01 01:00:03 hive41 kernel: x9 : 0000000000000004 x8 : 0000000000000062
Oct 01 01:00:03 hive41 kernel: x7 : ffffff8009e87f50 x6 : ffffff8009e87000
Oct 01 01:00:03 hive41 kernel: x5 : 00000000a8b01bdb x4 : 00000000000007ff
Oct 01 01:00:03 hive41 kernel: x3 : ffffffc0cc4ffc28 x2 : eed52000000903ec
Oct 01 01:00:03 hive41 kernel: x1 : f0f0000034270000 x0 : eed52000000903ec
Oct 01 01:00:03 hive41 kernel:
Oct 01 01:00:03 hive41 kernel: Process prometheus (pid: 27109, stack limit = 0xffffffc1cf01c000)
Oct 01 01:00:03 hive41 kernel: Call trace:
Oct 01 01:00:03 hive41 kernel: [<ffffff800814c29c>] futex_wake+0xcc/0x170
Oct 01 01:00:03 hive41 kernel: [<ffffff800814e1a0>] do_futex+0x260/0x8b8
Oct 01 01:00:03 hive41 kernel: [<ffffff800814e8c0>] SyS_futex+0xc8/0x168
Oct 01 01:00:03 hive41 kernel: [<ffffff800808395c>] __sys_trace_return+0x0/0x4
Oct 01 01:00:03 hive41 kernel: ---[ end trace 079a1223eb1fa482 ]---
Oct 01 01:00:03 hive41 kernel: note: prometheus[27109] exited with preempt_count 1
prometheus is one of the Docker containers that we have running on this system. At least now we have a call trace, but I suspect it won’t reveal much, considering we have the same software running on dozens of other systems without this problem.
We will have someone with physical access to this system later this week. Is there anything we can do to either investigate this issue further or possibly solve it? On another device I would reseat the memory sticks, but obviously that doesn’t apply here.
There isn’t a lot that can be said about this. I’ll add some notes, but they’re probably not too useful…
A futex tends to be something user space works with for efficiency when dealing with threads or processes where something has to be atomic. In the uncontended case, when only the futex word in user space is being manipulated, no system call is made at all. However, we see that this is a wake-up, which is the case where the futex does make a system call from user space into kernel space. Absolutely nothing is said about what is doing the wake-up, nor about what it was in the futex code within kernel space that failed to wake.
I don’t really do much with containers, but I could see the possibility from the above that the container is failing to grant access to the particular system call that wakes something up. That could be something inside the container, or it could be between container and host kernel. Short of getting a PID or at least some hint at the program triggering the failure (note that the trigger isn’t necessarily the cause, especially if it is a container issue), about the only thing you could do is to randomly run everything in user space via strace. You’d fill up a petabyte hard drive with system call messages in very short order, so I wouldn’t consider this practical unless you can narrow it down a lot.
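If you do manage to narrow it down to one container or process, one way to keep the strace output from exploding (a sketch only; the PID and output file name are placeholders) is to attach to the suspect and trace just the futex-related calls:

    strace -f -e trace=futex -o futex_trace.log -p <PID>

The -f matters here because futex wake-ups almost always involve multiple threads, and it follows threads and forked children of the attached process.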
I suppose if someone had proper JTAG or DSTREAM debugging you could go right to the problem. However, those are not something most people have access to. For what reason are you using a container? The container itself might not be configured correctly to allow some system call, or the container might have a bug. It’s really difficult to say more than “it’s broken” (which you already know).