Jetson AGX Xavier eqos ethernet driver causing kernel panic

I’m using an AGX Xavier with L4T 32.6.1 and am running into a kernel panic in the eqos Ethernet driver (eth0). A GigE camera is connected to this interface and configured to send jumbo packets with a 4000-byte packet size. The kernel panic occurs after a random amount of time, and I can only sometimes capture the dmesg output before the system reboots (kern.log and syslog don’t contain the error, probably because the system reboots before the file writes are flushed). The output I was able to capture is below.

[  182.428773] ------------[ cut here ]------------
[  182.428786] kernel BUG at /root/trunk_t186_t194_32.6.1/Linux_for_Tegra/sources/kernel/kernel-4.9/mm/slub.c:3919!
[  182.429008] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  182.429132] Modules linked in: xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter can_raw can mttcan can_dev overlay zram userspace_alert nvgpu nfsd nfs_acl ip_tables x_tables
[  182.429895] CPU: 2 PID: 8660 Comm: arv_gv_stream Not tainted 4.9.253-tegra #1
[  182.430061] Hardware name: jetson-xavier (DT)
[  182.430116] ------------[ cut here ]------------
[  182.430134] WARNING: CPU: 0 PID: 3 at /root/trunk_t186_t194_32.6.1/Linux_for_Tegra/sources/kernel/nvidia/drivers/net/ethernet/nvidia/eqos/desc.c:387 desc_alloc_skb.isra.6+0x13c/0x1c8
[  182.430182] Modules linked in: xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter can_raw can mttcan can_dev overlay zram userspace_alert nvgpu nfsd nfs_acl ip_tables x_tables

[  182.430197] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.9.253-tegra #1
[  182.430199] Hardware name: jetson-xavier (DT)
[  182.430208] task: ffffffc7dc771c00 task.stack: ffffffc7dbc14000
[  182.430219] PC is at desc_alloc_skb.isra.6+0x13c/0x1c8
[  182.430222] LR is at eqos_re_alloc_skb+0x68/0x108
[  182.430225] pc : [<ffffff800891e4e4>] lr : [<ffffff800891e938>] pstate: 20c00045
[  182.430227] sp : ffffffc7dbc17b60
[  182.430242] x29: ffffffc7dbc17b60 x28: ffffffc7d7b08900
[  182.430247] x27: ffffffc7d7b0c000 x26: ffffffc7c7fb4900
[  182.430252] x25: 0000000002080020 x24: 0000000000000000
[  182.430257] x23: 000000005dd02042 x22: ffffffc7c7fb4848
[  182.430261] x21: ffffffc7c7fb4840 x20: ffffffc7d7b08900
[  182.430266] x19: ffffffc7ccb89c00 x18: 0000000000000400
[  182.430270] x17: 0000000000000002 x16: 0000000000000003
[  182.430275] x15: ffffffc7db23f028 x14: 0000000000000001
[  182.430279] x13: 0000000000000000 x12: 0000000000544846
[  182.430284] x11: ffffff80091b45f0 x10: ffffff8009873118
[  182.430294] x9 : 000000005df94000 x8 : 0000000000000001
[  182.430298] x7 : 0000000000757dbc x6 : 0000000000000000
[  182.430302] x5 : 0000000000000000 x4 : 0000000000000000
[  182.430307] x3 : 0000000002080020 x2 : ffffffc7c7fb4848
[  182.430311] x1 : ffffffc7c7fb4840 x0 : 0000000000000f84

[  182.430314] ---[ end trace c0853cce0ae8af66 ]---
[  182.430317] Call trace:
[  182.430325] [<ffffff800891e4e4>] desc_alloc_skb.isra.6+0x13c/0x1c8
[  182.430330] [<ffffff800891e938>] eqos_re_alloc_skb+0x68/0x108
[  182.430334] [<ffffff8008919974>] eqos_napi_poll_rx+0x2dc/0x4f8
[  182.430351] [<ffffff8008d989e4>] net_rx_action+0xf4/0x358
[  182.430360] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[  182.430365] [<ffffff80080b9db0>] run_ksoftirqd+0x48/0x58
[  182.430370] [<ffffff80080dfa38>] smpboot_thread_fn+0x160/0x248
[  182.430374] [<ffffff80080db09c>] kthread+0xec/0xf0
[  182.430377] [<ffffff80080838a0>] ret_from_fork+0x10/0x30
[  182.437481] ------------[ cut here ]------------
[  182.437488] kernel BUG at /root/trunk_t186_t194_32.6.1/Linux_for_Tegra/sources/kernel/kernel-4.9/net/core/skbuff.c:1444!
[  182.637287] task: ffffffc78ddeb800 task.stack: ffffffc7cb6b0000
[  182.643149] PC is at kfree+0x254/0x2a8
[  182.646994] LR is at skb_free_head+0x28/0x48
[  182.651197] pc : [<ffffff8008232e4c>] lr : [<ffffff8008d7e6b0>] pstate: 40400145
[  182.658633] sp : ffffffc7cb6b3b70
[  182.661958] x29: ffffffc7cb6b3b70 x28: ffffffc634925f00
[  182.667903] x27: 0000000000000f84 x26: 0000000000000f84
[  182.673501] x25: 0000000000000000 x24: 0000000000000000
[  182.679103] x23: 0000000000000040 x22: ffffffc6122aa000
[  182.684701] x21: ffffffc634925f00 x20: ffffff8008d7e6b0
[  182.690049] x19: ffffffbf1848aa80 x18: 000000000000032d
[  182.695990] x17: 0000007f48c2b030 x16: ffffff8008d775d8
[  182.701764] x15: 00001106f0000000 x14: 60ee826b07835afb
[  182.707452] x13: 8065e9946475746f x12: 8185712f775f6d80
[  182.712964] x11: 59be9165b1786680 x10: 8169327c681d8069
[  182.718565] x9 : a18c67d27c682878 x8 : 68ea7c65c47e6a62
[  182.724340] x7 : 000000000007d653 x6 : 0000007e60001aa4
[  182.729853] x5 : 0000007e60001aa4 x4 : 0000000000000004
[  182.734943] x3 : ffffffc6122aa000 x2 : 0000000000001ec0
[  182.740276] x1 : 0000000000000000 x0 : 0000000000000000

[  182.747023] Process arv_gv_stream (pid: 8660, stack limit = 0xffffffc7cb6b0000)
[  182.754179] Call trace:
[  182.756547] [<ffffff8008232e4c>] kfree+0x254/0x2a8
[  182.761097] [<ffffff8008d7e6b0>] skb_free_head+0x28/0x48
[  182.765911] [<ffffff8008d7efc8>] skb_release_data+0x100/0x130
[  182.771247] [<ffffff8008d7f028>] skb_release_all+0x30/0x40
[  182.776322] [<ffffff8008d7f058>] __kfree_skb+0x20/0x38
[  182.781395] [<ffffff8008d874c0>] __skb_free_datagram_locked+0x90/0x118
[  182.787260] [<ffffff8008e18314>] udp_recvmsg+0x354/0x630
[  182.792337] [<ffffff8008e25724>] inet_recvmsg+0xb4/0xd8
[  182.797407] [<ffffff8008d74a30>] sock_recvmsg+0x58/0x68
[  182.802223] [<ffffff8008d77680>] SyS_recvfrom+0xa8/0x120
[  182.807300] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[  182.812465] ---[ end trace c0853cce0ae8af67 ]---
[  182.825242] Internal error: Oops - BUG: 0 [#2] PREEMPT SMP

Interestingly, reducing the packet size sent by the camera to non-jumbo packets (I tried 1250 bytes) makes the kernel panic occur more consistently, usually within a minute or two.

[  241.266669] ------------[ cut here ]------------
[  241.266678] kernel BUG at /root/trunk_t186_t194_32.6.1/Linux_for_Tegra/sources/kernel/kernel-4.9/mm/slub.c:3919!
[  241.266878] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  241.266978] Modules linked in: can_raw can mttcan can_dev xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter zram overlay userspace_alert nvgpu nfsd nfs_acl ip_tables x_tables
[  241.267611] CPU: 3 PID: 10337 Comm: arv_gv_stream Not tainted 4.9.253-tegra #1
[  241.267730] Hardware name: jetson-xavier (DT)
[  241.267809] task: ffffffc79a47f000 task.stack: ffffffc79a554000
[  241.267914] PC is at kfree+0x254/0x2a8
[  241.267985] LR is at skb_free_head+0x28/0x48
[  241.268195] pc : [<ffffff8008232e4c>] lr : [<ffffff8008d7e6b0>] pstate: 40400145
[  241.268758] sp : ffffffc79a557b70
[  241.269021] x29: ffffffc79a557b70 x28: ffffffc7c0647c00
[  241.269458] x27: 00000000000004c6 x26: 00000000000004c6
[  241.269889] x25: 0000000000000000 x24: 0000000000000000
[  241.274719] x23: 0000000000000040
[  241.275661] cache_from_obj: Wrong slab cache. kmalloc-256 but object is from UDP
[  241.275678] ------------[ cut here ]------------
[  241.275702] WARNING: CPU: 1 PID: 10323 at /root/trunk_t186_t194_32.6.1/Linux_for_Tegra/sources/kernel/kernel-4.9/mm/slab.h:354 kmem_cache_free+0x1cc/0x2e0
[  241.275704] Modules linked in:
[  241.275713]  can_raw
[  241.275717]  can
[  241.275719]  mttcan
[  241.275725]  can_dev
[  241.275727]  xt_conntrack
[  241.275728]  ipt_MASQUERADE
[  241.275731]  nf_nat_masquerade_ipv4
[  241.275732]  nf_conntrack_netlink
[  241.275734]  nfnetlink
[  241.275736]  xt_addrtype
[  241.275737]  iptable_filter
[  241.275739]  iptable_nat
[  241.275740]  nf_conntrack_ipv4
[  241.275742]  nf_defrag_ipv4
[  241.275743]  nf_nat_ipv4
[  241.275744]  nf_nat
[  241.275746]  nf_conntrack
[  241.275748]  br_netfilter
[  241.275749]  zram
[  241.275751]  overlay
[  241.275753]  userspace_alert
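
Since kern.log and syslog lose the trace when the board reboots, one way to keep more of it is to copy kernel log records to a file and force each one to disk as it arrives. The following is a minimal sketch, not a tested procedure for this board: it assumes the kernel exposes `/dev/kmsg`, that the script runs as root, and that `kmsg_capture.log` (a hypothetical file name) is on persistent storage. It can still miss the last lines before a hard panic; a serial console is usually more reliable for that.

```python
import os


def stream_kernel_log(src_path="/dev/kmsg", dst_path="kmsg_capture.log",
                      max_records=None):
    """Copy kernel log records to dst_path, syncing each record to disk.

    src_path and dst_path are assumptions for illustration. On /dev/kmsg,
    each read() returns one log record and blocks until a new one arrives.
    """
    copied = 0
    src = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "ab") as dst:
            while max_records is None or copied < max_records:
                try:
                    record = os.read(src, 8192)
                except BrokenPipeError:
                    # Records were overwritten in the ring buffer before
                    # being read; the next read resumes at the oldest one.
                    continue
                if not record:
                    break  # End of input (only happens for regular files).
                dst.write(record)
                dst.flush()
                os.fsync(dst.fileno())  # Push the record to storage now.
                copied += 1
    finally:
        os.close(src)
    return copied
```

Calling `stream_kernel_log()` in a background terminal before starting the camera stream would leave whatever was captured in `kmsg_capture.log` after the reboot.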

Any tips or help are appreciated.

Please check the thread below to see if it helps: High MTU causes Kernel Panic - #18 by k-hamada

Hi,

If you want to find the cause of the kernel panic by capturing a precise log, and you can rebuild the kernel, one method is to change the ‘BUG_ON’ macro to ‘WARN_ON’ so that the kernel logs a warning instead of panicking.

https://lore.kernel.org/lkml/Pine.LNX.4.64.0901141719560.12990@melkki.cs.Helsinki.FI/
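
The ‘BUG_ON’ to ‘WARN_ON’ rewrite can be sketched as a small text substitution applied to a copy of the kernel source file before rebuilding. This is only an illustration: the function name is hypothetical, which file to patch is left to the reader, and a kernel that keeps running past a failed check may corrupt memory, so this is suitable for debugging only.

```python
import re

# Match BUG_ON( as a whole identifier, so VM_BUG_ON( and BUG() are left alone.
_BUG_ON = re.compile(r"\bBUG_ON\s*\(")


def bug_on_to_warn_on(source: str) -> str:
    """Return the C source text with every BUG_ON(...) turned into WARN_ON(...)."""
    return _BUG_ON.sub("WARN_ON(", source)
```

For example, `bug_on_to_warn_on("BUG_ON(!ptr);")` returns `"WARN_ON(!ptr);"`.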

However, I recommend using a V5 kernel instead of V4.

Hi, Mr. kayccc

Should I spend time on debugging this?

The solution outlined in that link does appear to work. I was originally skeptical because the solution is a patch for 32.5, and the poster says to “wait for 32.6.1” for it (I’m running 32.6.1).

After looking at the kernel source for 32.6.1 and comparing it with 32.5, it looks like 32.6.1 does not have the patch. I rebuilt the kernel to include the patch but ran into other issues (likely due to the carrier board mods I need). After upgrading to 32.7.2, this particular issue appears to be resolved. The 32.7.2 kernel source does have the patch.
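
A comparison like the one described above can be sketched with a unified diff of the same file from two extracted source trees. The directory names in the usage note are hypothetical, and the relative path is only taken from the warning in the log (`eqos/desc.c`); the patch may touch other files.

```python
import difflib
from pathlib import Path


def diff_source_file(old_root, new_root, relative_path):
    """Return unified-diff lines for one file present in two source trees."""
    old_file = Path(old_root) / relative_path
    new_file = Path(new_root) / relative_path
    old_lines = old_file.read_text(errors="replace").splitlines(keepends=True)
    new_lines = new_file.read_text(errors="replace").splitlines(keepends=True)
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=str(old_file), tofile=str(new_file),
    ))
```

With the sources unpacked into, say, `l4t-32.6.1/sources` and `l4t-32.7.2/sources`, calling `diff_source_file` on `kernel/nvidia/drivers/net/ethernet/nvidia/eqos/desc.c` returns an empty list if the file is identical in both releases, and the changed lines otherwise.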

What carrier board are you currently using?