Crash of irq/39 (eth0) process

Hello,
We are using a Jetson NX on the devkit carrier board, running L4T 32.6.1 with the RT kernel.
The exact kernel version is 4.9.253-tegra-rt168-tegra.
A gigabit ethernet camera is connected to eth0.
When acquiring at full speed, the irq/39 process sometimes crashes. Here is the trace:

[  270.435021] LR is at irq_thread_dtor+0x2c/0xd8
[  270.435023] pc : [<ffffff80080d7df4>] lr : [<ffffff800811c834>] pstate: 604003c5
[  270.435024] sp : ffffffc1ea0ab790
[  270.435030] x29: ffffffc1ea0ab790 x28: ffffffc1e7b11d80
[  270.435034] x27: ffffffc1f42bc000 x26: ffffffc1eff0cdf8
[  270.435042] x25: ffffff800a0c0178 x24: ffffff8009e95000
[  270.435046] x23: 00000000000003c0 x22: ffffff800a17d140
[  270.435050] x21: 0000000000000000 x20: ffffffc1e7b11d80
[  270.435057] x19: ffffffc1e7b11d80 x18: 0000000000000000
[  270.435061] x17: 0000000000000000 x16: 000000000000000a
[  270.435066] x15: ffffffffffffffff x14: ffffffc199ca1a58
[  270.435072] x13: ffffffc199ca1a54 x12: 0000000000000038
[  270.435076] x11: 0101010101010101 x10: 7f7f7f7f7f7fff7f
[  270.435082] x9 : fefefefeff04dc05 x8 : 7f7f7f7f7f7f7f7f
[  270.435086] x7 : ff09422c32323138 x6 : 0000000000000000
[  270.435093] x5 : 0000000000000000 x4 : ffffffc1e7b12638
[  270.435097] x3 : ffffffc1ea0abe10 x2 : 0000000000000000
[  270.435103] x1 : ffffff800811c808 x0 : 0000000000000000

[  270.435112] Process irq/39-2490000. (pid: 4327, stack limit = 0xffffffc1ea0a8000)
[  270.435114] Call trace:
[  270.435120] [<ffffff80080d7df4>] kthread_data+0x24/0x30
[  270.435125] [<ffffff80080d4efc>] task_work_run+0xbc/0xd8
[  270.435131] [<ffffff80080b3768>] do_exit+0x2e0/0xaa0
[  270.435137] [<ffffff800808bac4>] die+0x194/0x198
[  270.435144] [<ffffff800808bb10>] bug_handler.part.2+0x48/0x88
[  270.435149] [<ffffff800808bb8c>] bug_handler+0x3c/0x48
[  270.435155] [<ffffff8008083e14>] brk_handler+0x7c/0xd0
[  270.435161] [<ffffff8008080c4c>] do_debug_exception+0x84/0x120
[  270.435168] [<ffffff8008082264>] el1_dbg+0x18/0xac
[  270.435176] [<ffffff8008d53870>] skb_split+0x0/0x2f0
[  270.435186] [<ffffff80088df550>] eqos_napi_poll_rx+0x140/0x4f8
[  270.435192] [<ffffff8008d69570>] net_rx_action+0x188/0x3f8
[  270.435199] [<ffffff80080b530c>] do_current_softirqs+0x1cc/0x3c8
[  270.435203] [<ffffff80080b5564>] __local_bh_enable+0x5c/0x70
[  270.435207] [<ffffff800811c734>] irq_forced_thread_fn+0x7c/0xa8
[  270.435213] [<ffffff800811c9f0>] irq_thread+0x110/0x1b0
[  270.435220] [<ffffff80080d71d4>] kthread+0xec/0xf0
[  270.435223] [<ffffff80080830a0>] ret_from_fork+0x10/0x30
[  270.435644] ---[ end trace 0000000000000003 ]---

The crash is easily reproducible. Do you have any ideas for diagnosing the issue further?
Thank you very much for your help!

Hi,

Just to clarify the scenario: without the gigabit ethernet camera, is this issue still reproducible?

Hello,

I haven’t been able to reproduce the crash using iperf to generate gigabit traffic.

Please check whether the ethernet camera is the only way to reproduce this issue.

Also check whether the RT patches are required. For example, maybe the issue can be reproduced even without the RT patches, as long as the ethernet camera is connected?

Hi,

Please also try the following setup:

NX → ethernet hub → ethernet camera. This is how we verified the ethernet camera before.

Hello, I tried without the RT patch for the kernel and I still get a crash (see the dump below).
The gigabit ethernet camera is connected directly to eth0.

Thanks for your help!

[  102.552521] skbuff: skb_over_panic: text:ffffff80088e7e20 len:16370 put:8206 head:ffffffc17910c000 data:ffffffc17910c06c tail:0x405e end:0x3ec0 dev:<NULL>
[  102.552646] ------------[ cut here ]------------
[  102.552654] kernel BUG at /home/ubuntu/nvidia/nvidia_sdk/JetPack_4.6_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/core/skbuff.c:105!
[  102.552675] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[  102.552697] Modules linked in: fuse zram bnep rtk_btusb btusb btrtl btbcm btintel ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack vfat fat rtl8822ce userspace_alert cfg80211 nvgpu ip_tables x_tables
[  102.552941] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.253-tegra-tegra #1
[  102.552953] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[  102.552965] task: ffffff8009ec13c0 task.stack: ffffff8009eb0000
[  102.552984] PC is at skb_panic+0x68/0x70
[  102.552996] LR is at skb_panic+0x68/0x70
[  102.553007] pc : [<ffffff8008d5c380>] lr : [<ffffff8008d5c380>] pstate: 80400045
[  102.553018] sp : ffffffc1ffd26d70
[  102.553028] x29: ffffffc1ffd26d80 x28: ffffffc1e3ba0900
[  102.553057] x27: ffffffc1e3ba4000 x26: ffffffc1d8ba0018
[  102.553085] x25: ffffffc1a3528b00 x24: 000000003401200e
[  102.553110] x23: ffffff8008009010 x22: 000000000000200e
[  102.553136] x21: ffffff80088e7e20 x20: ffffff80091a1070
[  102.553161] x19: ffffffc1a3528b00 x18: 0000000000000010
[  102.553187] x17: 0000000000000002 x16: 0000000000000003
[  102.553212] x15: ffffffffffffffff x14: 3a76656420306365
[  102.553237] x13: 3378303a646e6520 x12: 6535303478303a6c
[  102.553262] x11: 6961742063363063 x10: 00000000000003f6
[  102.553289] x9 : 666666663a617461 x8 : ffffff80083a7e58
[  102.553326] x7 : ffffff8009f04258 x6 : ffffffc1ffd27bf0
[  102.553353] x5 : 0000000000000001 x4 : 0000000000000000
[  102.553378] x3 : ffffff8009ebb2c0 x2 : 0000000000000040
[  102.553404] x1 : ffffff8009ec13c0 x0 : 000000000000008e
[  102.553439] Process swapper/0 (pid: 0, stack limit = 0xffffff8009eb0000)
[  102.553450] Call trace:
[  102.553469] [<ffffff8008d5c380>] skb_panic+0x68/0x70
[  102.553484] [<ffffff8008d5e6e0>] skb_split+0x0/0x2f0
[  102.553499] [<ffffff80088e7e20>] eqos_napi_poll_rx+0x140/0x4f8
[  102.553516] [<ffffff8008d740dc>] net_rx_action+0xf4/0x358
[  102.553530] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[  102.553544] [<ffffff80080ba090>] irq_exit+0xd0/0x118
[  102.553559] [<ffffff8008120d64>] __handle_domain_irq+0x6c/0xc0
[  102.553570] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  102.553582] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  102.553596] [<ffffff8008b61338>] cpuidle_enter_state+0xb8/0x380
[  102.553607] [<ffffff8008b61674>] cpuidle_enter+0x34/0x48
[  102.553621] [<ffffff8008111184>] call_cpuidle+0x44/0x70
[  102.553632] [<ffffff8008111500>] cpu_startup_entry+0x1b0/0x200
[  102.553646] [<ffffff8008f2611c>] rest_init+0x84/0x90
[  102.553665] [<ffffff8009640b68>] start_kernel+0x374/0x38c
[  102.553678] [<ffffff8009640204>] __primary_switched+0x80/0x94
[  102.553698] ---[ end trace 083dea94416db81b ]---
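
For anyone reading the dump above, the skb_over_panic line already carries the buffer geometry. Below is a minimal Python sketch that decodes it. It assumes that tail and end are byte offsets from head (the numbers are consistent with that, since data - head + len equals tail) and that len is reported after the put was applied. The helper name parse_skb_over_panic is made up for illustration.

```python
import re

# First line of the second dump, copied verbatim (timestamp dropped).
LINE = (
    "skbuff: skb_over_panic: text:ffffff80088e7e20 len:16370 put:8206 "
    "head:ffffffc17910c000 data:ffffffc17910c06c tail:0x405e end:0x3ec0 dev:<NULL>"
)


def parse_skb_over_panic(line):
    """Decode the key:value fields of an skb_over_panic line (hypothetical helper)."""
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    length = int(fields["len"])      # skb length, assumed to already include the put
    put = int(fields["put"])         # bytes the caller tried to append
    head = int(fields["head"], 16)   # start of the allocated buffer
    data = int(fields["data"], 16)   # start of the packet data
    tail = int(fields["tail"], 16)   # assumed: offset of the data end from head
    end = int(fields["end"], 16)     # assumed: offset of the buffer end from head
    return {
        "headroom": data - head,
        "len_before_put": length - put,
        "tail": tail,
        "end": end,
        "overflow": tail - end,
    }


info = parse_skb_over_panic(LINE)
for key, value in info.items():
    print(f"{key}: {value}")
```

On this line the headroom is 108 bytes, the skb already held 8164 bytes before the 8206-byte put, and the resulting tail lands 414 bytes past the end of the buffer.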

Please try our test case as well.

Perhaps it doesn’t mean much, but the stack trace shows the crash in soft IRQ context, not in the hard IRQ handler. That means the ethernet hardware itself isn’t being serviced at that moment; more likely the failure is in packet or protocol processing. If you run “watch -n 1 ifconfig” from a serial console, you might catch error or overrun counters increasing (it only polls once per second, so if the system locks up at the wrong moment you won’t see anything). A violation of a spec or some other odd occurrence on the wire might be involved, rather than the hardware itself.
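
If once per second is too coarse, the same kind of counters can be polled faster straight from sysfs. This is only a rough sketch: it assumes the interface is eth0 and that the driver exposes the usual files under /sys/class/net/<iface>/statistics. The function names are hypothetical, and counters a driver does not provide are simply skipped.

```python
import os
import time

# Receive-side error counters commonly found in sysfs; not every driver fills all of them.
COUNTERS = (
    "rx_errors",
    "rx_dropped",
    "rx_over_errors",
    "rx_length_errors",
    "rx_crc_errors",
    "rx_fifo_errors",
    "rx_missed_errors",
)


def read_counters(iface, root="/sys/class/net"):
    """Return {counter_name: value} for the counters that exist and parse as integers."""
    stats_dir = os.path.join(root, iface, "statistics")
    values = {}
    for name in COUNTERS:
        try:
            with open(os.path.join(stats_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable on this system
    return values


def changed(before, after):
    """Return the counters whose value differs between two snapshots, as deltas."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


def watch(iface="eth0", interval=0.1, iterations=None):
    """Print a line whenever an error counter moves; loops forever if iterations is None."""
    previous = read_counters(iface)
    done = 0
    while iterations is None or done < iterations:
        time.sleep(interval)
        current = read_counters(iface)
        delta = changed(previous, current)
        if delta:
            # flush so the line reaches the serial console before a possible lockup
            print(time.strftime("%H:%M:%S"), delta, flush=True)
        previous = current
        done += 1
```

Calling watch("eth0") from a serial console while the camera is streaming prints only when a counter moves, so the last line before a crash shows which counter changed most recently. It can still miss an event that happens between two polls.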

I tried with L4T 32.7.1 and was unable to reproduce the issue.

Thanks for the help!
