Kernel panic on Jetpack 4.6.1 Xavier NX

Hello, I occasionally get a kernel panic on my Jetson Xavier NX that causes the board to reboot. I was able to capture one call trace with the serial console:

[  318.293272] WARNING: CPU: 2 PID: 0 at /home/nvidia/nvidia/nvidia_sdk/JetPack_4.6.1_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  318.293870] ---[ end trace f900c12b4190c6b7 ]---
[  318.294113] igb 0004:04:00.0 eth1: Reset adapter
[  328.445045] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[  328.445566] Kernel panic - not syncing: softlockup: hung tasks
[  328.445684] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O L  4.9.253-tegra #2
[  328.445826] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[  328.445953] Call trace:
[  328.446016] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
[  328.446124] [<ffffff800808c004>] show_stack+0x24/0x30
[  328.446229] [<ffffff8008f2e574>] dump_stack+0xa0/0xc4
[  328.446329] [<ffffff8008f2bba0>] panic+0x12c/0x2a8
[  328.446427] [<ffffff8008180ad0>] watchdog_unpark_threads+0x0/0x98
[  328.446545] [<ffffff8008138cb0>] __hrtimer_run_queues+0xd8/0x360
[  328.446654] [<ffffff8008139600>] hrtimer_interrupt+0xa8/0x1e0
[  328.446770] [<ffffff8008bca140>] arch_timer_handler_phys+0x38/0x58
[  328.446886] [<ffffff80081261b0>] handle_percpu_devid_irq+0x90/0x2b0
[  328.447238] [<ffffff8008120694>] generic_handle_irq+0x34/0x50
[  328.447673] [<ffffff8008120d80>] __handle_domain_irq+0x68/0xc0
[  328.448118] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  328.451684] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  328.456836] [<ffffff80080ba0b0>] irq_exit+0xd0/0x118
[  328.461383] [<ffffff8008120d84>] __handle_domain_irq+0x6c/0xc0
[  328.467505] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  328.472580] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  328.477654] [<ffffff8008b70058>] cpuidle_enter_state+0xb8/0x380
[  328.483688] [<ffffff8008b70394>] cpuidle_enter+0x34/0x48
[  328.489202] [<ffffff80081111a4>] call_cpuidle+0x44/0x70
[  328.494194] [<ffffff8008111520>] cpu_startup_entry+0x1b0/0x200
[  328.500402] [<ffffff8008f310f4>] rest_init+0x84/0x90
[  328.505136] [<ffffff80095f0b68>] start_kernel+0x374/0x38c
[  328.510901] [<ffffff80095f0204>] __primary_switched+0x80/0x94
[  328.516253] SMP: stopping secondary CPUs
[  328.520357] Kernel Offset: disabled
[  328.523676] Memory Limit: none
[  328.526827] trusty-log panic notifier - trusty version Built: 08:57:16 Feb 19 2022
[  328.547192] Rebooting in 5 seconds..
Shutdown state requested 1
Rebooting system ...

How can I investigate further to find out where the CPU was stuck?

Thank you for your help!
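
A first step that needs no extra tooling is to line up the timestamps already in the trace. The sketch below is only an illustration: the marker list and the function names are my own choices, and the sample path is shortened. It pulls out the key events and subtracts the "stuck for Ns" duration from the soft lockup report time to estimate when the CPU stopped scheduling.

```python
import re

# Matches the "[  318.293272] message" prefix the kernel puts on each line.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")
# Matches the CPU number and duration reported by the soft lockup detector.
STUCK_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")

# Substrings worth flagging; an illustrative choice, not an exhaustive list.
MARKERS = ("WARNING:", "Reset adapter", "soft lockup", "Kernel panic")


def timeline(log_text):
    """Return (timestamp, message) pairs for log lines containing a marker."""
    events = []
    for raw in log_text.splitlines():
        m = LINE_RE.search(raw)
        if m and any(k in m.group(2) for k in MARKERS):
            events.append((float(m.group(1)), m.group(2)))
    return events


def lockup_start(log_text):
    """Estimate, per CPU, when it got stuck: report time minus stuck duration."""
    estimates = {}
    for ts, msg in timeline(log_text):
        m = STUCK_RE.search(msg)
        if m:
            estimates[int(m.group(1))] = ts - int(m.group(2))
    return estimates


# Lines copied from the trace in this thread (source path shortened to "...").
SAMPLE = """\
[  318.293272] WARNING: CPU: 2 PID: 0 at .../net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  318.294113] igb 0004:04:00.0 eth1: Reset adapter
[  328.445045] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[  328.445566] Kernel panic - not syncing: softlockup: hung tasks
"""

if __name__ == "__main__":
    for ts, msg in timeline(SAMPLE):
        print(f"{ts:10.3f}  {msg[:60]}")
    print(lockup_start(SAMPLE))
```

For this trace the estimate puts CPU 0 stuck from roughly 306.4 s, about 12 s before the eth1 reset at 318.3 s. The reported duration is approximate, so treat this as a hint that the stall may have begun before the adapter reset, not as proof of cause.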

igb is the driver for Intel gigabit Ethernet devices, so this points at the Intel Ethernet port (eth1). The dev_watchdog warning means the network transmit watchdog fired: a transmit queue on that device timed out, so the device or driver failed to respond in time while being serviced in kernel space, and the driver then reset the adapter. It is hard to say anything more useful from this alone. Consider posting a full serial console boot log (what happens before the panic matters, since it sets up the environment the driver loads into), along with answers to the following:

  • Can you verify this is a dev kit, versus a module plus third party carrier board?
  • Which JetPack or L4T release is this?
  • If this is an SD card model, which SD card image is used, and was the Jetson itself flashed with that release (there is QSPI memory used in boot which would affect the Intel ethernet)?
  • If you have ever changed the device tree or kernel, then the nature of the change would be good to know (and if stock, then that too would be good to know).
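
Some of the details asked for above can be read directly on the running Jetson. This is a minimal sketch, assuming the L4T release file is at /etc/nv_tegra_release (where L4T normally installs it) and that the interface of interest is eth1 as in the trace; the helper names are hypothetical. Missing files are reported as None, so it also runs on a machine that is not a Jetson.

```python
import os
import platform


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.read().strip("\x00\n ")
    except OSError:
        return None


def nic_driver(iface):
    """Return the kernel driver bound to a network interface (via sysfs), or None."""
    link = f"/sys/class/net/{iface}/device/driver"
    if not os.path.exists(link):
        return None
    # The sysfs "driver" entry is a symlink whose target is named after the driver.
    return os.path.basename(os.path.realpath(link))


def collect(iface="eth1"):
    """Gather kernel version, L4T release string, and the NIC driver name."""
    return {
        "kernel": platform.release(),
        # Assumption: L4T writes its release string to this file.
        "l4t_release": read_text("/etc/nv_tegra_release"),
        f"{iface}_driver": nic_driver(iface),
    }


if __name__ == "__main__":
    for key, value in collect().items():
        print(f"{key}: {value}")
```

Pasting this output next to the serial console boot log answers the release question and confirms which driver is bound to the secondary port.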

Thanks for your reply!
We are using a carrier board named “Boson for FRAMOS Carrier Board” with the stock kernel and device tree from JetPack 4.6.1.
It does contain a secondary i210 PCIe Ethernet port, which was plugged in and in use before the crash occurred.

I will try upgrading to JetPack 4.6.2 and installing their BSP, in case they modified the kernel for this issue, and will report back with more information if the problem still occurs.

Thanks for your help!

If the third party carrier board has the same exact lane routing, then you won’t need a new device tree. However, if anything is different, then you will need a new device tree (which can affect the BMP). If the secondary i210 is related to a PCIe lane setup which is not an exact duplicate of the dev kit, then this too would cause a need for a device tree edit.
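
To see which device tree the board actually booted with, the properties under /proc/device-tree can be read at runtime. The sketch below assumes the standard "model" property plus an "nvidia,dtsfilename" property, which L4T kernels generally expose but which I have not verified for this particular BSP; the function names are hypothetical. Note that the "Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)" line in the panic is the model string of the loaded device tree, which fits the stock dev kit tree being used on the carrier board.

```python
import os


def decode_dt_string(raw):
    """Device tree string properties are NUL-terminated bytes; return clean text."""
    return raw.rstrip(b"\x00").decode("utf-8", errors="replace")


def dt_property(name, root="/proc/device-tree"):
    """Read a string property from the live device tree, or None if absent."""
    try:
        with open(os.path.join(root, name), "rb") as f:
            return decode_dt_string(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("model:", dt_property("model"))
    # Assumption: this NVIDIA-specific property names the source .dts file.
    print("dts file:", dt_property("nvidia,dtsfilename"))
```

If the reported source file is the dev kit one while the carrier routes its PCIe lanes differently, that mismatch is what a vendor device tree would be expected to correct.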