Kernel panic on Jetpack 4.6.1 Xavier NX

Hello, I am occasionally getting a kernel panic on my Jetson Xavier NX that reboots the board. I was able to capture one call trace over the serial console:

[  318.293272] WARNING: CPU: 2 PID: 0 at /home/nvidia/nvidia/nvidia_sdk/JetPack_4.6.1_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  318.293870] ---[ end trace f900c12b4190c6b7 ]---
[  318.294113] igb 0004:04:00.0 eth1: Reset adapter
[  328.445045] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[  328.445566] Kernel panic - not syncing: softlockup: hung tasks
[  328.445684] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O L  4.9.253-tegra #2
[  328.445826] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[  328.445953] Call trace:
[  328.446016] [<ffffff800808ba40>] dump_backtrace+0x0/0x198
[  328.446124] [<ffffff800808c004>] show_stack+0x24/0x30
[  328.446229] [<ffffff8008f2e574>] dump_stack+0xa0/0xc4
[  328.446329] [<ffffff8008f2bba0>] panic+0x12c/0x2a8
[  328.446427] [<ffffff8008180ad0>] watchdog_unpark_threads+0x0/0x98
[  328.446545] [<ffffff8008138cb0>] __hrtimer_run_queues+0xd8/0x360
[  328.446654] [<ffffff8008139600>] hrtimer_interrupt+0xa8/0x1e0
[  328.446770] [<ffffff8008bca140>] arch_timer_handler_phys+0x38/0x58
[  328.446886] [<ffffff80081261b0>] handle_percpu_devid_irq+0x90/0x2b0
[  328.447238] [<ffffff8008120694>] generic_handle_irq+0x34/0x50
[  328.447673] [<ffffff8008120d80>] __handle_domain_irq+0x68/0xc0
[  328.448118] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  328.451684] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  328.456836] [<ffffff80080ba0b0>] irq_exit+0xd0/0x118
[  328.461383] [<ffffff8008120d84>] __handle_domain_irq+0x6c/0xc0
[  328.467505] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  328.472580] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  328.477654] [<ffffff8008b70058>] cpuidle_enter_state+0xb8/0x380
[  328.483688] [<ffffff8008b70394>] cpuidle_enter+0x34/0x48
[  328.489202] [<ffffff80081111a4>] call_cpuidle+0x44/0x70
[  328.494194] [<ffffff8008111520>] cpu_startup_entry+0x1b0/0x200
[  328.500402] [<ffffff8008f310f4>] rest_init+0x84/0x90
[  328.505136] [<ffffff80095f0b68>] start_kernel+0x374/0x38c
[  328.510901] [<ffffff80095f0204>] __primary_switched+0x80/0x94
[  328.516253] SMP: stopping secondary CPUs
[  328.520357] Kernel Offset: disabled
[  328.523676] Memory Limit: none
[  328.526827] trusty-log panic notifier - trusty version Built: 08:57:16 Feb 19 2022
[  328.547192] Rebooting in 5 seconds..
Shutdown state requested 1
Rebooting system ...

How can I further investigate in which area the CPU was stuck?

Thank you for your help!

I think IGB refers to an Intel Ethernet device. The watchdog timer part says an interrupt was issued to service the device, but the device failed to respond (a kernel-space issue while servicing the driver). It is hard to say anything more useful. You might consider adding a serial console boot log (what happens prior to this matters, since it sets up the environment the driver loads into), along with answers to the following:

  • Can you verify this is a dev kit, versus a module plus third party carrier board?
  • Which JetPack or L4T release is this?
  • If this is an SD card model, which SD card image is used, and was the Jetson itself flashed with that release (there is QSPI memory used in boot which would affect the Intel ethernet)?
  • If you have ever changed the device tree or kernel, then the nature of the change would be good to know (and if stock, then that too would be good to know).

Thanks for your reply!
We are using a carrier board named “Boson for FRAMOS Carrier Board” with the stock kernel and device tree from JetPack 4.6.1.
It does contain a secondary i210 PCIe Ethernet port, which was plugged in and in use before the crash occurred.

I will try to upgrade to 4.6.2 and install their BSP, in case they made any kernel modifications regarding this issue, and I will report back with more information if the issue still occurs.

Thanks for your help!

If the third party carrier board has the same exact lane routing, then you won’t need a new device tree. However, if anything is different, then you will need a new device tree (which can affect the BMP). If the secondary i210 is related to a PCIe lane setup which is not an exact duplicate of the dev kit, then this too would cause a need for a device tree edit.

Hello,
So I flashed the latest stock JetPack 4.6.2 kernel, and the second Ethernet interface was detected automatically.
After a while I saw the following error in dmesg, still related to the igb driver:

[  271.181751] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
[  271.181842] ------------[ cut here ]------------
[  271.181887] WARNING: CPU: 5 PID: 16919 at /home/donecle/nvidia/nvidia_sdk/JetPack_4.6.2_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  271.181916] Modules linked in: zram ipt_MASQUERADE nf_nat_masquerade_ipv4 bnep iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ath10k_pci(O) ath10k_core(O) ath(O) mac80211(O) btusb btrtl btbcm btintel userspace_alert cfg80211(O) vfat fat compat(O) nvgpu binfmt_misc ip_tables x_tables

[  271.182034] CPU: 5 PID: 16919 Comm: mavros_node Tainted: G           O    4.9.253-tegra #2
[  271.182039] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[  271.182044] task: ffffffc1e9ceaa00 task.stack: ffffffc1eb3bc000
[  271.182051] PC is at dev_watchdog+0x2c8/0x2d0
[  271.182058] LR is at dev_watchdog+0x2c8/0x2d0
[  271.182064] pc : [<ffffff8008d9c4d0>] lr : [<ffffff8008d9c4d0>] pstate: 00400045
[  271.182068] sp : ffffffc1ffda0db0
[  271.182073] x29: ffffffc1ffda0db0 x28: 0000000000000002
[  271.182083] x27: ffffff8009e550c8 x26: 00000000ffffffff
[  271.182094] x25: 0000000000000005 x24: 0000000000000140
[  271.182105] x23: ffffff8009e56000 x22: ffffffc1f0d84460
[  271.182115] x21: 0000000000000001 x20: ffffffc1f0d84000
[  271.182125] x19: ffffffc1f0ca0140 x18: 0000000000000000
[  271.182136] x17: 0000007f7862b428 x16: 0000007f77ebb250
[  271.182146] x15: ffffffffffffffff x14: ffffff800a13cdd8
[  271.182156] x13: ffffff800a13ca27 x12: 0000000000000000
[  271.182166] x11: 0000000000000006 x10: 00000000000003d6
[  271.182178] x9 : 0000000000000000 x8 : ffffffc1ffcfa17b
[  271.182188] x7 : 0000000000000000 x6 : ffffffc1ffda1bf0
[  271.182199] x5 : ffffffc1ffda1bf0 x4 : 0000000000000000
[  271.182209] x3 : ffffffc1ffda77f8 x2 : ffffffc1ffda1bf0
[  271.182219] x1 : ffffffc1e9ceaa00 x0 : 0000000000000037

[  271.182233] ---[ end trace ef3f6df19fbbcc72 ]---
[  271.182249] Call trace:
[  271.182257] [<ffffff8008d9c4d0>] dev_watchdog+0x2c8/0x2d0
[  271.182266] [<ffffff8008136970>] call_timer_fn+0x38/0x1e0
[  271.182272] [<ffffff8008136c8c>] expire_timers+0x144/0x188
[  271.182278] [<ffffff8008136d8c>] run_timer_softirq+0xbc/0x178
[  271.182285] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[  271.182293] [<ffffff80080ba0b0>] irq_exit+0xd0/0x118
[  271.182300] [<ffffff8008120d84>] __handle_domain_irq+0x6c/0xc0
[  271.182306] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  271.182313] [<ffffff8008083634>] el0_irq_naked+0x54/0x60
[  271.184615] igb 0004:04:00.0 eth1: Reset adapter
[  272.286201] bpmp: mrq 27 took 1328000 us
[  273.134373] igb 0004:04:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

Is there an igb backport that I could try?

Thanks for your help!

I am curious, if you run “ifconfig”, then at the bottom of the particular device’s listing there will be listed an “interrupt”. Which interrupt is it? I’ll use “37” for an example since I see this on one NX. Then run the following command and show what it outputs (adjust the 37 to be whatever IRQ it really is):

cat /proc/interrupts | egrep '(CPU0| 37:|^IPI| Err:)'

(there are some spaces in there, so use mouse copy and paste if you can)

The reason I ask is that the kernel OOPS shown has an Ethernet error, but it is in a non-critical software IRQ, not a hardware IRQ. A hardware IRQ is triggered by actual wires to/from the Ethernet card. A certain amount of code runs in that handler, and optionally a driver can split off the code which does not talk directly to the Ethernet card and run it later (triggering the software IRQ). It is good practice to keep the hardware IRQ handler minimal and to perform the remaining work after releasing the hardware interrupt from the physical device. This is especially true on a Jetson, since most hardware can only talk to CPU0 (a wire is required to talk to the device; for all CPU cores to be reachable one would need either a wire to each or a programmable device to distribute interrupts to the other cores). I am curious whether the hardware IRQ itself shows anything unusual. I would expect the device tree to be related to the hardware IRQ drivers, but not to the software IRQ drivers. Not sure if this matters, but it is easy to look at.

Note that the watchdog says the driver just did not respond. This could be from a fault in the software driver. However, it could still be a problem with the hardware driver, e.g., suppose the software driver stopped while waiting for data? If that were the case, then the software driver might be ok and simply starved due to a hardware driver issue.
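For anyone who wants to compare the per-CPU counts of one IRQ over time instead of reading the egrep output by eye, here is a minimal Python sketch. It assumes the usual /proc/interrupts layout (a header row of CPU names, then one row per IRQ label). The SAMPLE text, including IRQ 37, is made up for illustration and is not taken from this board.

```python
# Hypothetical /proc/interrupts excerpt; the numbers are invented.
SAMPLE = """\
           CPU0       CPU1       CPU2
 37:      1200          0         34  GICv2  226 Level  eth0
IPI0:      100        200        300  Rescheduling interrupts
 Err:        0
"""

def irq_counts(interrupts_text, irq):
    """Return {cpu_name: count} for one IRQ label, or {} if it is absent."""
    lines = interrupts_text.splitlines()
    cpus = lines[0].split()                  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() != str(irq):
            continue
        counts = []
        for field in rest.split():
            if not field.isdigit():          # stop at the controller/name columns
                break
            counts.append(int(field))
        return dict(zip(cpus, counts))
    return {}

print(irq_counts(SAMPLE, 37))
```

On the Jetson you would pass open("/proc/interrupts").read() instead of SAMPLE. Calling it twice a few seconds apart shows whether the count is still increasing and on which core.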

Hello, thanks for your reply!
I don’t see any interrupt when I run ifconfig eth1 (the interface bound to the igb driver), only for eth0. Here is the output:

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.3.1  netmask 255.255.255.0  broadcast 10.0.3.255
        inet6 fe80::20c:8bff:feb5:1569  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:8b:b5:15:69  txqueuelen 1000  (Ethernet)
        RX packets 445146  bytes 574172814 (574.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 772  bytes 73555 (73.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device memory 0x1740200000-174027ffff
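A possible explanation for the missing “interrupt” line, stated as an assumption rather than a confirmed cause: PCIe NICs that use MSI/MSI-X interrupts often do not report a single legacy IRQ through ifconfig. In that case the IRQ numbers can usually be read from the msi_irqs directory of the PCI device in sysfs. A minimal Python sketch follows. The net_root parameter is only there so the function can be pointed at a test directory.

```python
import os

def msi_irqs(iface, net_root="/sys/class/net"):
    """Return the sorted MSI/MSI-X IRQ numbers of a network interface.

    Reads <net_root>/<iface>/device/msi_irqs, where each entry is named
    after one IRQ number. Returns [] if the directory is missing (no such
    interface, or the device does not use MSI).
    """
    path = os.path.join(net_root, iface, "device", "msi_irqs")
    try:
        return sorted(int(name) for name in os.listdir(path))
    except OSError:
        return []
```

Calling msi_irqs("eth1") on the board should give the numbers to look for in the first column of /proc/interrupts.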

Thanks for your help!

Is this an add-on USB NIC?

It is an Intel i210 PCIe Ethernet adapter!

This is definitely the Intel ethernet driver failing (though it could be due to underlying hardware which the driver talks to) and timing out. I’ve heard of other Intel IGB driver issues, though I can’t recall what issues people had run into.

In the original kernel panic it is servicing the interrupt for that device, and the device does not respond. In the second case the OOPS is reported while “mavros_node” is running in user space rather than in the driver, but directly after the non-fatal OOPS the Intel IGB NIC driver shows up again. This time it gets away with resetting the NIC, so I think the NIC and/or its driver is again at issue.

The ifconfig output suggests that the NIC/driver combination is capable of working and shows many bytes both sent and received without any kind of error, and so it is hard to say what is going on. Does this always happen, or is there some condition which seems to trigger this? I’m wondering if for example this occurs under heavy load (which might indicate trouble delivering power to the NIC).
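To check whether the timeouts line up with load, one option is to log the interface counters that the kernel exposes under /sys/class/net/<iface>/statistics and compare the rates just before a watchdog message. This is only a sketch. The choice of counters and the one-second interval are assumptions, not something tested on this board.

```python
import time

COUNTERS = ("rx_bytes", "tx_bytes", "rx_dropped", "tx_errors")

def read_counters(iface, net_root="/sys/class/net"):
    """Read the selected counters of one interface from sysfs."""
    values = {}
    for name in COUNTERS:
        with open(f"{net_root}/{iface}/statistics/{name}") as f:
            values[name] = int(f.read())
    return values

def rates(before, after, seconds):
    """Per-second change between two counter snapshots."""
    return {name: (after[name] - before[name]) / seconds for name in before}

def sample(iface, interval=1.0):
    """Take two snapshots `interval` seconds apart and return the rates."""
    before = read_counters(iface)
    time.sleep(interval)
    after = read_counters(iface)
    return rates(before, after, interval)
```

Running sample("eth1") in a loop and writing the result with a timestamp to a file gives something to correlate with the dmesg timestamps afterwards.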

Hello, I reproduced the issue again with JetPack 4.6.2 on a different carrier board (same model), this time triggering a reboot:

[  244.303478] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
[  244.303546] ------------[ cut here ]------------
[  244.303592] WARNING: CPU: 2 PID: 17412 at /home/donecle/nvidia/nvidia_sdk/JetPack_4.6.2_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  244.303622] Modules linked in: zram ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat bnep nf_conntrack ath10k_pci(O) btusb ath10k_core(O) btrtl btbcm btintel ath(O) mac80211(O) userspace_alert cfg80211(O) vfat fat compat(O) nvgpu binfmt_misc ip_tables x_tables

[  244.303747] CPU: 2 PID: 17412 Comm: mavros_node Tainted: G           O    4.9.253-tegra #2
[  244.303751] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[  244.303756] task: ffffffc1f5fab800 task.stack: ffffffc1edfe0000
[  244.303763] PC is at dev_watchdog+0x2c8/0x2d0

An Ethernet cable runs from eth1 to a sensor that sends ~3 MB/s of UDP data. There is little transmission to the sensor (a few bytes per second).

lshw prints the following:

*-network
       description: Ethernet interface
       product: I210 Gigabit Network Connection
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0004:04:00.0
       logical name: eth1
       version: 03
       serial: 00:0c:8b:b5:15:6d
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.4.0-k duplex=full firmware=3.25, 0x800005d0 ip=10.0.3.1 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:33 memory:1740200000-174027ffff ioport:1000(size=32) memory:1740280000-1740283fff

Is it a good idea to try to compile a more recent igb driver?
Thanks for your help!

So I compiled the latest igb driver out of tree (I had 5.4.0, I now have 5.11.4; it loads correctly and eth1 still works). I will see what happens now!

*-network
       description: Ethernet interface
       product: I210 Gigabit Network Connection
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0004:04:00.0
       logical name: eth1
       version: 03
       serial: 00:0c:8b:b5:15:6d
       size: 1Gbit/s
       capacity: 1Gbit/s
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi msix pciexpress bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=igb driverversion=5.11.4 duplex=full firmware=3.25, 0x800005d0 ip=10.0.3.1 latency=0 link=yes multicast=yes port=twisted pair speed=1Gbit/s
       resources: irq:33 memory:1740200000-174027ffff ioport:1000(size=32) memory:1740280000-1740283fff

Let us know if it works. If not, then I suggest flashing L4T R32.7.2 (which is the most recent; whatever IGB driver is there is likely matching the rest of the kernel and more debugged).

Unfortunately, updating the driver did not solve the problem; I had another stack trace this morning:

[ 1661.264385] NETDEV WATCHDOG: eth1 (igb): transmit queue 0 timed out
[ 1661.264455] ------------[ cut here ]------------
[ 1661.264496] WARNING: CPU: 1 PID: 839 at /home/donecle/nvidia/nvidia_sdk/JetPack_4.6.2_Linux_JETSON_XAVIER_NX_TARGETS/Linux_for_Tegra/sources/kernel/kernel-4.9/net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[ 1661.264525] Modules linked in: zram ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat bnep nf_conntrack ath10k_pci(O) ath10k_core(O) ath(O) vfat mac80211(O) fat btusb btrtl btbcm btintel cfg80211(O) userspace_alert igb(O) compat(O) nvgpu binfmt_misc ip_tables x_tables

[ 1661.264638] CPU: 1 PID: 839 Comm: mavros_node Tainted: G           O    4.9.253-tegra #1
[ 1661.264641] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[ 1661.264648] task: ffffffc1f4e39c00 task.stack: ffffffc1c9bec000
[ 1661.264655] PC is at dev_watchdog+0x2c8/0x2d0
[ 1661.264660] LR is at dev_watchdog+0x2c8/0x2d0
[ 1661.264666] pc : [<ffffff8008d53550>] lr : [<ffffff8008d53550>] pstate: 00400045
[ 1661.264670] sp : ffffffc1ffd40db0
[ 1661.264674] x29: ffffffc1ffd40db0 x28: 0000000000000002
[ 1661.264685] x27: ffffff8009de50c8 x26: 00000000ffffffff
[ 1661.264695] x25: 0000000000000001 x24: 0000000000000140
[ 1661.264704] x23: ffffff8009de6000 x22: ffffffc1eb540460
[ 1661.264714] x21: 0000000000000000 x20: ffffffc1eb540000
[ 1661.264724] x19: ffffffc1e644d000 x18: 0000000000000000
[ 1661.264734] x17: 0000007f9998f428 x16: 0000007f9921f250
[ 1661.264744] x15: ffffffffffffffff x14: ffffff800a0c2dd8
[ 1661.264753] x13: ffffff800a0c2a27 x12: 0000000000000000
[ 1661.264763] x11: 0000000000000006 x10: 0000000000000461
[ 1661.264774] x9 : 0000000000000000 x8 : ffffffc1ffcfe78f
[ 1661.264784] x7 : 0000000000000000 x6 : ffffffc1ffd41bf0
[ 1661.264794] x5 : ffffffc1ffd41bf0 x4 : 0000000000000000
[ 1661.264803] x3 : ffffffc1ffd477f8 x2 : ffffffc1ffd41bf0
[ 1661.264813] x1 : ffffffc1f4e39c00 x0 : 0000000000000037

[ 1661.264827] ---[ end trace d6df15a35afe92c3 ]---
[ 1661.264843] Call trace:
[ 1661.264851] [<ffffff8008d53550>] dev_watchdog+0x2c8/0x2d0
[ 1661.264858] [<ffffff8008136970>] call_timer_fn+0x38/0x1e0
[ 1661.264864] [<ffffff8008136c8c>] expire_timers+0x144/0x188
[ 1661.264870] [<ffffff8008136d8c>] run_timer_softirq+0xbc/0x178
[ 1661.264876] [<ffffff8008081054>] __do_softirq+0x13c/0x3b0
[ 1661.264884] [<ffffff80080ba0b0>] irq_exit+0xd0/0x118
[ 1661.264891] [<ffffff8008120d84>] __handle_domain_irq+0x6c/0xc0
[ 1661.264896] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[ 1661.264902] [<ffffff8008083634>] el0_irq_naked+0x54/0x60
[ 1662.231902] igb 0004:04:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[ 1664.532083] igb 0004:04:00.0 eth1: igb: eth1 NIC Link is Down
[ 1668.477648] igb 0004:04:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

If I understand correctly, “CPU: 1 PID: 839” etc. simply tells what the CPU was doing at the time of the panic. I wonder whether the process name (which is always mavros_node) can be related to the panic?
This is with the latest L4T R32.7.2 and the latest Intel igb driver.

I will try to continue the tests without this ‘mavros_node’ to see if we still have a backtrace. Thanks for your help!

We basically know that the CPU stopped responding and the watchdog timer triggered. We also know that the IGB driver is related. Quite possibly the IGB issue comes from something mavros_node did, but we don’t really have any details. For example, perhaps IGB has a bug under certain data conditions, and mavros_node sends that data; or perhaps IGB is perfectly fine and mavros_node itself fails somewhere while sending to IGB, indirectly timing out IGB. We don’t know.
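One way to gather a bit more evidence is to tabulate every “NETDEV WATCHDOG” event from saved dmesg or serial logs, together with the queue, CPU and Comm reported right after it. The Python sketch below does that. The SAMPLE lines are copied from the traces in this thread, with the long source path abbreviated to “...”. Note that the traces enter through el0_irq_naked and the timer softirq, so Comm is most likely just the task that was interrupted when the watchdog timer fired, not necessarily the cause.

```python
import re

# Lines taken from the traces in this thread (source path abbreviated).
SAMPLE = """\
[  271.181751] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
[  271.181887] WARNING: CPU: 5 PID: 16919 at .../net/sched/sch_generic.c:316 dev_watchdog+0x2c8/0x2d0
[  271.182034] CPU: 5 PID: 16919 Comm: mavros_node Tainted: G           O    4.9.253-tegra #2
[ 1661.264385] NETDEV WATCHDOG: eth1 (igb): transmit queue 0 timed out
[ 1661.264638] CPU: 1 PID: 839 Comm: mavros_node Tainted: G           O    4.9.253-tegra #1
"""

WATCHDOG = re.compile(
    r"\[\s*([\d.]+)\]\s+NETDEV WATCHDOG: (\S+) \((\w+)\): "
    r"transmit queue (\d+) timed out"
)
CONTEXT = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")

def watchdog_events(log_text):
    """Return one dict per TX watchdog timeout found in a kernel log."""
    events = []
    for line in log_text.splitlines():
        m = WATCHDOG.search(line)
        if m:
            events.append({
                "time": float(m.group(1)), "iface": m.group(2),
                "driver": m.group(3), "queue": int(m.group(4)),
                "cpu": None, "comm": None,
            })
            continue
        m = CONTEXT.search(line)
        # Attach the first "CPU: .. Comm: .." line that follows an event.
        if m and events and events[-1]["comm"] is None:
            events[-1]["cpu"] = int(m.group(1))
            events[-1]["comm"] = m.group(3)
    return events

for event in watchdog_events(SAMPLE):
    print(event)
```

If a run without mavros_node still produces events, with some other Comm, that would suggest the process name is incidental.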

Someone from NVIDIA may be able to say more if they can reproduce this, but you’ll need to give information for reproducing. Give the exact make/model of Jetson, what release is flashed to it, what software has been added (e.g., for Mavros), and what hardware is added (e.g., network cards, even keyboard/mouse). Basically anything which might allow NVIDIA to cause this to occur for them while watching.

Thanks for the explanation!
We have another device on PCIe, a wireless M.2 card. Could the two draw too much power from PCIe, causing timeouts? Is there a way to check the current PCIe power draw?

Thanks!

Other devices, including PCIe, can cause timeouts and other failures in other hardware and/or drivers. I don’t see any indication of a power failure issue though. Certainly there are times when power draw does cause instability, but I don’t think this would normally cause this kind of timeout without some other logged error (it could, I just don’t think it is probable). The driver to the m.2 card is more likely to be an issue than is the power draw, but we also cannot confirm that.

The reality is that it would be best if NVIDIA had that exact m.2 PCIe card to install and test with. Perhaps they have one available, but you’d need to provide the exact model, and if you installed a driver, give details of exactly where the driver is from (e.g., a driver downloaded separately, compiled, and added as a module, versus a driver compiled and installed from the NVIDIA-provided kernel source, and so on). Even if others do not have that hardware available to debug, knowing the exact model and driver makes it possible to find reports related to that driver on the Internet, which would be a good clue. Details for recreation are important.
