AGX Xavier kept rebooting after crash

I bought an AGX Xavier in June. Since then I have experienced sporadic self-rebooting (to an annoying point). Today it happened again, and after the crash (with GPU temperature around <= 45C) the unit kept rebooting by itself again and again… eventually I had to manually unplug the power supply to stop it. Attached please find the serial console log from when this happened: after_reflash_6_undistort_crash_kept_rebooting.log (121.6 KB)

Could anyone help to share your experience on how to solve this problem? Thanks a lot for your help.

FYI, this is a network issue, as seen in this excerpt:

[  621.135763] tegradc 15200000.nvdisplay: dc_poll_register 0x41: timeout
[  621.135932] tegradc 15200000.nvdisplay: dc timeout waiting for cursor act_req
[  640.547619] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[  640.548380] Kernel panic - not syncing: softlockup: hung tasks
[  640.548540] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L  4.9.140-tegra #1
[  640.548680] Hardware name: Jetson-AGX (DT)
[  640.548775] Call trace:
[  640.548848] [<ffffff800808bdb8>] dump_backtrace+0x0/0x198
[  640.548959] [<ffffff800808c37c>] show_stack+0x24/0x30
[  640.549068] [<ffffff800845c7a0>] dump_stack+0x98/0xc0
[  640.549179] [<ffffff80081c1438>] panic+0x11c/0x298
[  640.549285] [<ffffff8008181760>] watchdog_unpark_threads+0x0/0x98
[  640.549416] [<ffffff80081399e0>] __hrtimer_run_queues+0xd8/0x360
[  640.549532] [<ffffff800813a330>] hrtimer_interrupt+0xa8/0x1e0
[  640.549653] [<ffffff8008bfea80>] arch_timer_handler_phys+0x38/0x58
[  640.549776] [<ffffff8008126f10>] handle_percpu_devid_irq+0x90/0x2b0
[  640.549897] [<ffffff80081214f4>] generic_handle_irq+0x34/0x50
[  640.550253] [<ffffff8008121bd8>] __handle_domain_irq+0x68/0xc0
[  640.550732] [<ffffff8008080d44>] gic_handle_irq+0x5c/0xb0
[  640.551891] [<ffffff8008082c28>] el1_irq+0xe8/0x194
[  640.556790] [<ffffff8008decfa8>] netlink_broadcast_filtered+0x60/0x440
[  640.563177] [<ffffff8008ded3d8>] netlink_broadcast+0x50/0x68
[  640.569062] [<ffffff8008def778>] nlmsg_notify+0x68/0x120
[  640.574372] [<ffffff8008dbf86c>] rtnl_notify+0x5c/0x70
[  640.579191] [<ffffff8008e8fb38>] ndisc_router_discovery+0x838/0x9d0
[  640.585487] [<ffffff8008e901a4>] ndisc_rcv+0xec/0x668
[  640.590740] [<ffffff8008e989a4>] icmpv6_rcv+0x374/0x568
[  640.596160] [<ffffff8008e75e94>] ip6_input_finish+0xe4/0x4f0
[  640.601677] [<ffffff8008e762d8>] ip6_input+0x38/0xb8
[  640.606922] [<ffffff8008e76930>] ip6_mc_input+0xc8/0xf0
[  640.612171] [<ffffff8008e75d24>] ip6_rcv_finish+0x64/0xf0
[  640.617252] [<ffffff8008e76698>] ipv6_rcv+0x340/0x510
[  640.622676] [<ffffff8008da8810>] __netif_receive_skb_core+0x3b8/0xad8
[  640.629150] [<ffffff8008dabc00>] __netif_receive_skb+0x28/0x78
[  640.635185] [<ffffff8008dabc7c>] netif_receive_skb_internal+0x2c/0xb0
[  640.641225] [<ffffff8008dac8a4>] napi_gro_receive+0x15c/0x188
[  640.647178] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430
[  640.652950] [<ffffff8008daded4>] net_rx_action+0xf4/0x358

Perhaps it is related to IPv6 (which is not tested nearly as well as IPv4), but I could not tell you how to track the specifics.
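
One low-effort starting point for anyone digging into a serial console log like the one attached is to pull out the lockup lines and the function names from the call trace. This is only a sketch: the temporary log file and the four sample lines written to it are stand-ins copied from the excerpt above, and you would substitute your own capture.

```shell
# Stand-in log; replace with your own serial console capture.
log=$(mktemp)
cat > "$log" <<'EOF'
[  640.547619] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[  640.548380] Kernel panic - not syncing: softlockup: hung tasks
[  640.585487] [<ffffff8008e901a4>] ndisc_rcv+0xec/0x668
[  640.647178] [<ffffff800894dd90>] eqos_napi_poll_rx+0x358/0x430
EOF

# Lines that announce the lockup or the panic, with their line numbers.
lockups=$(grep -n -E 'soft lockup|Kernel panic' "$log")
echo "$lockups"

# Function names from call-trace lines: the text between "] " and "+0x".
trace_funcs=$(sed -n 's/.*\[<[0-9a-f]*>\] \([A-Za-z0-9_.]*\)+0x.*/\1/p' "$log")
echo "$trace_funcs"
```

Seeing names such as eqos_napi_poll_rx and ndisc_rcv in the list is what points at the Ethernet driver and IPv6 receive path rather than at the GPU.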

For reference, the similar but “not same” thread is:
https://forums.developer.nvidia.com/t/jetson-agx-xavier-self-rebooting/148000/22

In the other thread he is having network-related reboots. It is difficult to say what the specific issue is, but perhaps it is partially data driven. Someone else may know how to look closer at that network issue, and this in turn would probably lead to more information about either overheating or reboot.

One more question which might be useful: Is your network behind a router, or is the outside world able to directly reach the Jetson without the Jetson having initiated the connection?

It’s a LAN under DHCP. I don’t think it’s visible from outside. Typically I use my PC (Ubuntu 18.04) to connect to my AGX through ssh (in the past I used NoMachine, but not anymore). Do you have any recommendation on how to set up the network to minimize such problems? (Or what exactly is this network problem? A network driver issue?) Do you think disabling IPv6 would help?

One other user has had a similar issue, but he is using IPv4 (I’m not positive, but about 90% certain it was IPv4). Regardless of IPv4 versus IPv6, there should not be a kernel dump. His case also shows the same network functions in the soft lockup. Actually disabling IPv6 is unlikely to help. I’m hoping this does not get split into too many threads since it makes tracking it more difficult.

It is possible that applications using data in a particular way (such as NoMachine) do trigger the issue, so it is important for each thread to state what kind of “non-standard” network use might be going on at the time of the failure. For example, if a history of installing NoMachine is in common, then it might be related to an installed or upgraded file, or a configuration related to the application making changes to networking.

In the past wired networking has been reliable, and even with issues common to WiFi, this tended to be a configuration issue which did not result in a kernel error log. Virtual desktop software is kind of a special case in network use, and thus the combination of multiple users and similar network issues under NoMachine implies an even more interesting detail.

See this: https://forums.developer.nvidia.com/t/agx-xavier-easy-to-crash-when-ethernet-network-connected/153777/5?u=ynjiun With the network connected, it may self-reboot without running anything…

Which JetPack version is this?

Regards,
Greg

JetPack 4.4 running on AGX Xavier
JP version:
# R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020

The other thread mentions this patch, and this would be the next step:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8
…it looks like the network side is already a known issue with a fix, and that any GPU error is a side effect of the network side not responding.

Thanks linuxdev, do you think this network issue can manifest in as many ways as I just posted?

Could you show me the steps to apply this patch? (if the above post still makes sense to you)

Yes. A soft lockup will often be found when it interferes with another function.

Despite how complicated the following is going to sound it is much easier in practice to use a patch. I’m just trying to give some information on how patch works, and how you could manually edit to produce the same thing. Patching is where I am starting because I don’t know if you already have kernel source and have any experience building kernels and/or modules. Kernel build has a lot of options in how it can be done, and official docs do show how to build from cross compile on an Ubuntu 18.04 host PC. You can ask more if you need, but it is useful to have some reference available on the forum for other people looking at this.


The provided patch looks like a copy of an email with the patch itself appended. The actual patch mentions the involved files on lines 104 and 105 (you could delete the email content in lines 1 through 102 and you’d have the actual patch which the “patch” command can work with). Those first mentions of files to be patched are:

drivers/net/ethernet/nvidia/eqos/drv.c
drivers/net/ethernet/nvidia/eqos/mdio.c

(which are in the kernel source…you can check official documents on kernel build, but the gist is that you would compile with a match to your current system’s configuration, and then build…perhaps just modules, which is simplest, but perhaps an entire kernel Image plus modules)
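
As a rough sketch of trimming the email portion, assuming the line numbers quoted above hold for your download: the file names here are placeholders, the stand-in input is generated only so the commands can be tried anywhere, and the "diff --git" line on line 103 is my guess at what sits just before the file names.

```shell
# Build a stand-in "email + patch" file: 102 header lines, then the diff.
work=$(mktemp -d)
{
  for i in $(seq 1 102); do echo "email line $i"; done
  # Assumed content of line 103; the real patch may differ here.
  echo "diff --git a/drivers/net/ethernet/nvidia/eqos/drv.c b/drivers/net/ethernet/nvidia/eqos/drv.c"
  echo "--- a/drivers/net/ethernet/nvidia/eqos/drv.c"
  echo "+++ b/drivers/net/ethernet/nvidia/eqos/drv.c"
} > "$work/with_email.patch"

# Delete lines 1 through 102, keeping everything after them.
sed '1,102d' "$work/with_email.patch" > "$work/clean.patch"

head -n 3 "$work/clean.patch"
```

The man page for GNU patch also says it tries to skip leading garbage before the first diff, so this trimming may turn out to be optional.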

You could remove the email part of the patch and use the patch program to apply, but the gist is that where you see something like this it means the change is being applied to remove code from that file, and then to add code to the file:

--- a/drivers/net/ethernet/nvidia/eqos/drv.c
+++ b/drivers/net/ethernet/nvidia/eqos/drv.c

Lines like this imply removing a line, and then adding it back in with edits:

-	spin_lock(&pdata->lock);
+	spin_lock_bh(&pdata->lock);

(in this case to file drivers/net/ethernet/nvidia/eqos/drv.c)

You could manually examine the lines listed in the patch and look and you would see the line as it currently exists (presumably an exact match to the “-” line). Then imagine using your editor to change that line to the “+” line.

The patch program (see “man patch”, and especially in the man page find the “-pnum” description) simply automates those edits. It finds the mentioned files, finds the lines, checks that the correct content is there, removes that content, and adds back in the new content.

The “-p<number>” option is just how much of the parent directory to strip away. Notice that the top level listing of the file to cut from and paste to has an abstract “a/” or “b/” prefix:

--- a/drivers/net/ethernet/nvidia/eqos/drv.c
+++ b/drivers/net/ethernet/nvidia/eqos/drv.c

(note that “a/” and “b/” are just abstract symbols for “where you are currently at”…the top level of kernel source…where “a” and “b” mean “before” and “after”)

If you have actually changed to some directory which has subdirectory drivers/net/ethernet/nvidia/eqos/, then patch would need to ignore the “a/” and “b/”, and work on the “drivers/” subdirectory content; if you had used cd to change directly to the “drivers/net/ethernet/nvidia/eqos/” subdirectory, then you would have to trim all of that prefix path since you are already in that directory. Each level of the “-p” strips a leading part of the path to the file. A “-p6” strips 6 leading path subdirectory names (and this means throwing away “a/drivers/net/ethernet/nvidia/eqos/”, with only “drv.c” remaining).
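
To see the effect of different “-p” values without touching any source, here is a small sketch that mimics the path stripping. The function name strip_components is made up for illustration and is not part of patch itself.

```shell
# Mimic what "patch -pN" does to a file name from the patch header:
# drop the first N slash-separated leading components.
strip_components() {
  n=$1
  path=$2
  echo "$path" | cut -d/ -f"$((n + 1))"-
}

name="a/drivers/net/ethernet/nvidia/eqos/drv.c"
strip_components 0 "$name"   # unchanged, keeps the "a/" prefix
strip_components 1 "$name"   # matches running patch from the top of the kernel source
strip_components 6 "$name"   # matches running patch from inside the eqos directory
```

Whichever result matches a path that actually exists relative to your current directory tells you the number to use.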

If you run patch and it fails from not finding files, then probably the “-p#” was wrong. No harm done. If you run patch and it fails because it does not find the original code, then perhaps the patch is already complete. No harm done. Or perhaps your source code is from a kernel release too far different, and patch is unable to figure out what is going on…in which case the patch probably does not apply to that kernel release.

FYI, the “diff” tool compares two files (e.g., edited and original file versions), and outputs the content which creates a patch. The “patch” tool is the reverse of a “diff” tool. Someone ran diff, and the result can be used to automate the same edits through patch.
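
A throwaway round trip can make the diff/patch relationship concrete. This sketch works entirely in a temporary directory on a one-line stand-in file (not the real drv.c), and assumes GNU diff and GNU patch are installed.

```shell
work=$(mktemp -d)
mkdir -p "$work/a" "$work/b" "$work/src"

# "a" holds the original line, "b" the edited one (stand-in content).
printf '\tspin_lock(&pdata->lock);\n'    > "$work/a/drv.c"
printf '\tspin_lock_bh(&pdata->lock);\n' > "$work/b/drv.c"
cp "$work/a/drv.c" "$work/src/drv.c"

# diff exits with status 1 when the files differ; that is expected here.
(cd "$work" && diff -u a/drv.c b/drv.c > fix.patch)
cat "$work/fix.patch"

# From inside src/, -p1 strips the "a/" and "b/" prefixes.
# --dry-run reports what would happen without touching any file.
(cd "$work/src" && patch -p1 --dry-run < ../fix.patch)
(cd "$work/src" && patch -p1 < ../fix.patch)
cat "$work/src/drv.c"
```

Running with “--dry-run” first is a cheap way to find out whether the “-p” number is right and whether the hunks apply before anything is modified.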

If you are at the top of the kernel source, then you will be able to see this file exists:
ls drivers/net/ethernet/nvidia/eqos/drv.c

This is an example patch command, run from the top of the kernel source (hence “-p1”, which strips the leading “a/” or “b/” from the file names). It would result in either saying “hunks” succeeded, or failed. The “hunks” are just blocks of code which are removed and then reinserted with edits. Example:

patch -p1 < /where/ever/that/patch/is/0001-ethernet-eqos-fix-lockup-due-to-SOFTIRQ-unsafe.patch

You could post the result of this command if it looks like some “hunk” did not succeed. Perhaps that part of the patch is already installed (I have not personally installed that patch).
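
When a hunk fails, GNU patch by default saves it to a file named after the target with “.rej” appended, so a quick way to check a tree afterwards is to look for those files. A minimal sketch, using a made-up helper name (list_rejects) and a stand-in directory tree:

```shell
# Collect the names of any reject files under a source tree.
list_rejects() {
  find "$1" -type f -name '*.rej' | sort
}

# Stand-in tree with one pretend reject, for illustration only.
tree=$(mktemp -d)
mkdir -p "$tree/drivers/net"
touch "$tree/drivers/net/drv.c" "$tree/drivers/net/drv.c.rej"

rejects=$(list_rejects "$tree")
if [ -n "$rejects" ]; then
  echo "Hunks were rejected in:"
  echo "$rejects"
else
  echo "No reject files found."
fi
```

On a real kernel tree you would pass the top of the source directory instead of the temporary one.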

After that it is a case of kernel and/or module compile.


Topics on actual kernel build can get quite long. I don’t consider it difficult, but people tend to fear the number of steps. The gist is that if you save a copy of your running system’s current “/proc/config.gz”, then other than the option known as “CONFIG_LOCALVERSION”, this is an exact match to your running system. If you can build an exact match, then it is much easier to introduce small changes. Two kernels of the same release with different configurations are incompatible and not the same thing until configuration matches.

One can build a whole kernel, or one can build a module. Not everything can be built as a module, and not everything can be built as integrated. If you patch something which was built as a module, then your life is very simple. If you patch something which requires editing a feature which was integrated directly in to your kernel, then it gets more complicated. Save a safe copy of your “/proc/config.gz” before starting, as this will answer the question.

Also, before starting, along with the config.gz, write down the result of the command “uname -r”.
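
The two things to save, plus a way to answer the “module or integrated?” question from the saved config, might look like the sketch below. The backup directory, the helper name config_state, and the CONFIG_EXAMPLE_* symbols are all placeholders; I am not naming the real symbol for the eqos driver since I have not checked it.

```shell
backup=$(mktemp -d)   # placeholder; pick a directory you will keep

# Record the running kernel release string.
uname -r > "$backup/uname-r.txt"

# Save a readable copy of the running configuration, if the kernel exposes it.
if [ -r /proc/config.gz ]; then
  zcat /proc/config.gz > "$backup/config.txt"
fi

# Report whether a symbol is built in (=y), a module (=m), or something else.
config_state() {
  symbol=$1
  file=$2
  line=$(grep -E "^${symbol}=" "$file")
  case "$line" in
    *=y) echo "built-in" ;;
    *=m) echo "module" ;;
    *)   echo "other or not set" ;;
  esac
}

# Stand-in config with hypothetical symbol names, for illustration only.
printf 'CONFIG_EXAMPLE_NET=m\nCONFIG_EXAMPLE_CORE=y\n' > "$backup/sample.txt"
config_state CONFIG_EXAMPLE_NET  "$backup/sample.txt"
config_state CONFIG_EXAMPLE_CORE "$backup/sample.txt"
```

On the Jetson you would point config_state at the saved config.txt and at whichever symbol controls the driver being patched.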


For the URLs below you may need to go there, log in, and then go there again, but kernel source is available for your particular L4T release (see either “head -n 1 /etc/nv_tegra_release” or “dpkg -l | grep 'nvidia-l4t-core'” for finding the L4T release version):
https://developer.nvidia.com/linux-tegra
…and general documents:
https://developer.nvidia.com/embedded/downloads
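
If it helps when matching the download to the system, the first line of /etc/nv_tegra_release can be reduced to a short version string. This sketch uses a made-up helper name (l4t_version) and, as sample input, the release line posted earlier in this thread.

```shell
# Turn the first line of /etc/nv_tegra_release into "major.revision".
l4t_version() {
  sed -n 's/^# R\([0-9]*\) (release), REVISION: \([0-9.]*\),.*/\1.\2/p'
}

sample='# R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020'
echo "$sample" | l4t_version

# On the Jetson itself:
#   head -n 1 /etc/nv_tegra_release | l4t_version
```

For the line posted above this gives 32.4.3, which is the L4T release to look for on the download page.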

The documents above show how to cross compile and install from a host PC, but feel free to ask more if you want to build directly on the Jetson (native compile) or ask questions before installing. Do be sure to save a copy of your “/proc/config.gz” to post here with your questions. You probably would want to “gunzip config.gz” (from a copy somewhere else, e.g., after a copy to your host PC), and rename it with a “.txt” filename suffix (the file is human readable). Also comment on the output of “uname -r” with any post of the config file.

Mostly you will be free to experiment while patching and building, but do ask questions on adding content to the Jetson before actually adding it. Perhaps you don’t need to install the full Image, and a simple module copy might do the job.