Unscheduled reboots of Jetson Xavier AGX DevKit

Hi,

I am running L4T 32.4.3 (JetPack 4.4) on a Xavier DevKit, which keeps rebooting randomly.

I have attached the dmesg output and the syslog, which show a CPU stall. I am using the D3 Engineering camera board (D3 is an approved NVIDIA Jetson Camera Partner), and they have indicated that it’s an issue related to NVIDIA hardware.

Is this a known issue? Is there a patch for this L4T release?

Output of cat /etc/nv_tegra_release:

R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
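The release string above can be reduced to a plain L4T version number with a small helper. This is only a sketch: the function name l4t_version is made up for illustration, and it assumes the file keeps the "R<major> (release), REVISION: <minor>" layout shown above, with or without a leading "# ".

```shell
# Hypothetical helper: print the L4T version (e.g. 32.4.3) found on the
# first line of a file in /etc/nv_tegra_release format.
# $1 = file to read (defaults to /etc/nv_tegra_release)
l4t_version() {
    sed -n -E \
        '1s/^(# *)?R([0-9]+) \(release\), REVISION: ([0-9.]+),.*/\2.\3/p' \
        "${1:-/etc/nv_tegra_release}"
}

# Usage on the Jetson itself:
#   l4t_version
```

Having the L4T number in this form makes it easier to compare against the release a patch was written for.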

Thanks,
Sandip

dmesg_13Oct.txt (85.9 KB)
syslog_oct13 (2.2 MB)

hello sandiprmlc0,

could you please try removing the camera board and check whether the issue still reproduces?
Also, could you please confirm that you’re providing the correct power supply to the Xavier platform?

Hi,

[quote]
could you please try remove the camera board to reproduce the issue,
[/quote]

Yes, but before I do that, do you need more logs, core dumps etc. from the system? Is there a livepatch that I can simply try with the current setup?

I am powering the AGX with the approved LITEON 19V adapter.

Best,
Sandip

We have also been experiencing such random reboots using an existing 32.3.1 image we were flashing on new boards. This image worked well in the past but then started being unstable on new systems we were installing it on.

I ported our setup to 32.6.1 and have not seen any problems so far - keeping an eye on this.

Could it be that newer revisions of boards are not supported by older Jetpack versions?

hello sandiprmlc0,

yes, please set up a serial console to gather UART logs, or you may leave a terminal open to keep gathering kernel logs with $ dmesg --follow
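One way to use the dmesg --follow suggestion is to stream the output to a second machine, so the last lines before a reboot are not lost with the Jetson's own page cache. The sketch below rests on several assumptions: the helper name stamp_lines, the SSH target user@jetson, and the log file name are all hypothetical, and it assumes dmesg is readable without root on the board.

```shell
# Hypothetical helper: prefix every line read from stdin with the local
# wall-clock time, so the moment of the last message before a reboot is known.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
    done
}

# Usage from another host (user@jetson and the file name are placeholders):
#   ssh user@jetson 'dmesg --follow' | stamp_lines | tee -a jetson-kernel.log
```

Messages printed in the final moments before a hard reset can still be missed this way, which is why the serial console remains the more reliable source.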

Random reboots seem to be quite common on the AGX.
I have noticed that keeping a ping to the gateway (192.168.0.1 or similar) running in the background can keep the AGX up for several months at a time. As far as I can see, there is a bug in either the network drivers or the networking hardware, perhaps even one related to power management. I’ve tried to find the cause but only found this workaround. My suspicion of the network was triggered by intermittent network pauses: in every case, a reboot followed shortly after the network issues appeared. It’s as if some interrupts go unnoticed, and the ping nudges the network stack into processing the pending packets.
There have been many reports of reboots, and I have read a lot of them, but nobody seems to know what causes the problem or how to solve it. I guess I have learned to live with it. (I won’t be using any AGX in production here because of this issue.) For desktop and development use it’s fine, and the reboot is fast. Perhaps a future iteration with a newer processor will be rock stable, who knows. I have had two Jetson AGX systems in testing for a full year now and both have the same issue. One runs stock Ubuntu 18.04.5 LTS, the other runs Debian Sid.
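The background ping workaround described above could be set up roughly as follows. This is a sketch, not a verified fix: the helper name default_gateway is invented here, the 10-second interval is an arbitrary choice, and it assumes the gateway is the next hop of the default route as printed by ip route.

```shell
# Hypothetical helper: read `ip route` style output on stdin and print the
# next-hop address of the first default route.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Usage on the Jetson (interval chosen arbitrarily):
#   gw=$(ip route show default | default_gateway)
#   ping -i 10 "$gw" > /dev/null 2>&1 &
```

Looking the gateway up at start time avoids hard-coding 192.168.0.1, which differs between networks.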

F.w.i.w.
[ 95.531425] [] el1_irq+0xe8/0x194
[ 95.531467] [] nf_conntrack_in+0x100/0x940 [nf_conntrack]
[ 95.531480] [] ipv4_conntrack_in+0x30/0x40 [nf_conntrack_ipv4]

This clearly shows that the problem is related to networking IRQs, and I have not read any sensible explanation or fix anywhere in these forums. I checked the changelogs for the NVIDIA kernels but can’t find anything that appears to address it.

The only thing I saw recently was that newer kernels disable reboots on hung tasks, which sounds more like a quick way to reduce reboots than a fix for the cause (forgive my cynicism).
I hope the upcoming kernel will include a fix. The 5.x kernels are not yet available for testing, but I hope to see them soon.

Please check /proc/interrupts and see whether an abnormal number of interrupts is coming from bluedroid_pm.
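A rough way to make that check is to sum the per-CPU counters for the matching line and compare two samples taken a few seconds apart. The helper below is a sketch under assumptions: the name irq_total is made up, and it assumes the usual /proc/interrupts layout in which the per-CPU counts directly follow the "NN:" label and end at the first non-numeric column.

```shell
# Hypothetical helper: read /proc/interrupts style text on stdin and, for
# each line containing the string $1, print "<irq label> <sum of per-CPU counts>".
irq_total() {
    awk -v name="$1" '
        index($0, name) {
            total = 0
            # Per-CPU counters start at field 2 and stop at the first
            # non-numeric field (the interrupt controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            print $1, total
        }'
}

# Usage: sample twice and compare the totals.
#   irq_total bluedroid_pm < /proc/interrupts; sleep 5
#   irq_total bluedroid_pm < /proc/interrupts
```

What counts as "abnormal" is not defined in this thread; a total that climbs by thousands between samples on an otherwise idle system would at least be worth reporting.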

It is a known issue on AGX Xavier. We have posted a patch on this forum too. Unfortunately, it didn’t make it into the last release.

The patch WayneWWW is referring to is this one: AGX Xavier freeze in MAXN mode - #22 by WayneWWW

Thanks dkreutz for the link to the patch. Could I apply this to L4T 32.4.3? In your link, the patch seems to be for JP 4.4.1/4.5. Can it be applied to 4.3 as well? WayneWWW?

Thanks janrinze for your detailed reply! Yes, I have been seeing these intermittent network pauses as well! OK, I will set up a background process to ping the local gateway. By the way, I am also using VPN tunnel interfaces on the Jetson; should I ping over the tunnel interface instead?

This is an interesting question. WayneWWW any feedback?

I have successfully applied the patch to JP 4.4 and 4.5, and recently to JP 4.6.
I don’t know whether it works for 4.3. You could simply check the sources to see whether the same or similar code is there.
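Besides reading the sources, a dry run of patch can tell whether the change would apply cleanly to a given kernel tree without touching any file. The sketch below assumes GNU patch and a patch file written with a/ and b/ path prefixes (hence -p1); the helper name patch_applies and the example paths are hypothetical.

```shell
# Hypothetical helper: succeed if the patch would apply cleanly.
# $1 = kernel source directory, $2 = patch file. Nothing is modified.
patch_applies() {
    patch -p1 --dry-run --forward --silent -d "$1" < "$2" > /dev/null 2>&1
}

# Usage (both paths are placeholders):
#   if patch_applies ~/l4t/kernel-src ~/bluedroid_pm.patch; then
#       echo "patch applies cleanly"
#   else
#       echo "patch needs manual porting"
#   fi
```

A clean dry run only shows that the surrounding code matches; it does not prove the patch fixes the reboot on that release.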

A quick way to check whether this patch could resolve the problem is to remove the bluedroid_pm driver directly, so that it no longer appears in the lsmod list.

If the issue is gone after removing this driver, then the patch should work.
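The check and removal could look roughly like this. It is a sketch: the helper name module_listed is invented, and it assumes bluedroid_pm is built as a loadable module on the running kernel (if it were built in, it would not show up in lsmod and rmmod could not remove it).

```shell
# Hypothetical helper: read lsmod style output on stdin and succeed if the
# module named $1 appears in the first column.
module_listed() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the Jetson:
#   lsmod | module_listed bluedroid_pm && sudo rmmod bluedroid_pm
#   lsmod | module_listed bluedroid_pm || echo "bluedroid_pm is not loaded"
```

rmmod only lasts until the next boot, so the check needs repeating after every reboot during the test.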

[quote=“BareMetalCoder, post:5, topic:191975”]
Could it be that newer revisions of boards are not supported by older Jetpack versions?
[/quote]

You can check the PCN (Product Change Notification) list here and see whether any of your Xavier boards are affected by those notices.

Thank you. It’s good to discover the existence of these! That being said, it does not appear to be the case. The only significant one I see for AGX is the memory size change, but this required r32.1 which is older than any L4T we have used.

Hi @WayneWWW,
Disabling the bluedroid driver has not resolved the issue. I have attached the syslog from the last reboot. Does this mean that the patch will not work in this case?

@janrinze,
I don’t think it’s a matter of the network stack going idle, as the crash is observed when I send (and receive) a lot of traffic on the network interface.

I think the problem has multiple causes and I’m trying to narrow down the source, but the syslog/dmesg output seems to contain different, inconsistent information every time the system reboots…
syslog.1 (1.4 MB)

  1. Yes, if this issue is not related to bluedroid_pm, then that patch may not work. But please make sure you have really disabled it by checking the output of the “lsmod” command.

  2. I think the logs you have shared so far do not help, as they don’t indicate the real cause of your reboot.
    For example, the dmesg output in your first comment does contain a kernel panic, but it still prints USB log messages afterwards. That means the board didn’t reboot because of that panic.
    Also, most of the time, a sudden reboot will not be recorded in syslog.

Thus, what you should do is remove “quiet” from /boot/extlinux/extlinux.conf and dump your log from the serial console instead of from the dmesg command and syslog. Use this console to monitor for errors until the board reboots.
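Removing “quiet” can be done in a text editor; for a scripted variant, a filter along these lines would work. It is a sketch under assumptions: the helper name strip_quiet and the backup file name are made up, and it assumes “quiet” appears as a separate word on the APPEND line of extlinux.conf.

```shell
# Hypothetical helper: copy extlinux.conf style text from stdin to stdout,
# dropping the standalone word "quiet" from APPEND lines only.
strip_quiet() {
    sed -E '/^[[:space:]]*APPEND/ s/[[:space:]]+quiet([[:space:]]|$)/\1/'
}

# Usage on the Jetson (keep a backup before overwriting the boot config):
#   sudo cp /boot/extlinux/extlinux.conf /boot/extlinux/extlinux.conf.bak
#   strip_quiet < /boot/extlinux/extlinux.conf.bak | sudo tee /boot/extlinux/extlinux.conf > /dev/null
```

Reading from the backup rather than the live file avoids truncating extlinux.conf while it is still being read.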