Jetson AGX Xavier self-rebooting

This would make sense, but the original issue does appear to be related to the network. However, if an external source were heating up the networking chipset, then perhaps it is related (that's a very big leap of imagination, so I still suspect the network issue and the heat are separate, but the two could occur at the same time and make figuring this out quite difficult).

My network is pretty limited; the setup is the following:
2 computers

  • Jetson AGX DevKit → wired
  • Workstation → wired/wifi

are connected to:

Netgear AC1200 R6120 Wifi router → the router’s “Internet” connector → Ubuntu 18.04 LTS machine acting as a Wifi bridge → Wifi → the fiber-connected Internet Wifi router.

I doubt anyone is messing with the Jetson.

Thanks
Simon

Assuming the router is not in “bridging” mode, then I would agree. You used the “bridge” term, and I think this is likely just language semantics, but can you verify that the world outside of the router cannot initiate connections directly to the Jetson? Basically, if the Jetson must originate connections, then it is not in bridging mode.
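One quick sanity check from the Jetson itself (a sketch, assuming the wired interface is named eth0): if the address shown is a private RFC1918 address (e.g. 192.168.x.x) and the router has no port forwarding set up for it, then the outside world cannot initiate connections to the Jetson.

$ ip -4 addr show eth0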

Also, is there much network traffic at the time of reboot?

I was using the Ubuntu VNC server for UI remote access. Since then I have unminimized the DevKit and only access the unit with ssh.

Thanks
Simon

That is a good point. I will find a utility that can log that info and see what I can find.
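Something as simple as a shell loop might also do in the meantime (a rough sketch, assuming the wired interface is eth0; the sysfs paths are standard Linux, but worth double-checking on the Xavier):

while true; do
  { date '+%F %T'
    cat /sys/class/net/eth0/statistics/rx_bytes \
        /sys/class/net/eth0/statistics/tx_bytes
    cat /sys/devices/virtual/thermal/thermal_zone*/temp
  } >> ~/reboot_watch.log
  sync      # flush so the last entries survive a hard reboot
  sleep 5
done

That would leave a timestamped trail of traffic counters and temperatures right up to the moment of a reboot.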

Thanks
Simon

For reference, the other thread, which is technically separate but perhaps related to this one, is:
https://forums.developer.nvidia.com/t/agx-xavier-kept-rebooting-after-crash/153758

(both threads show similar, though not identical, network errors, and each could have different thermal and power issues)

Thanks @linuxdev for the follow up.

I might be missing something, but if it is a network stack issue with a stable Ubuntu version (18.04 LTS), talk about network instability! I happen to have a 10-year-old laptop/router that bridges to the Internet router, running the exact same Ubuntu version, and it has been running non-stop for weeks.

This is not a Raspberry Pi that I can just replace for $100.

I wish someone from Nvidia could weigh in here, because I am starting to feel I have a faulty unit with little recourse to resolve the issue.

Thanks
Simon

@linuxdev: I am monitoring the network with nload.

Will keep you posted.

Thanks
Simon

Hi linuxdev, I think you are onto something here. I purposely disabled the AGX Xavier network and re-ran all the apps that push the GPU above 45C, and I can no longer crash the system (or cause the self-reboot)…

If so, what fix would be recommended? In our product we need the network to be on to transmit results in real time…

Thanks a lot for your help.

Wow, excellent test !!!

Time to look at a Wifi solution. There is an M.2 card slot on the dev kit.

Intro here: Jetson Nano + Intel Wifi and Bluetooth - JetsonHacks.

Hi @ynjiun and @linuxdev

Chatted with support and my only option is to return the unit. Ah well …

Thanks for your time
Simon

just curious, when did you buy your unit?

Hi @ynjiun,

July 27, 2020.

Cheers
Simon

Could you share the JP version you are working on?

You may get it with: $ head -1 /etc/nv_tegra_release

R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020

Interesting. I bought mine in June, but it has the same JP version as yours. (No wonder it behaves “similarly”.) Before you return your unit, if you don’t mind, would you run one test to see if the self-reboot disappears? The test is: whatever you were running that caused the system to self-reboot, run the same thing with the network disabled. (To disable the network, just click the network icon in the top right corner and click “Disconnect” below “Wired connection 1”.) This might “prove” that it is the network causing all this trouble.
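If you are only reaching the unit over ssh, the same test can be run from a local console instead (a sketch, assuming NetworkManager is managing the connections; note this will of course drop any ssh session):

$ sudo nmcli networking off     # disable all networking
  ... run the workload that used to trigger the reboot ...
$ sudo nmcli networking on      # restore networking afterwards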

Once you get your new unit, please let me know if this self-reboot problem disappears. Thanks.

Sure, no problem. I will run my stuff for a couple of days and get back to you.

This is the aarch64/arm64 architecture, so the code actually running is quite different from what runs on a PC.

The RPi is a good example though, since it runs the same architecture (I am assuming this is the 64-bit version). Does the RPi use the same desktop window manager? Does the RPi have the same NoMachine installed? These questions are designed to find a trigger for the problem.

Finding a way to reproduce the problem for NVIDIA to test means it is important to consider such things as whether or not NoMachine is involved in the failure. If the bug can be reproduced, then a solution is far faster/easier.

@linuxdev, I have no issue with you helping to narrow down the issue @ynjiun and I are experiencing. It’s all good :-)

I am trying to reproduce the network error by running from a Mac on the same network:
1 - ping -f 192.168.5.6 (flood 56 bytes)
2 - ping -c 10 -g 1048 -G 4096 -h 100 192.168.5.6 (sweep of packet sizes from 1048 to 4096 bytes in 100-byte steps)

On the devkit, nload on eth0 currently displays:
Incoming
Curr: 9.71 MBit/s
Avg: 9.16 MBit/s
Min: 824.00 Bit/s
Max: 14.28 MBit/s
Ttl: 2.54 GByte

Outgoing
Curr: 11.36 MBit/s
Avg: 10.75 MBit/s
Min: 6.79 kBit/s
Max: 16.73 MBit/s
Ttl: 2.98 GByte

3 - overloading the CPUs with a python script running on the devkit:
top - 16:14:35 up 4:05, 10 users, load average: 8.19, 8.11, 6.83
Tasks: 340 total, 9 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 88.4 us, 3.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.8 hi, 5.4 si, 0.0 st
KiB Mem : 32692948 total, 31291584 free, 696116 used, 705248 buff/cache
KiB Swap: 16346464 total, 16346464 free, 0 used. 31625048 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8804 simon 20 0 16744 7000 2164 R 99.0 0.0 27:09.19 python
8802 simon 20 0 16744 7000 2164 R 98.7 0.0 26:00.08 python
8801 simon 20 0 16744 7000 2164 R 98.4 0.0 27:12.98 python
8808 simon 20 0 16744 7004 2164 R 98.4 0.0 27:17.55 python
8803 simon 20 0 16744 7000 2164 R 98.1 0.0 23:42.39 python
8807 simon 20 0 16744 7004 2164 R 98.1 0.0 27:14.02 python
8806 simon 20 0 16744 7004 2164 R 96.8 0.0 20:16.34 python
8805 simon 20 0 16744 7004 2164 R 39.3 0.0 26:08.14 python
7928 simon 20 0 9268 3740 2848 R 1.6 0.0 0:43.43 top
3 root 20 0 0 0 0 S 1.3 0.0 0:43.24 ksoftirqd/0

Current tegrastats is:
RAM 745/31927MB (lfb 7628x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190] EMC_FREQ 0% GR3D_FREQ 0% AO@44.5C GPU@46C Tdiode@47.25C PMIC@100C AUX@44.5C CPU@46C thermal@45.6C Tboard@44C GPU 0/0 CPU 3729/3354 SOC 1087/955 CV 155/155 VDDRQ 465/404 SYS5V 1284/1279

The CPU is currently at 46C, which seems to contradict @ynjiun’s results.

I am now looking for a way to load the GPU…
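One option might be to build one of the CUDA samples shipped with JetPack and run it in a loop (a sketch; the samples path and the matrixMul size flags are assumptions and may differ by CUDA version):

$ cp -r /usr/local/cuda/samples ~/cuda-samples
$ cd ~/cuda-samples/0_Simple/matrixMul && make
$ while true; do ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024; done

GR3D_FREQ in tegrastats should climb off 0% if the GPU is actually being loaded.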

Cheers
Simon

The above says a lot. 100 degrees C is rather high. Boiling hot. This is the Power Management IC. At the same time every single CPU is at 100%…not that this is an error per se, but it implies more power consumption, and thus more stress on the PMIC. I have no idea if the networking hardware is physically near the PMIC or not, but perhaps this too is heating up.

Note that ping uses the ICMP protocol, which is quite different from the TCP or UDP used by the virtual desktop software. However, the network was showing a stack dump after being declared a “soft” lockup. The implication is that it was not responding, and the assumption is that there was a code error. However, lack of response could also be due to either IRQ starvation or simply having the system under such a huge load that the response was truly that slow. If the latter is the case, then the network stack dump is just the first symptom, and not the actual cause.
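One way to look at the IRQ starvation angle (a sketch; the exact name of the Ethernet interrupt in /proc/interrupts may differ, e.g. something like “eqos” or “eth0”):

$ grep -i -e eqos -e eth /proc/interrupts

If those interrupts are all being serviced by a single core that is also pinned at 100%, that alone could explain the delayed response and the resulting soft lockup message.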

Anything other than keyboard/mouse/monitor that you might have connected, and which consumes any significant power from the Jetson, can be tested as a contributor to this if you have a means to power the device externally instead of drawing power directly from the Jetson.