Jetson AGX Xavier self rebooting

One more vote that network stability is questionable.

Just found this network kernel patch

I am debating whether or not to install it to see if it fixes the “network stability” issue…

@ynjiun: if you tell me how, I will install the patch.

Just curious what power mode you are using while experiencing all these self-reboots: MAXN, 30W ALL, etc.? Or did you run “sudo jetson_clocks” every time the board boots up?
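(If you are not sure which mode is active, it can be checked with something like the following; both commands exist on JetPack 4.x, though the exact output varies by release:)

sudo nvpmodel -q
sudo jetson_clocks --show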

I have tried all modes and it doesn’t seem to matter.

How about the patch? I checked out the file but it’s more or less an email with a code diff. I am not sure how to apply it.

In any case, the reboot seems to be happening when the GPU is involved (video display, DL).

Why do I say that? Because the 8 CPU cores were at 100% for 2 hours, reaching 47C with no issue. But stopping the CPU load and displaying a YouTube video, in full-screen HD, triggered the reboot within the next hour. I was working on something else, so I am not sure exactly how long it took to crash.

From the patch, it could be a sync issue…

Hi @linuxdev

I think I found the issue: GPU overheat.

The default fan setting is quiet, which has a trip temperature of 46C. I changed the setting to cool, which has a trip temperature of 35C, with:
sudo nvpmodel -d cool

Since then, the devkit has been playing YouTube HD full-screen videos non-stop with no issue.

Here is the latest tegrastats:
RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 CV 154/154 VDDRQ 929/897 SYS5V 2564/2474
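For reference, the same temperatures can be cross-checked against the generic Linux thermal zones (values are in millidegrees C; the zone names vary between L4T releases):

for z in /sys/class/thermal/thermal_zone*; do echo "$(cat $z/type): $(cat $z/temp)"; done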

Thanks
Simon

@ynjiun : That would tend to imply 100C is a trip temperature, rather than an actual temperature, so you are correct about that. I went and examined a couple of Jetsons and they all had that behavior. Some of the temperature monitoring only tells you about the trip point, rather than being an actual measurement, and this is apparently one of them.

@simon.glet , that appears to be the same error. What kind of hardware was the video from? I could see the possibility of any virtual desktop making custom adjustments to networking and triggering something which is not commonly occurring in most situations (a corner case). This particular case also shows (as you mentioned) some GPU involvement higher up in the stack frame, and then below this in the stack frame are the same network problems. If you have a URL to the video or more information it would help.

What makes this more recent stack frame interesting is that GPU calls were made after network calls, which would make sense if network data is driving GPU activity. In the previous cases which were posted the GPU activity was not necessarily present in the stack frame. There is a strong chance that the GPU is just another way the bug shows up, and is not necessarily the original cause. A network error should be correctable, but seems to cause rebooting; however, perhaps the GPU driver also is not handling the error condition which has been passed to it.

The first function call which starts something “specific” in the failure is this:

Sep 7 17:11:16 simon-desktop kernel: [18133.445329] [] net_rx_action+0xf4/0x358

…the GPU has not even been involved yet at that point in the stack frame. After some network activity there is another IRQ, and timers start failing. The GPU errors are part of normal logging, and not part of the stack frame, but the GPU error apparently is going on while the stack frame is being dumped:

Sep 7 17:10:02 simon-desktop kernel: [18059.497433] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:10:02 simon-desktop kernel: [18059.497640] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505

I am inclined to believe that the GPU error message is just a side effect of network code gone wrong. The error is that the GPU needs to acquire a semaphore, but cannot. This is out of the GPU’s control, and is the result of something else blocking it. It is a bit like driving up to a gas station to refill the car, but there is a line of hundreds of people in front, and one of them has a dead engine…nobody behind that car could get to the gas even if some is available.

If you can provide a way to replicate this, then someone from NVIDIA could probably go straight into the stack frame and find the specific network condition which is stalling out. This issue is part of networking, but it is interfering with the GPU when those virtual desktops are involved.
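If it does reboot again, a full copy of the kernel log around the crash would also help. Since the lines above come from syslog, something like this (paths assume a stock Ubuntu/L4T logging setup) should pull out the relevant section after the reboot:

grep -B 5 -A 60 'net_rx_action' /var/log/syslog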

@ynjiun and @simon.glet: This is a good idea (perhaps both of the people with issues could apply this patch and try again):

…I think you’ve just found one of the triggers to the same network issue, and if that patch worked for the other soft lockup, then it will very likely work with virtual desktop network issues as well.
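To apply it, one approach is to save the diff portion of that email to a file and run patch from the top of the kernel source tree (the path and file name below are only examples; adjust to wherever your L4T kernel sources were unpacked):

cd ~/l4t/sources/kernel/kernel-4.9        # example path only
patch -p1 --dry-run < network-fix.patch   # test first; nothing is modified
patch -p1 < network-fix.patch

Then rebuild and install the kernel as usual for your release.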

FYI, in theory, if the soft lockup is just a matter of too high a load, then running at max performance could help, but only to an extent. If there is a software bug causing the soft lockup, then there is no possibility of performance modes helping. Either way the real solution is to stop the soft lockup (and it looks like the patch above is most likely the fix).
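(On the AGX Xavier devkit, max performance would typically be set with the commands below; the mode numbering comes from /etc/nvpmodel.conf, so verify it on your release:)

sudo nvpmodel -m 0    # MAXN on the AGX Xavier devkit
sudo jetson_clocks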

I do not think that GPU temperature is the cause. Keep in mind that if the system is running in a lower performance mode, the timers which decide whether there is a soft lockup can also begin later…if there is some sort of data which must be sent to the GPU, then that transfer is already running before the GPU ever tries to use the data. Running in a lower performance mode could actually give the data more time to go through the system before the soft lockup timer is started. I think the earlier mentioned patch is on target:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8
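For what it is worth, the soft lockup detector’s threshold is visible (and adjustable while debugging) through the standard kernel sysctl; this is generic Linux behavior rather than anything Jetson-specific:

sysctl kernel.watchdog_thresh              # default is usually 10 seconds
sudo sysctl -w kernel.watchdog_thresh=20   # temporarily raise it while debugging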

@linuxdev

Be that as it may, the DevKit has not rebooted since I started the test 3.5 hours ago, running at MAXN. It is a stability record! :-)

The unit might not have to be returned which is great news.

I opened a bug ticket (3119509) about the fan setting change and hope for the best.

Thank you all for your help.
Cheers
Simon

Hi Simon, did you apply the above-mentioned patch?

Hi @ynjiun

No, I did not.

hmmm… that means so far the only thing you did is:

sudo nvpmodel -d cool

Interesting. I did that, but it still self-reboots.

By the way, how did you contact support? Do you know which phone number to call? Thanks for sharing.

@ynjiun,

I am sorry to hear that your unit is still rebooting… Be aware that after a reboot the fan setting goes back to its default (quiet). For more info, please check out: Welcome — Jetson Linux Developer Guide 34.1 documentation
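If you want the cool profile to survive reboots, one simple workaround is a root crontab entry (the nvpmodel path below is an assumption; confirm it with which nvpmodel):

sudo crontab -e    # then add the following line to the root crontab:
@reboot /usr/sbin/nvpmodel -d cool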

I read this thread from someone who has rebooting issues: We need the Industry Grade (-40 ~ +85°C) AGX Xavier module.

Based on that thread, and since you mentioned your current working temperature is around 28C, your GPU might still be overheating.

Did you try to run the DevKit in a cooler environment?

Here is a bigger fan: XHG306 - Active heatsink for the NVIDIA Jetson AGX Xavier production module

Hi,

Quick update on the rebooting issue: the unit was RMA’d and the new unit is doing great (load testing the CPU and GPU) while running NoMachine with a client session. No more IRQ, heat, or network issues.

The new unit has the same L4T release and the same JetPack version (4.4):

head -1 /etc/nv_tegra_release
R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
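(If you want the package-level view as well, this should report the L4T release on a standard JetPack 4.4 install:)

dpkg-query --show nvidia-l4t-core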

I would like to thank @linuxdev and @ynjiun for their help on this issue.

Cheers
Simon