Jetson AGX Xavier self rebooting

Sure, no problem. I will run my stuff for a couple of days and get back to you.

This is the aarch64/arm64 architecture, so the code actually running is quite different from what runs on a PC.

The RPi is a good example though, since it is running the same architecture (I am assuming this is the 64-bit version). Does the RPi use the same desktop window manager? Does the RPi have the same NoMachine installed? These questions are designed to find a trigger for the problem.
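For reference, both questions can be answered quickly on each machine (these are generic commands, not specific to any one image):

echo "$XDG_CURRENT_DESKTOP"       # which desktop environment is in use
dpkg -l | grep -i nomachine       # whether NoMachine is installed, and which version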

Finding a way to reproduce the problem for NVIDIA to test makes it important to consider such things as whether or not NoMachine is involved in the failure. If the bug can be reproduced, then a solution is far faster/easier to find.

@linuxdev, I have no issue with you helping to narrow down the issue @ynjiun and I are experiencing. It’s all good :-)

I am trying to reproduce the network error by running from a Mac on the same network:
1 - ping -f 192.168.5.6 (flood ping, 56-byte payload)
2 - ping -c 10 -g 1048 -G 4096 -h 100 192.168.5.6 (sweep ping, payload sizes from 1048 to 4096 bytes in 100-byte increments)

On the devkit, nload on eth0 currently displays:
Incoming
Curr: 9.71 MBit/s
Avg: 9.16 MBit/s
Min: 824.00 Bit/s
Max: 14.28 MBit/s
Ttl: 2.54 GByte

Outgoing
Curr: 11.36 MBit/s
Avg: 10.75 MBit/s
Min: 6.79 kBit/s
Max: 16.73 MBit/s
Ttl: 2.98 GByte

3 - overloading the CPUs with a python script running on the devkit (a shell-only equivalent is sketched below, after the tegrastats output):
top - 16:14:35 up 4:05, 10 users, load average: 8.19, 8.11, 6.83
Tasks: 340 total, 9 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 88.4 us, 3.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.8 hi, 5.4 si, 0.0 st
KiB Mem : 32692948 total, 31291584 free, 696116 used, 705248 buff/cache
KiB Swap: 16346464 total, 16346464 free, 0 used. 31625048 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8804 simon 20 0 16744 7000 2164 R 99.0 0.0 27:09.19 python
8802 simon 20 0 16744 7000 2164 R 98.7 0.0 26:00.08 python
8801 simon 20 0 16744 7000 2164 R 98.4 0.0 27:12.98 python
8808 simon 20 0 16744 7004 2164 R 98.4 0.0 27:17.55 python
8803 simon 20 0 16744 7000 2164 R 98.1 0.0 23:42.39 python
8807 simon 20 0 16744 7004 2164 R 98.1 0.0 27:14.02 python
8806 simon 20 0 16744 7004 2164 R 96.8 0.0 20:16.34 python
8805 simon 20 0 16744 7004 2164 R 39.3 0.0 26:08.14 python
7928 simon 20 0 9268 3740 2848 R 1.6 0.0 0:43.43 top
3 root 20 0 0 0 0 S 1.3 0.0 0:43.24 ksoftirqd/0

Current tegrastats is:
RAM 745/31927MB (lfb 7628x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190] EMC_FREQ 0% GR3D_FREQ 0% AO@44.5C GPU@46C Tdiode@47.25C PMIC@100C AUX@44.5C CPU@46C thermal@45.6C Tboard@44C GPU 0/0 CPU 3729/3354 SOC 1087/955 CV 155/155 VDDRQ 465/404 SYS5V 1284/1279
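For reference, I am not including the python script itself; an equivalent all-core load can be generated straight from the shell with something like:

# one busy loop per core (8 on the AGX Xavier); stop them later with: pkill yes
for i in $(seq "$(nproc)"); do yes > /dev/null & done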

The CPU is currently at 46C, which seems to contradict @ynjiun's results.

I am now looking for a way to load the GPU…

Cheers
Simon

The above says a lot. 100 degrees C is rather high. Boiling hot. This is the Power Management IC. At the same time every single CPU is at 100%…not that this is an error per se, but it implies more power consumption, and thus more stress on the PMIC. I have no idea if the networking hardware is physically near the PMIC or not, but perhaps this too is heating up.

Note that ping uses the ICMP protocol, which is quite different from the TCP or UDP of the virtual desktop software. However, the network was showing a stack dump after being declared a “soft” lockup. The implication is that it was not responding, and the assumption is that there was a code error. However, the lack of response could also be due to either IRQ starvation or simply having the system under such a huge load that the response was truly that slow. If the latter is the case, then the network stack dump is just the first symptom, and not the actual cause.
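For what it is worth, IRQ starvation versus a genuinely overloaded system can be narrowed down a bit from another terminal while the load runs; these are standard /proc interfaces, nothing Jetson-specific:

grep -iE 'eth|qos' /proc/interrupts    # which CPU cores service the NIC interrupts
watch -n 1 cat /proc/softirqs          # per-CPU NET_RX/NET_TX softirq counts; a column that stops moving hints at starvation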

Anything other than keyboard/mouse/monitor that you might have connected, and which consumes any significant power from the Jetson, can be tested as a contributing factor if you have a means to power the device externally instead of drawing power directly from the Jetson.

You may pretty much ignore PMIC@100C. Since day 1, no matter what the situation, that number seems to be a constant that never changes, so my guess is that temperature is fake and never gets updated.

You may install the DeepStream SDK; there are many GPU-loaded sample apps under /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/
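For example, a rough sketch of building and running one of those samples (the paths and CUDA_VER below assume a stock DeepStream 5.0 install on JetPack 4.4; adjust them to match your setup):

cd /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-test1
sudo CUDA_VER=10.2 make     # build the sample against the JetPack 4.4 CUDA toolkit
./deepstream-test1-app /opt/nvidia/deepstream/deepstream-5.0/samples/streams/sample_720p.h264   # decode + inference keeps the GPU busy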

As @linuxdev pointed out, ICMP network traffic is indeed very different from the usual TCP traffic. I just put on a YouTube video and watched it on a VNC client.

And it crashed/rebooted with this:
Sep 7 17:08:46 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:08:46 PM Enabling NewFBSize protocol extension for client 192.168.5.2
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM Pixel format for client 192.168.5.2:
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM 8 bpp, depth 6
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM true colour: max r 3 g 3 b 3, shift r 4 g 2 b 0
Sep 7 17:09:36 simon-desktop kernel: [18034.074957] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:36 simon-desktop kernel: [18034.075154] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:09:46 simon-desktop kernel: [18043.773803] bpmp: mrq 22 took 1284000 us
Sep 7 17:09:51 simon-desktop kernel: [18048.362018] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:51 simon-desktop kernel: [18048.362232] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:09:57 simon-desktop kernel: [18054.656631] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:57 simon-desktop kernel: [18054.656835] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:10:02 simon-desktop kernel: [18059.497433] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:10:02 simon-desktop kernel: [18059.497640] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:11:16 simon-desktop kernel: [18133.444566] INFO: rcu_preempt self-detected stall on CPU
Sep 7 17:11:16 simon-desktop kernel: [18133.444750] 0-…: (2 GPs behind) idle=07b/140000000000002/0 softirq=388653/388653 fqs=2329
Sep 7 17:11:16 simon-desktop kernel: [18133.444918] (t=5250 jiffies g=108392 c=108391 q=19651)
Sep 7 17:11:16 simon-desktop kernel: [18133.445041] Task dump for CPU 0:
Sep 7 17:11:16 simon-desktop kernel: [18133.445048] swapper/0 R running task 0 0 0 0x00000002
Sep 7 17:11:16 simon-desktop kernel: [18133.445068] Call trace:
Sep 7 17:11:16 simon-desktop kernel: [18133.445093] [] dump_backtrace+0x0/0x198
Sep 7 17:11:16 simon-desktop kernel: [18133.445103] [] show_stack+0x24/0x30
Sep 7 17:11:16 simon-desktop kernel: [18133.445116] [] sched_show_task+0xf8/0x148
Sep 7 17:11:16 simon-desktop kernel: [18133.445125] [] dump_cpu_task+0x48/0x58
Sep 7 17:11:16 simon-desktop kernel: [18133.445139] [] rcu_dump_cpu_stacks+0xb8/0xec
Sep 7 17:11:16 simon-desktop kernel: [18133.445151] [] rcu_check_callbacks+0x728/0xa48
Sep 7 17:11:16 simon-desktop kernel: [18133.445161] [] update_process_times+0x34/0x60
Sep 7 17:11:16 simon-desktop kernel: [18133.445173] [] tick_sched_handle.isra.5+0x38/0x70
Sep 7 17:11:16 simon-desktop kernel: [18133.445180] [] tick_sched_timer+0x4c/0x90
Sep 7 17:11:16 simon-desktop kernel: [18133.445187] [] __hrtimer_run_queues+0xd8/0x360
Sep 7 17:11:16 simon-desktop kernel: [18133.445194] [] hrtimer_interrupt+0xa8/0x1e0
Sep 7 17:11:16 simon-desktop kernel: [18133.445209] [] arch_timer_handler_phys+0x38/0x58
Sep 7 17:11:16 simon-desktop kernel: [18133.445220] [] handle_percpu_devid_irq+0x90/0x2b0
Sep 7 17:11:16 simon-desktop kernel: [18133.445228] [] generic_handle_irq+0x34/0x50
Sep 7 17:11:16 simon-desktop kernel: [18133.445234] [] __handle_domain_irq+0x68/0xc0
Sep 7 17:11:16 simon-desktop kernel: [18133.445242] [] gic_handle_irq+0x5c/0xb0
Sep 7 17:11:16 simon-desktop kernel: [18133.445248] [] el1_irq+0xe8/0x194
Sep 7 17:11:16 simon-desktop kernel: [18133.445265] [] tcp_wfree+0xa4/0x130
Sep 7 17:11:16 simon-desktop kernel: [18133.445277] [] skb_release_head_state+0x68/0xf8
Sep 7 17:11:16 simon-desktop kernel: [18133.445284] [] skb_release_all+0x20/0x40
Sep 7 17:11:16 simon-desktop kernel: [18133.445292] [] consume_skb+0x38/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445302] [] __dev_kfree_skb_any+0x54/0x60
Sep 7 17:11:16 simon-desktop kernel: [18133.445315] [] tx_swcx_free+0x68/0xe0
Sep 7 17:11:16 simon-desktop kernel: [18133.445322] [] eqos_napi_poll_tx+0x2b0/0x4f8
Sep 7 17:11:16 simon-desktop kernel: [18133.445329] [] net_rx_action+0xf4/0x358
Sep 7 17:11:16 simon-desktop kernel: [18133.445336] [] __do_softirq+0x13c/0x3b0
Sep 7 17:11:16 simon-desktop kernel: [18133.445348] [] irq_exit+0xd0/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445355] [] __handle_domain_irq+0x6c/0xc0
Sep 7 17:11:16 simon-desktop kernel: [18133.445361] [] gic_handle_irq+0x5c/0xb0
Sep 7 17:11:16 simon-desktop kernel: [18133.445368] [] el1_irq+0xe8/0x194
Sep 7 17:11:16 simon-desktop kernel: [18133.445376] [] tick_nohz_idle_exit+0xe8/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445384] [] cpu_startup_entry+0xe8/0x200
Sep 7 17:11:16 simon-desktop kernel: [18133.445395] [] rest_init+0x84/0x90
Sep 7 17:11:16 simon-desktop kernel: [18133.445410] [] start_kernel+0x370/0x384
Sep 7 17:11:16 simon-desktop kernel: [18133.445418] [] __primary_switched+0x80/0x94
Sep 7 17:11:16 simon-desktop kernel: [18133.452546] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 7 17:11:16 simon-desktop kernel: [18133.452697] 0-…: (2 GPs behind) idle=07b/140000000000002/0 softirq=388653/388653 fqs=2286
Sep 7 17:11:16 simon-desktop kernel: [18133.452838] (detected by 2, t=5252 jiffies, g=74604, c=74603, q=314)
Sep 7 17:11:16 simon-desktop kernel: [18133.452957] Task dump for CPU 0:
Sep 7 17:11:16 simon-desktop kernel: [18133.452962] swapper/0 R running task 0 0 0 0x00000002

Could the GPU be at fault here? Not working well with the DevKit?

As @ynjiun noticed, everything seems to be stable with no network. I will run some CPU/GPU stress tests with no network just to see if stability can be reproduced! :-/ Yeah, we are trying to reproduce stability!

Thanks
Simon

One more vote that network stability is questionable.

Just found this network kernel patch

I am debating whether to install it or not, to see if it fixes the “network stability” issue…

@ynjiun: if you tell me how, I will install the patch.

Just curious: what power mode are you using while experiencing all these self-reboots: MAXN, 30W ALL, etc.? Or did you run “sudo jetson_clocks” every time you booted up?
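For reference, the mode can be checked and switched like this (mode 0 is MAXN on the AGX Xavier):

sudo nvpmodel -q          # query the current power mode
sudo nvpmodel -m 0        # switch to MAXN
sudo jetson_clocks        # pin clocks to the maximum for the active mode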

I have tried all modes and it doesn’t seem to matter.

How about the patch? I checked out the file but it’s more or less an email with a code diff. I am not sure how to apply it.

In any case, the reboot seems to be happening when the GPU is involved (video display, DL).

Why? Because the 8 CPU cores were at 100% for 2 hours, reaching 47C with no issue. But stopping the CPU load and displaying a YouTube video, full-screen in HD, triggered the reboot within the next hour. I was working on something else, so I am not sure exactly how long it took to crash.

From the patch, it could be a sync issue…

Hi @linuxdev

I think I found the issue: GPU overheat.

The default fan setting is quiet, which has a trip temperature of 46C. I changed the setting to cool, which has a trip temperature of 35C, with:
sudo nvpmodel -d cool

Since then, the devkit has been playing YouTube HD full-screen videos non-stop with no issue.

Here is the latest tegrastats:
RAM 2440/31925MB (lfb 6939x4MB) SWAP 0/15963MB (cached 0MB) CPU [31%@2265,27%@2265,22%@2265,24%@2265,31%@2265,38%@2265,36%@2265,43%@2265] EMC_FREQ 0% GR3D_FREQ 28% AO@34C GPU@34.5C Tdiode@36.5C PMIC@100C AUX@34C CPU@36C thermal@34.95C Tboard@34C GPU 619/670 CPU 4183/3586 SOC 2788/2544 CV 154/154 VDDRQ 929/897 SYS5V 2564/2474

Thanks
Simon

@ynjiun: That would tend to imply that 100C is a trip temperature rather than an actual temperature, so you are correct about that. I went and examined a couple of Jetsons and they all had that behavior. Some of the temperature monitoring only tells you about the trip point, rather than being an actual measurement, and this is apparently one of them.
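For anyone who wants to check on their own unit, the kernel's thermal zones and their first trip points can be read straight from sysfs (temperatures are reported in millidegrees C):

for z in /sys/class/thermal/thermal_zone*; do
  echo "$(cat "$z/type"): temp=$(cat "$z/temp") trip0=$(cat "$z/trip_point_0_temp" 2>/dev/null)"
done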

@simon.glet , that appears to be the same error. What kind of hardware was the video from? I could see the possibility of any virtual desktop making custom adjustments to networking and triggering something which is not commonly occurring in most situations (a corner case). This particular case also shows (as you mentioned) some GPU involvement higher up in the stack frame, and then below this in the stack frame are the same network problems. If you have a URL to the video or more information it would help.

What makes this more recent stack frame interesting is that GPU calls were made after network calls, which would make sense if network data is driving GPU activity. In the previous cases which were posted the GPU activity was not necessarily present in the stack frame. There is a strong chance that the GPU is just another way the bug shows up, and is not necessarily the original cause. A network error should be correctable, but seems to cause rebooting; however, perhaps the GPU driver also is not handling the error condition which has been passed to it.

The first function call which starts something “specific” in the failure is this:

Sep 7 17:11:16 simon-desktop kernel: [18133.445329] [] net_rx_action+0xf4/0x358

…the GPU has not even been involved yet at that point in the stack frame. After some network activity there is another IRQ, and timers start failing. The GPU errors are part of normal logging, and not part of the stack frame, but the GPU error apparently is going on while the stack frame is being dumped:

Sep 7 17:10:02 simon-desktop kernel: [18059.497433] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:10:02 simon-desktop kernel: [18059.497640] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505

I am inclined to believe that the GPU error message is just a side effect of network code gone wrong. The error is that the GPU is in need of acquiring a semaphore, but cannot. This is out of the control of the GPU, and is the result of something else blocking this. It is a bit like driving up to a gas station to refill the car, but there is a line of hundreds of people in front, and one of them has a dead engine…nobody behind that car could access the gas even if some is available.

If you can provide a way to replicate this, then someone from NVIDIA could probably go straight into the stack frame and find the specific network condition which is stalling out. This issue is part of networking, but it is interfering with the GPU when those virtual desktops are involved.

@ynjiun and @simon.glet: This is a good idea (perhaps both of the people with issues could apply this patch and try again):

…I think you’ve just found one of the triggers to the same network issue, and if that patch worked for the other soft lockup, then it will very likely work with virtual desktop network issues as well.

FYI, in theory, if the soft lockup is just due to too high a load, then running at max performance could help, but only to an extent. If there is a software bug causing the soft lockup, then there is no possibility of performance modes helping. Either way, the real solution is to stop the soft lockup (and it looks like the patch above is most likely the fix).

I do not think that GPU temperature is the cause. Keep in mind that if the system is running in a lower performance mode, the timers which decide whether or not there is a soft lockup can also begin later…if there is some sort of data that must be sent to the GPU, then that transfer is already running before the GPU ever tries to use the data. Running in a lower performance mode could actually give the data more time to go through the system before the soft lockup timer is started. I think the earlier mentioned patch is on target:
https://forums.developer.nvidia.com/t/xavier-with-jp4-2-hangs/72014/8
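Since the question of how to apply that patch came up earlier: a diff like that one is applied to the matching L4T kernel source tree, and the kernel is then rebuilt and installed per the L4T kernel customization documentation. A minimal sketch (assuming the R32 kernel sources are unpacked under Linux_for_Tegra/sources, and with network-fix.patch as a placeholder name for the saved diff):

cd Linux_for_Tegra/sources/kernel/kernel-4.9
patch -p1 --dry-run < ~/network-fix.patch    # first confirm the diff applies cleanly
patch -p1 < ~/network-fix.patch              # apply it, then rebuild the kernel and modules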

@linuxdev

Be that as it may, the DevKit has not rebooted since I started the test 3.5 hours ago, running at MAXN. It is a stability record! :-)

The unit might not have to be returned, which is great news.

I opened a bug ticket (3119509) about the fan setting change and hope for the best.

Thank you all for your help.
Cheers
Simon

Hi simon, did you apply the above-mentioned patch?

Hi @ynjiun

No, I did not.

hmmm… that means so far the only thing you did is:

sudo nvpmodel -d cool

Interesting. I did that, but it still self-reboots.

By the way, how did you contact support? Do you know which phone number to call? Thanks for sharing.

@ynjiun,

I am sorry to hear that your unit is still rebooting… Be aware that after a reboot, the fan cooling goes back to its default (quiet). For more info, please check out: https://docs.nvidia.com/jetson/l4t/#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E03M0HA
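If you want the cool profile to survive reboots, one option is a small one-shot systemd unit that re-applies it at boot. A sketch (the unit name is arbitrary; adjust the nvpmodel path if it lives elsewhere on your image):

sudo tee /etc/systemd/system/fan-cool.service > /dev/null <<'EOF'
[Unit]
Description=Set Jetson fan profile to cool at boot

[Service]
Type=oneshot
ExecStart=/usr/sbin/nvpmodel -d cool

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable fan-cool.service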

I read a thread from someone else who has rebooting issues: We need the Industry Grade (-40 ~ +85°C) AGX Xavier module.

Based on that thread, and on your mention that your current working temperature was around 28C, your GPU might still be overheating.

Did you try to run the DevKit in a cooler environment?

Here is a bigger fan: https://www.siliconhighwaydirect.co.uk/product-p/xhg306.htm

Hi,

Quick update on the rebooting issue: the unit was RMA’d and the new unit is doing great (load testing the CPU and GPU) and running NoMachine with a client session. No more IRQ, heat, or network issues.

The new unit reports the same release (head -1 /etc/nv_tegra_release):
R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020
and the same JetPack version, 4.4.

I would like to thank @linuxdev and @ynjiun for their help on this issue.

Cheers
Simon