Jetson AGX Xavier self rebooting

I was using the Ubuntu VNC server for UI remote access. Since then I have unminimized the DevKit and only access the unit over ssh.

Thanks
Simon

That is a good point. I will find a utility that can log that info and see what I can find.

Thanks
Simon

For reference, the other thread, which is technically separate but perhaps related to this one, is:
https://forums.developer.nvidia.com/t/agx-xavier-kept-rebooting-after-crash/153758

(both threads show similar network errors, though not identical ones, and the two units could have different thermal and power issues)

Thanks @linuxdev for the follow up.

I might be missing something, but if this is a network stack issue with a stable Ubuntu version (18.04 LTS), talk about network instability! I happen to have a 10-year-old laptop/router that bridges to the Internet router, running the exact same Ubuntu version, and it has been running non-stop for weeks.

This is not a Raspberry Pi that I can just replace for $100.

I wish someone from NVIDIA could weigh in here, because I am starting to feel I have a faulty unit with little recourse to resolve the issue.

Thanks
Simon

@linuxdev: I am monitoring the network with nload.
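
(For anyone following along, this is just watching the wired interface directly, i.e. something like:

nload eth0

which shows the current/average/min/max throughput for that interface.)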

Will keep you posted.

Thanks
Simon

Hi linuxdev, I think you are onto something here. I purposely disabled the AGX Xavier network and re-ran all the apps that previously pushed the GPU above 45C, and I can no longer crash the system (or trigger the self-reboot)…

If so, what fix would be recommended? In our product the network needs to stay on so we can transmit results in real time…

Thanks a lot for your help.

Wow, excellent test!!!

Time to look at a WiFi solution. There is an M.2 card slot on the dev kit.

Intro here: Jetson Nano + Intel Wifi and Bluetooth - JetsonHacks.

Hi @ynjiun and @linuxdev

Chatted with support and my only option is to return the unit. Ah well …

Thanks for your time
Simon

Just curious, when did you buy your unit?

Hi @ynjiun,

July 27, 2020.

Cheers
Simon

Could you share the JP version you are working on?

You may get it with: $ head -1 /etc/nv_tegra_release

R32 (release), REVISION: 4.3, GCID: 21589087, BOARD: t186ref, EABI: aarch64, DATE: Fri Jun 26 04:34:27 UTC 2020

Interesting. I bought mine in June but it has the same JP version as yours. (No wonder it behaves “similarly”.) Before you return your unit, if you don’t mind, would you run one test to see if the self-reboot disappears? The test is: whatever you were running before that caused the system to self-reboot, run the same thing with the network disabled. (To disable the network, just click the network icon in the top-right corner and click “Disconnect” below Wired connection 1.) This might “prove” that it is the network causing all this trouble.
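
(If you only have ssh access, the same thing can presumably be done from a local or serial console with NetworkManager; assuming the wired interface is named eth0, something like

sudo nmcli device disconnect eth0

should take it down. Of course this also drops any ssh session going over that interface, so kick off your test from a local terminal first.)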

Once you get your new unit, please let me know if this self-reboot problem disappears. Thanks.

Sure, no problem. I will run my stuff for a couple of days and get back to you.

This is the aarch64/arm64 architecture, so the code actually running is quite different from what runs on a PC.

The RPi is a good example though, since it runs the same architecture (I am assuming this is the 64-bit version). Does the RPi use the same desktop window manager? Does the RPi have the same NoMachine installed? The questions are designed to find a trigger for the problem.

Finding a way to reproduce the problem for NVIDIA to test means it is important to consider such things as whether or not NoMachine is involved in the failure. If the bug can be reproduced, then a solution is far faster and easier.

@linuxdev, I have no issue with you helping to narrow down the issue @ynjiun and I are experiencing. It’s all good :-)

I am trying to reproduce the network error by running the following from a Mac on the same network:
1 - ping -f 192.168.5.6 (flood, 56-byte payload)
2 - ping -c 10 -g 1048 -G 4096 -h 100 192.168.5.6 (sweep payload sizes from 1048 to 4096 bytes in 100-byte steps)

On the devkit, nload on eth0 currently displays:
Incoming
Curr: 9.71 MBit/s
Avg: 9.16 MBit/s
Min: 824.00 Bit/s
Max: 14.28 MBit/s
Ttl: 2.54 GByte

Outgoing
Curr: 11.36 MBit/s
Avg: 10.75 MBit/s
Min: 6.79 kBit/s
Max: 16.73 MBit/s
Ttl: 2.98 GByte

3 - overloading the CPUs with a python script running on the devkit:
top - 16:14:35 up 4:05, 10 users, load average: 8.19, 8.11, 6.83
Tasks: 340 total, 9 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 88.4 us, 3.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 2.8 hi, 5.4 si, 0.0 st
KiB Mem : 32692948 total, 31291584 free, 696116 used, 705248 buff/cache
KiB Swap: 16346464 total, 16346464 free, 0 used. 31625048 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8804 simon 20 0 16744 7000 2164 R 99.0 0.0 27:09.19 python
8802 simon 20 0 16744 7000 2164 R 98.7 0.0 26:00.08 python
8801 simon 20 0 16744 7000 2164 R 98.4 0.0 27:12.98 python
8808 simon 20 0 16744 7004 2164 R 98.4 0.0 27:17.55 python
8803 simon 20 0 16744 7000 2164 R 98.1 0.0 23:42.39 python
8807 simon 20 0 16744 7004 2164 R 98.1 0.0 27:14.02 python
8806 simon 20 0 16744 7004 2164 R 96.8 0.0 20:16.34 python
8805 simon 20 0 16744 7004 2164 R 39.3 0.0 26:08.14 python
7928 simon 20 0 9268 3740 2848 R 1.6 0.0 0:43.43 top
3 root 20 0 0 0 0 S 1.3 0.0 0:43.24 ksoftirqd/0
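
(The script in step 3 is nothing fancy; roughly, it just starts one busy-looping worker per core, along the lines of this minimal sketch:

import multiprocessing

def burn():
    # spin forever to keep one core near 100%
    while True:
        pass

if __name__ == '__main__':
    # launch one busy-loop process per CPU core
    for _ in range(multiprocessing.cpu_count()):
        multiprocessing.Process(target=burn).start()

Each worker just spins, which matches the eight python processes pinned near 99% in the top output above.)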

Current tegrastats is:
RAM 745/31927MB (lfb 7628x4MB) SWAP 0/15963MB (cached 0MB) CPU [100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190,100%@1190] EMC_FREQ 0% GR3D_FREQ 0% AO@44.5C GPU@46C Tdiode@47.25C PMIC@100C AUX@44.5C CPU@46C thermal@45.6C Tboard@44C GPU 0/0 CPU 3729/3354 SOC 1087/955 CV 155/155 VDDRQ 465/404 SYS5V 1284/1279

The CPU is currently at 46C, which seems to contradict @ynjiun's results.

I am now looking for a way to load the GPU…

Cheers
Simon

The above says a lot. 100 degrees C is rather high. Boiling hot. This is the Power Management IC. At the same time every single CPU is at 100%…not that this is an error per se, but it implies more power consumption, and thus more stress on the PMIC. I have no idea if the networking hardware is physically near the PMIC or not, but perhaps this too is heating up.

Note that ping uses the ICMP protocol, which is quite different from the TCP or UDP used by the virtual desktop software. However, the network was showing a stack dump after being declared a “soft” lockup. The implication is that it was not responding, and the assumption is that there was a code error. However, the lack of response could also be due to either IRQ starvation or simply having the system under such a huge load that the response really was that slow. If the latter is the case, then the network stack dump is just the first symptom, not the actual cause.
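
(One rough way to check the IRQ starvation idea, assuming the Ethernet controller's interrupts show up under a name containing "eqos" or "ether" in /proc/interrupts, is to watch which core services them and how fast the counts grow while your load is running:

watch -n 1 "grep -iE 'eqos|ether' /proc/interrupts"

If all of that activity lands on CPU0 while CPU0 is also pegged with other work, it would fit the starvation theory.)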

Anything other than keyboard/mouse/monitor that you might have connected, and which consumes any significant power from the Jetson, can be tested as a contributing factor if you have a means to power that device externally instead of powering it directly from the Jetson.

You can pretty much ignore PMIC@100C; since day 1, no matter the situation, that number has been a constant that never changes, so my guess is that the temperature reading is bogus and never gets updated.

You may install the DeepStream SDK; there are many GPU-heavy sample apps under /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/
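
For example, assuming DeepStream 5.0 is installed in its default location, running the reference app with one of the shipped configs will keep the GPU busy, something like:

deepstream-app -c /opt/nvidia/deepstream/deepstream-5.0/samples/configs/deepstream-app/source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

(the exact config file names may differ between releases, so use whichever .txt config is present under samples/configs/deepstream-app/)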

As @linuxdev pointed out, ICMP network traffic is indeed very different from the usual TCP traffic. So I just put on a YouTube video and watched it over a VNC client.

And it crashed/rebooted with this:
Sep 7 17:08:46 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:08:46 PM Enabling NewFBSize protocol extension for client 192.168.5.2
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM Pixel format for client 192.168.5.2:
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM 8 bpp, depth 6
Sep 7 17:09:31 simon-desktop vino-server.desktop[11517]: 07/09/2020 05:09:31 PM true colour: max r 3 g 3 b 3, shift r 4 g 2 b 0
Sep 7 17:09:36 simon-desktop kernel: [18034.074957] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:36 simon-desktop kernel: [18034.075154] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:09:46 simon-desktop kernel: [18043.773803] bpmp: mrq 22 took 1284000 us
Sep 7 17:09:51 simon-desktop kernel: [18048.362018] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:51 simon-desktop kernel: [18048.362232] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:09:57 simon-desktop kernel: [18054.656631] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:09:57 simon-desktop kernel: [18054.656835] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:10:02 simon-desktop kernel: [18059.497433] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2722 [ERR] semaphore acquire timeout!
Sep 7 17:10:02 simon-desktop kernel: [18059.497640] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 505
Sep 7 17:11:16 simon-desktop kernel: [18133.444566] INFO: rcu_preempt self-detected stall on CPU
Sep 7 17:11:16 simon-desktop kernel: [18133.444750] 0-…: (2 GPs behind) idle=07b/140000000000002/0 softirq=388653/388653 fqs=2329
Sep 7 17:11:16 simon-desktop kernel: [18133.444918] (t=5250 jiffies g=108392 c=108391 q=19651)
Sep 7 17:11:16 simon-desktop kernel: [18133.445041] Task dump for CPU 0:
Sep 7 17:11:16 simon-desktop kernel: [18133.445048] swapper/0 R running task 0 0 0 0x00000002
Sep 7 17:11:16 simon-desktop kernel: [18133.445068] Call trace:
Sep 7 17:11:16 simon-desktop kernel: [18133.445093] [] dump_backtrace+0x0/0x198
Sep 7 17:11:16 simon-desktop kernel: [18133.445103] [] show_stack+0x24/0x30
Sep 7 17:11:16 simon-desktop kernel: [18133.445116] [] sched_show_task+0xf8/0x148
Sep 7 17:11:16 simon-desktop kernel: [18133.445125] [] dump_cpu_task+0x48/0x58
Sep 7 17:11:16 simon-desktop kernel: [18133.445139] [] rcu_dump_cpu_stacks+0xb8/0xec
Sep 7 17:11:16 simon-desktop kernel: [18133.445151] [] rcu_check_callbacks+0x728/0xa48
Sep 7 17:11:16 simon-desktop kernel: [18133.445161] [] update_process_times+0x34/0x60
Sep 7 17:11:16 simon-desktop kernel: [18133.445173] [] tick_sched_handle.isra.5+0x38/0x70
Sep 7 17:11:16 simon-desktop kernel: [18133.445180] [] tick_sched_timer+0x4c/0x90
Sep 7 17:11:16 simon-desktop kernel: [18133.445187] [] __hrtimer_run_queues+0xd8/0x360
Sep 7 17:11:16 simon-desktop kernel: [18133.445194] [] hrtimer_interrupt+0xa8/0x1e0
Sep 7 17:11:16 simon-desktop kernel: [18133.445209] [] arch_timer_handler_phys+0x38/0x58
Sep 7 17:11:16 simon-desktop kernel: [18133.445220] [] handle_percpu_devid_irq+0x90/0x2b0
Sep 7 17:11:16 simon-desktop kernel: [18133.445228] [] generic_handle_irq+0x34/0x50
Sep 7 17:11:16 simon-desktop kernel: [18133.445234] [] __handle_domain_irq+0x68/0xc0
Sep 7 17:11:16 simon-desktop kernel: [18133.445242] [] gic_handle_irq+0x5c/0xb0
Sep 7 17:11:16 simon-desktop kernel: [18133.445248] [] el1_irq+0xe8/0x194
Sep 7 17:11:16 simon-desktop kernel: [18133.445265] [] tcp_wfree+0xa4/0x130
Sep 7 17:11:16 simon-desktop kernel: [18133.445277] [] skb_release_head_state+0x68/0xf8
Sep 7 17:11:16 simon-desktop kernel: [18133.445284] [] skb_release_all+0x20/0x40
Sep 7 17:11:16 simon-desktop kernel: [18133.445292] [] consume_skb+0x38/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445302] [] __dev_kfree_skb_any+0x54/0x60
Sep 7 17:11:16 simon-desktop kernel: [18133.445315] [] tx_swcx_free+0x68/0xe0
Sep 7 17:11:16 simon-desktop kernel: [18133.445322] [] eqos_napi_poll_tx+0x2b0/0x4f8
Sep 7 17:11:16 simon-desktop kernel: [18133.445329] [] net_rx_action+0xf4/0x358
Sep 7 17:11:16 simon-desktop kernel: [18133.445336] [] __do_softirq+0x13c/0x3b0
Sep 7 17:11:16 simon-desktop kernel: [18133.445348] [] irq_exit+0xd0/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445355] [] __handle_domain_irq+0x6c/0xc0
Sep 7 17:11:16 simon-desktop kernel: [18133.445361] [] gic_handle_irq+0x5c/0xb0
Sep 7 17:11:16 simon-desktop kernel: [18133.445368] [] el1_irq+0xe8/0x194
Sep 7 17:11:16 simon-desktop kernel: [18133.445376] [] tick_nohz_idle_exit+0xe8/0x118
Sep 7 17:11:16 simon-desktop kernel: [18133.445384] [] cpu_startup_entry+0xe8/0x200
Sep 7 17:11:16 simon-desktop kernel: [18133.445395] [] rest_init+0x84/0x90
Sep 7 17:11:16 simon-desktop kernel: [18133.445410] [] start_kernel+0x370/0x384
Sep 7 17:11:16 simon-desktop kernel: [18133.445418] [] __primary_switched+0x80/0x94
Sep 7 17:11:16 simon-desktop kernel: [18133.452546] INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 7 17:11:16 simon-desktop kernel: [18133.452697] 0-…: (2 GPs behind) idle=07b/140000000000002/0 softirq=388653/388653 fqs=2286
Sep 7 17:11:16 simon-desktop kernel: [18133.452838] (detected by 2, t=5252 jiffies, g=74604, c=74603, q=314)
Sep 7 17:11:16 simon-desktop kernel: [18133.452957] Task dump for CPU 0:
Sep 7 17:11:16 simon-desktop kernel: [18133.452962] swapper/0 R running task 0 0 0 0x00000002

Could the GPU be at fault here? Not working well with the DevKit?

As @ynjiun noticed, everything seems to be stable with no network. I will run some CPU/GPU stress tests with no network just to see if stability can be reproduced! :-/ Yeah, we are trying to reproduce stability!

Thanks
Simon