AGX Xavier resets even when idle

My Xavier keeps resetting itself. This happens even with the original power supply, and even when the system is completely idle.

Please dump the log when reset happens.

FYI, the way to dump the log is to run "dmesg --follow" in the serial console (using the micro-B USB cable) with logging enabled.

Thanks, I was able to connect the serial port and observe the system behavior in minicom. Here are the lines before the reboot. Note that there are many lines like 901-904 before line 901.
Only a USB Wi-Fi adapter and one NVMe SSD are connected to the Xavier. The system was completely idle.

901 [ 4814.354223] ata1: SError: { CommWake DevExch }
902 [ 4814.354360] ata1: limiting SATA link speed to 1.5 Gbps
903 [ 4816.312499] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
904 [ 4816.312677] ata1: irq_stat 0x80000040, connection status changed
905 [ 4816.312791] ata1: SError: { CommWake DevExch }
906 [ 4816.312923] ata1: limiting SATA link speed to 1.5 Gbps
907 [ 4817.349464] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
908 [ 4817.349640] ata1: irq_stat 0x80000040, connection status changed
909 [ 4817.349757] ata1: SError: { CommWake DevExch }
910 [ 4817.349883] ata1: limiting SATA link speed to 1.5 Gbps
911 [ 4818.371356] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
912 [ 4818.371530] ata1: irq_stat 0x80000040, connection status changed
913 [ 4818.371645] ata1: SError: { CommWake DevExch }
914 [ 4818.371773] ata1: limiting SATA link speed to 1.5 Gbps
915 [ 4823.365381] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2713 [ERR] semaphore acquire timeout!
916 [ 4823.365606] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 509
917 [ 4823.454489] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
918 [ 4823.454667] ata1: irq_stat 0x80000040, connection status changed
919 [ 4823.454860] ata1: SError: { CommWake DevExch }
920 [ 4823.454992] ata1: limiting SATA link speed to 1.5 Gbps
921 [ 4824.451878] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
922 [ 4824.452065] ata1: irq_stat 0x80000040, connection status changed
923 [ 4824.452180] ata1: SError: { CommWake DevExch }
924 [ 4824.452311] ata1: limiting SATA link speed to 1.5 Gbps
925 [ 4824.998700] INFO: rcu_sched detected stalls on CPUs/tasks:
926 [ 4824.998863] 0-…: (1 GPs behind) idle=f99/140000000000002/0 softirq=74216/74276 fqs=2127
927 [ 4824.999047] (detected by 2, t=5252 jiffies, g=2932, c=2931, q=6)
928 [ 4825.465074] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
929 [ 4825.465248] ata1: irq_stat 0x80000040, connection status changed
930 [ 4825.465392] ata1: SError: { CommWake DevExch }
931 [ 4825.465522] ata1: limiting SATA link speed to 1.5 Gbps
932 [ 4826.604457] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
933 [ 4826.604639] ata1: irq_stat 0x80000040, connection status changed
934 [ 4826.604756] ata1: SError: { CommWake DevExch }
935 [ 4826.604889] ata1: limiting SATA link speed to 1.5 Gbps
936 [ 4827.494300] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2713 [ERR] semaphore acquire timeout!
937 [ 4827.494508] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 509
938 [ 4831.538580] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2713 [ERR] semaphore acquire timeout!
939 [ 4831.538799] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 24 for ch 509
940 [ 4832.326523] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:11009]
941 [ 4832.327296] Kernel panic - not syncing: softlockup: hung tasks
942 [ 4832.327448] CPU: 0 PID: 11009 Comm: kworker/0:0 Tainted: G C L 4.9.140-tegra #1
943 [ 4832.327600] Hardware name: Jetson-AGX (DT)
944 [ 4832.327694] Workqueue: events rtcpu_trace_worker
945 [ 4832.327791] Call trace:
946 [ 4832.327881] [] dump_backtrace+0x0/0x198
947 [ 4832.327984] [] show_stack+0x24/0x30
948 [ 4832.328107] [] dump_stack+0x98/0xc0
949 [ 4832.328235] [] panic+0x11c/0x298
950 [ 4832.328333] [] watchdog_unpark_threads+0x0/0x98
951 [ 4832.328469] [] __hrtimer_run_queues+0xd8/0x360
952 [ 4832.328614] [] hrtimer_interrupt+0xa8/0x1e0
953 [ 4832.328719] [] arch_timer_handler_phys+0x38/0x58
954 [ 4832.328862] [] handle_percpu_devid_irq+0x90/0x2b0
955 [ 4832.329086] [] generic_handle_irq+0x34/0x50
956 [ 4832.329520] [] __handle_domain_irq+0x68/0xc0
957 [ 4832.331460] [] gic_handle_irq+0x5c/0xb0
958 [ 4832.336714] [] el1_irq+0xe8/0x18c
959 [ 4832.341444] [] rtw_check_bcn_info+0x12c/0x7c8 [r8188eu]
960 [ 4832.348340] [] report_add_sta_event+0x330/0xf00 [r8188eu]
961 [ 4832.355248] [] rtw_roaming+0x129c/0x1510 [r8188eu]
962 [ 4832.361380] [] mgt_dispatcher+0x180/0x298 [r8188eu]
963 [ 4832.368023] [] rtw_recv_entry+0x368/0xa38 [r8188eu]
964 [ 4832.374338] [] rtl8188eu_recv_tasklet+0xc8/0x650 [r8188eu]
965 [ 4832.381511] [] tasklet_action+0x70/0x108
966 [ 4832.387020] [] __do_softirq+0x13c/0x3b0
967 [ 4832.392536] [] irq_exit+0xd0/0x118
968 [ 4832.397265] [] __handle_domain_irq+0x6c/0xc0
969 [ 4832.403297] [] gic_handle_irq+0x5c/0xb0
970 [ 4832.408460] [] el1_irq+0xe8/0x18c
971 [ 4832.413451] [] queue_delayed_work_on+0x50/0x88
972 [ 4832.419747] [] rtcpu_trace_worker+0x3c/0x48
973 [ 4832.425431] [] process_one_work+0x1e4/0x4b0
974 [ 4832.431031] [] worker_thread+0x50/0x4c8
975 [ 4832.436458] [] kthread+0xec/0xf0
976 [ 4832.441100] [] ret_from_fork+0x10/0x40
977 [ 4832.446876] SMP: stopping secondary CPUs
978 [ 4832.450564] Kernel Offset: disabled
979 [ 4832.454219] Memory Limit: none
980 [ 4832.457629] trusty-log panic notifier - trusty version Built: 22:43:40 Dec 9 2019 [ 4832.480843] Rebooting in 5 sec.
981 Shutdown state requested 1
982 Rebooting system …
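To see the order of events in a long capture like the one above, the messages can be tallied by type. The following is a minimal sketch, not a diagnostic tool: the regular expressions are taken from the messages in this log, while the category names and the `summarize` helper are made up for illustration. It only counts matches and records the first timestamp at which each type appears.

```python
import re
from collections import Counter

# Patterns copied from messages in the log above; the category names are hypothetical.
PATTERNS = {
    "sata_link_exception": re.compile(r"ata\d+: exception Emask"),
    "gpu_semaphore_timeout": re.compile(r"nvgpu: .*semaphore acquire timeout"),
    "rcu_stall": re.compile(r"rcu_sched detected stalls"),
    "soft_lockup": re.compile(r"BUG: soft lockup"),
    "kernel_panic": re.compile(r"Kernel panic"),
}

# Kernel timestamps look like "[ 4816.312499]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def summarize(lines):
    """Return (counts per category, first timestamp seen per category)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = TIMESTAMP.search(line)
        stamp = float(match.group(1)) if match else None
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                if name not in first_seen and stamp is not None:
                    first_seen[name] = stamp
    return counts, first_seen


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
903 [ 4816.312499] ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
915 [ 4823.365381] nvgpu: 17000000.gv11b gk20a_fifo_handle_pbdma_intr_0:2713 [ERR] semaphore acquire timeout!
925 [ 4824.998700] INFO: rcu_sched detected stalls on CPUs/tasks:
940 [ 4832.326523] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:0:11009]
941 [ 4832.327296] Kernel panic - not syncing: softlockup: hung tasks
""".splitlines()

counts, first_seen = summarize(SAMPLE)
for name in PATTERNS:
    print(name, counts[name], first_seen.get(name))
```

For a full capture, pass the lines of the saved file instead of `SAMPLE`, for example `summarize(open("minicom.log", errors="replace"))`.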

I don’t know enough to debug this, but you have a lot of this:
ata1: SError: { CommWake DevExch }

I saw one reference suggesting a bad cable, but another reference seems more likely to apply: the errors can be a result of kernel code needing an update for a newer hardware revision of the ARM CPU core. I cannot guarantee this is the issue, but wanted to point it out for those looking at it (notice at the link that they are working on ARMv8.2, but I’m not sure whether this applies to the Xavier):
https://patchwork.kernel.org/project/kvm/patch/1490869877-118713-8-git-send-email-xiexiuqi@huawei.com/

The link you shared is well beyond my knowledge!
In my case a soft lockup is leading to a panic. Do you think the soft lockup is due to `ata1: SError: { CommWake DevExch }`?
I checked, and the lockup panic settings on the Xavier and on my PC are completely different:

Xavier:
$ sysctl kernel.softlockup_panic kernel.hardlockup_panic
kernel.softlockup_panic = 1
kernel.hardlockup_panic = 1

PC:
$ sysctl kernel.softlockup_panic kernel.hardlockup_panic
kernel.softlockup_panic = 0
kernel.hardlockup_panic = 0

Should I go ahead and force these to 0 on the Xavier too, or is that risky?
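For context on the settings compared above: `kernel.softlockup_panic` only controls what the kernel does after it has detected a soft lockup (panic, or just log a warning). Setting it to 0 would not remove the lockup itself. The detection threshold is twice `kernel.watchdog_thresh`, which is 20 seconds with the usual default of 10 and is consistent with the "stuck for 22s" message in the log. The sketch below reads these values from `/proc/sys`; the helper names are made up for illustration, and a setting that a given kernel does not expose is reported as unavailable rather than guessed.

```python
from pathlib import Path

SYSCTL_ROOT = Path("/proc/sys")


def read_sysctl(name, root=SYSCTL_ROOT):
    """Return a sysctl value such as 'kernel.softlockup_panic', or None if it cannot be read."""
    # For these kernel.* names, the dotted name maps directly to a path under /proc/sys.
    path = Path(root).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def lockup_settings(root=SYSCTL_ROOT):
    """Collect the lockup-related settings discussed in this thread."""
    names = (
        "kernel.softlockup_panic",
        "kernel.hardlockup_panic",
        "kernel.watchdog_thresh",
    )
    return {name: read_sysctl(name, root) for name in names}


if __name__ == "__main__":
    for name, value in lockup_settings().items():
        shown = value if value is not None else "not available"
        print(f"{name} = {shown}")
```

Running this on both machines gives the same comparison as the `sysctl` output above, plus the watchdog threshold.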

Hi,

It looks like the NVMe drive is the cause. Could you share the full dmesg output here as a text file attachment?

Also, what NVMe disk are you using?

Hi.
The NVMe drive is a 512GB WD (WDS500G2X0C).
I re-flashed my Jetson and now the warnings about NVMe are gone. However, the system still reboots! I can observe the unit’s activity with minicom on my PC, and that’s how I created the log I shared above. Please tell me how to create a dmesg log from the Xavier. When I run dmesg --follow >> mylog on my PC, it writes the contents of the PC’s dmesg. I can’t type anything inside minicom. Do you want me to run the command on the Xavier after a reset?

Attached please find the log from minicom (the reboot happens around line 659) and the dmesg log from the Xavier after the reset happened. It looks like a stall on CPU 0 is causing the reboot.

I believe the unit is defective. Let me know your assessment so I can continue to work with NVIDIA support to return the unit. My project has been stopped because of this issue!
minicom.log (97.3 KB) xavier_dmesg.log (71.7 KB)
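On the question above about `dmesg` writing the PC’s log: `dmesg` prints the kernel ring buffer of the machine it runs on, so it has to be run on the Xavier itself to capture the Xavier’s messages. The ring buffer also starts fresh on every boot, so a dump taken after a reset shows the new boot, while the serial capture is what holds the messages leading up to the panic. Below is a minimal sketch for saving the output to a file, assuming it is run in a terminal on the Xavier; the function name and the output file name are only illustrative.

```python
import subprocess
from pathlib import Path


def save_command_output(output_path, command=("dmesg",)):
    """Run `command`, write its stdout to `output_path`, and return the number of lines written."""
    # check=True raises CalledProcessError if the command fails (for example, on a
    # system where reading the kernel log requires root privileges).
    result = subprocess.run(list(command), capture_output=True, text=True, check=True)
    Path(output_path).write_text(result.stdout)
    return len(result.stdout.splitlines())


# Example, to be run on the Xavier:
#     save_command_output("xavier_dmesg.log")
```

The `command` parameter exists only so the helper can be tried with a harmless command first; the default runs plain `dmesg`.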

Did you run any application on it before the reboot happened?

NoMachine and maybe Chrome.

The reboots are happening even without running these applications.

Could you share the steps you used to flash the board?

Standard steps:
I connected the PC and the Xavier with a USB-C cable, connected a LAN cable to the Xavier, launched SDK Manager on the PC, put the Xavier in flash mode, and followed the steps …

OK. It looks like this one needs an RMA.