The AGX Xavier keeps rebooting automatically and then goes into recovery mode

The device had been working fine until one particular power-up. Since then it keeps restarting automatically, each time at the following print:

[    1.879915] gpiochip0: registered GPIOs 504 to 511 on max77620-gpio
[    2.018647] max77686-rtc max77620-rtc: registered as rtc0
[    2.019540] max77620-power max20024-power: Event recorder REG_NVERC : 0x0
[    2.019827] max77620 4-003c: max77620 probe successful
debugfs initialized
[    2.854499] tegra-se-elp 3ad0000.se_elp: tegra_se_elp_probe: complete
[    2.855547] hid: raw HID events driver (C) Jiri Kosina
[    2.856238] usbcore: registered new interface driver usbhid
[    2.856379] usbhid: USB HID core driver
[    2.858548] tegra186-cam-rtcpu bc00000.rtcpu: Adding to iommu group 4
[    2.859596] tegra186-cam-rtcpu bc00000.rtcpu: Trace buffer configured at IOVA=0xbff00000
[    22.503665] Camera-FW on t194-rce-safe started
TCU early console enabled.
[    22.578454] Camera-FW on t194-rce-safe ready SHA1=9e9c1f28 (crt 0.775 ms, total boot 75.593 ms)
[    2.945178] tegra-ivc-bus bc00000.rtcpu:ivc-bus: region 0: iova=0xbfec0000-0xbfee01ff size=131584
[    2.945783] tegra-ivc-bus bc00000.rtcpu:ivc-bus:echo@0: echo: ver=0 grp=1 RX[16x64]=0x1000-0x1480 TX[16x64]=0x1480-0x1900
[    2.947027] tegra-ivc-bus bc00000.rtcpu:ivc-bus:dbg@1: dbg: ver=0 grp=1 RX[1x448]=0x1900-0x1b40 TX[1x448]=0x1b40-0x1d80
[    2.948285] tegra-ivc-bus bc00000.rtcpu:ivc-bus:dbg@2: dbg: ver=0 grp=1 RX[1x8192]=0x1d80-0x3e00 TX[1x8192]=0x3e00-0x5
[0000.060] W> RATCHET: MB1 binary ratchet value 4 is larger than ratchet level 2 from HW fuses.
[0000.068] I> MB1 (prd-version: 2.6.0.0-t194-41334769-cab45716)
[0000.073] I> Boot-mode: Coldboot
[0000.076] I> Platform: Silicon

Is there a problem with tegra-ivc-bus? The device reboots right after the tegra-ivc-bus prints every time.
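To check whether every restart really happens at the same point, the serial capture can be scanned for the last kernel message printed before each bootloader banner. This is only a minimal sketch: it assumes every restart prints the `I> MB1 (prd-version` line seen in the log above, and the function name `last_kernel_lines` is made up for illustration.

```python
import re

# Kernel lines carry a "[seconds.microseconds]" stamp with six decimals.
# Bootloader lines such as "[0000.068] I> MB1 ..." use three decimals,
# so they do not match this pattern.
KERNEL_LINE = re.compile(r"\[\s*\d+\.\d{6}\]\s+(.*)")

# Assumption: each restart prints this MB1 banner, as in the log above.
RESTART_MARKER = "I> MB1 (prd-version"


def last_kernel_lines(lines):
    """Return the last kernel message seen before each restart marker."""
    results = []
    last = None
    for line in lines:
        if RESTART_MARKER in line:
            if last is not None:
                results.append(last)
            last = None
            continue
        match = KERNEL_LINE.search(line)
        if match:
            last = match.group(1).strip()
    return results


if __name__ == "__main__":
    sample = [
        "[    2.947027] tegra-ivc-bus bc00000.rtcpu:ivc-bus:dbg@1: dbg: ver=0 grp=1",
        "[0000.060] W> RATCHET: MB1 binary ratchet value 4 is larger than ratchet level 2 from HW fuses.",
        "[0000.068] I> MB1 (prd-version: 2.6.0.0-t194-41334769-cab45716)",
    ]
    for message in last_kernel_lines(sample):
        print(message)
```

For a real capture, pass the opened log file (an iterable of lines) instead of `sample`. If the printed messages are all tegra-ivc-bus lines, that supports the observation, although it only shows where the console output stopped, not what caused the reset.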

After several reboots, the device enters Recovery Mode:

Jetson UEFI firmware (version 6.0-37391689 built on 2024-08-28T08:47:11+00:00)
ESC   to enter Setup.
F11   to enter Boot Manager Menu.
Enter to continue boot.
**  WARNING: Test Key is used.  **
......ASSERT [VariableRuntimeDxe] /out/nvidia/bootloader/uefi/Jetson_RELEASE/edk2/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3264): !(((INTN)(RETURN_STATUS)(Status)) < 0)

L4TLauncher: Attempting Recovery Boot
EFI stub: Booting Linux Kernel...
EFI stub: Using DTB from configuration table
EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Exiting boot services and installing virtual address map...
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x4e0f0040]

What could be causing this device to reboot constantly?

The following is a detailed serial printout of how this phenomenon occurs:
[com COM4] (2025-02-26_113008) COM4 (Prolific PL2303GT USB Serial COM Port (COM4))(1) (1).log (12.3 MB)

Please help me!

Hi,

Please share the full log from before the recovery boot happened.

Recovery boot happens when the system fails to boot up multiple times. So what we need to check is why the device keeps rebooting, not the recovery boot itself.

Hello, please see the attachment above for the detailed serial output. It records all the prints leading into this failure.

Hi,

Let me explain this more directly. What we are asking for is the full log of the failed boot attempts that happened before the "recovery boot".

What you attached is already in recovery boot, and that log looks the same for everyone, because the recovery boot image is a template initrd provided by us. There is no need to check it.
The 353 boot records in your log do not help either, because all of them are recovery boots.
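A quick way to check whether a capture contains anything other than recovery boots is to split it at the UEFI banner and flag each boot that prints the recovery line. This is only a rough sketch based on the two strings visible in the recovery log earlier in this thread. The function name `classify_boots` is hypothetical, and boots that fail before UEFI starts are not counted at all.

```python
# Both markers are copied from the recovery-mode log earlier in this thread.
UEFI_BANNER = "Jetson UEFI firmware (version"
RECOVERY_MARKER = "L4TLauncher: Attempting Recovery Boot"


def classify_boots(lines):
    """Return one bool per UEFI boot attempt: True if it was a recovery boot."""
    boots = []
    for line in lines:
        if UEFI_BANNER in line:
            # A new boot attempt starts; assume non-recovery until proven otherwise.
            boots.append(False)
        elif RECOVERY_MARKER in line and boots:
            boots[-1] = True
    return boots


if __name__ == "__main__":
    sample = [
        "Jetson UEFI firmware (version 6.0-37391689 built on 2024-08-28T08:47:11+00:00)",
        "L4TLauncher: Attempting Recovery Boot",
        "EFI stub: Booting Linux Kernel...",
    ]
    flags = classify_boots(sample)
    print(f"{sum(flags)} of {len(flags)} boots were recovery boots")
```

If every entry comes back `True`, the capture started too late and the failed normal boots are missing from it.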

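We are very sorry, we did not have the serial port connected before the device entered recovery boot.
We will keep the serial port connected, try to reproduce the problem, and then send you the full log of the failed boot attempts preceding the "recovery boot".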

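Hello,
We reproduced the problem and captured the full log of the failed boot attempts preceding the "recovery boot".
We use the AGX Xavier module with L4T R35.6.0 and the A/B Redundancy scheme, with two slots (slot A and slot B). The carrier board is our own custom design, not the official one.

From 2025-03-26 10:00:00 to 2025-03-26 14:13:17, during normal use, we powered the device off and on 9 times. The system booted normally every time and worked correctly.

Without any other abnormal operation, at 2025-03-26 14:43:02, on the 10th power-off and restart, the system booted as far as the following prints and then kept rebooting automatically: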

[2025-03-26 14:43:26]  [    3.667446] tegra186-cam-rtcpu bc00000.rtcpu: Adding to iommu group 4
[2025-03-26 14:43:26]  [    3.668461] tegra186-cam-rtcpu bc00000.rtcpu: Trace buffer configured at IOVA=0xbff00000
[2025-03-26 14:43:26]  [    23.760325] Camera-FW on t194-rce-safe started
[2025-03-26 14:43:26]  TCU early console enabled.
[2025-03-26 14:43:26]  [    23.835581] Camera-FW on t194-rce-safe ready SHA1=9e9c1f28 (crt 0.775 ms, total boot 76.061 ms)
[2025-03-26 14:43:26]  [    3.754088] tegra-ivc-bus bc00000.rtcpu:ivc-bus: region 0: iova=0xbfec0000-0xbfee01ff size=131584
[2025-03-26 14:43:26]  [    3.754659] tegra-ivc-bus bc00000.rtcpu:ivc-bus:echo@0: echo: ver=0 grp=1 RX[16x64]=0x1000-0x1480 TX[16x64]=0x1480-0x1900
[2025-03-26 14:43:26]  [    3.755551] tegra-ivc-bus bc00000.rtcpu:ivc-bus:dbg@1: dbg: ver=0 grp=1 RX[1x448]=0x1900-0x1b40 TX[1x448]=0x1b40-0x1d80
[2025-03-26 14:43:26]  [    3.756406] tegra-ivc-bus bc00000.rtcpu:ivc-bus:dbg@2: dbg: ver=0 grp=1 RX[1x8192]=0x1d80-0x3e00 TX[1x8192

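After 5 automatic reboots, the Active Boot chain switched to 1. In slot B, the kernel still rebooted automatically at the same point. The detailed serial output of the whole process is in attachment 1.
Eventually the device entered recovery mode. The serial output after entering recovery mode is in attachment 2.
After reflashing the BSP, the device boots normally again.
This failure has occurred several times in our products. The previous times we did not capture a usable serial log, but the symptoms were the same as this time.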
Jetson AGX Xavier不断自动重启.log (4.3 MB)
Jetson AGX Xavier进入recovery模式后的打印.log (105.0 KB)

Hi @newbie.lei

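Thanks for providing the log. Can this issue be reproduced easily just by rebooting 10 times, or does it only reproduce when a specific application is running?
From the log above, it looks like the system restarts directly without any kernel crash notification.

Since the problem started, judging by the reset source, every reboot has been caused by the PVA WDT.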

[2025-03-26 14:44:13]  [    0.895307] tegra-pmc: ### PMC reset source: PVA0WDT
[2025-03-26 14:44:13]  [    0.895325] tegra-pmc: ### PMC reset level: L1
[2025-03-26 14:44:13]  [    0.895338] tegra-pmc: ### PMC reset status reg: 0x41
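To see how the reset source is distributed across a long capture, the `tegra-pmc` lines can be tallied with a few lines of Python. This is just a sketch that assumes every boot prints the reset source in the same format as above. The function name `count_reset_sources` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "tegra-pmc: ### PMC reset source: PVA0WDT".
RESET_SOURCE = re.compile(r"tegra-pmc: ### PMC reset source: (\S+)")


def count_reset_sources(lines):
    """Count how often each PMC reset source appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = RESET_SOURCE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[2025-03-26 14:44:13]  [    0.895307] tegra-pmc: ### PMC reset source: PVA0WDT",
        "[2025-03-26 14:44:13]  [    0.895325] tegra-pmc: ### PMC reset level: L1",
    ]
    for source, count in count_reset_sources(sample).most_common():
        print(source, count)
```

Running it over the whole log file shows whether the first failing boot already reports the same source as the later ones.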

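Hello,
We do not know the exact steps to reproduce it either. We have several dozen units, and only occasionally does one of them hit this failure. I have also been running reboot tests for two weeks without reproducing it. The reproduction rate seems low, and it does not appear to be tied to a specific app.
Yes, we also noticed that the system restarts without any Oops, and we have no direction to start investigating, so we are asking for your advice.

What exactly does "PMC reset source: PVA0WDT" mean? When is this triggered? Can this information point us in a general direction?
If you have any hypotheses that you need us to verify, please let us know.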

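Hi,
Do you have any suggestions on where to investigate?
If you have a hypothesis you would like us to verify, please let us know.
Thank you very much!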

When your use case is running, does tegrastats show the PVA as active?
