Xavier AGX boot failure: ASSERT

Hello,

I have been using an Xavier AGX 32G with a Connect Tech Rogue carrier card for going on a year with no issues. The system has been very stable and not exhibited any problems either booting or at run time. I upgraded to the latest Jetpack 5.0.2 Ubuntu 20.04 based OS about a month ago and have not experienced any problems. The system seems stable.

Xavier AGX 32G
Rogue AGX-101
Jetpack 5.0.2
ConnectTech JetPack 5.0.2 - L4T r35.1.0 BSP

I do regular apt update and apt upgrade runs, as well as reboots, as part of my typical usage, again with no issues. Yesterday I did an apt update/upgrade followed by a reboot and the AGX did not boot again; I tried several power cycles with no success. After several more attempts I connected the debug UART and can see the boot loader attempting to boot the AGX, but failing with what I think is an ASSERT.

Note

I do have some serial devices connected to the 2 ttyTHS* UARTs, but have also disabled the getty in the OS.

crw-rw---- 1 root dialout 238, 0 Apr 15 18:25 /dev/ttyTHS0
crw-rw---- 1 root dialout 238, 1 Apr 15 18:25 /dev/ttyTHS1
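One way to double-check that no getty is still attached to these ports is to scan the running processes. The following is only a minimal sketch in Python, assuming a standard Linux `/proc` layout; the function names `getty_ports` and `read_cmdlines` are made up for this illustration and are not part of any Jetson tooling.

```python
from pathlib import Path


def getty_ports(cmdlines, ports=("ttyTHS0", "ttyTHS1")):
    """Return the ports that appear as an argument of a getty-like process.

    cmdlines is a list of argument lists, one per process.
    """
    busy = set()
    for args in cmdlines:
        if not args:
            continue
        # Match agetty, getty, mingetty, etc. by executable name.
        if "getty" not in Path(args[0]).name:
            continue
        for port in ports:
            if any(a == port or a.endswith("/" + port) for a in args[1:]):
                busy.add(port)
    return sorted(busy)


def read_cmdlines(proc=Path("/proc")):
    """Read the argument list of every process visible under /proc."""
    result = []
    if not proc.is_dir():
        return result
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            # The process exited or is not readable; skip it.
            continue
        # /proc/<pid>/cmdline separates arguments with NUL bytes.
        result.append([p.decode(errors="replace") for p in raw.split(b"\0") if p])
    return result
```

Calling `getty_ports(read_cmdlines())` on the running system returns an empty list when no getty process references either port. This only covers the OS side; it says nothing about what the boot loader does with the UARTs before Linux starts.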

I also removed the serial devices but the AGX still remains stuck in the boot loop.

I have some typical software installed on the AGX, CUDA etc., nothing out of the ordinary and no boot loader or kernel modifications. It’s actually a pretty stock configuration beyond what Connect Tech needs to do to get their BSP in place. Nothing has changed over the last month beyond the fresh Jetpack 5.0.2 install. If I power cycle the AGX enough times it can boot; I’d say maybe 1 out of 40 attempts it boots, and the rest of the attempts it gets stuck in the boot loader loop. After either a software reboot or a power cut, it shows the same no-boot behavior again. The boot loader goes into a loop, tries for some number of times, then gives up. I attached a keyboard to try to get into the boot loader menu, but the AGX was not responsive to the keys at that point.

Has anyone else experienced this behavior on an AGX? I do not see any correlation between recent reboots or apt upgrades, but maybe I am missing something. I’ve ensured that the flash eMMC that contains the OS is not full, it has around 25% free space on it.

└─ $ ▶ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/mmcblk0p1   28G   21G  5.2G  81% /
none             16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           3.1G   18M  3.1G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme1n1p1  916G  2.8G  867G   1% /mnt/data2
/dev/nvme0n1p1  916G  210G  661G  25% /mnt/data
tmpfs           3.1G   16K  3.1G   1% /run/user/124
tmpfs           3.1G  8.0K  3.1G   1% /run/user/1000

For all intents and purposes the AGX just seems to have gotten into a bad state. The strange thing is that it still manages to boot about once in every 40 or 50 attempts.

Any input would be appreciated. Re-flashing the AGX is always an option, but not ideal. I have a lot of time invested in this image and would be concerned that the AGX will get into this same state again in the future. Since this is part of an autonomous system, that is not a good option.

Please see attached log.

thank you
20221214_agx_rogue_111_failure_to_boot.log (265.6 KB)

Hello,

I have been able to narrow down this issue, at least on one AGX module, to the serial devices connected to THS0 and THS1. The system boots OK if I remove the devices. With them connected, I can observe that the boot loader detects incoming characters, which disrupts the boot process. I can see the AGX reacting to the characters from the serial devices. For example:

This is the option     
one adjusts to change  
the language for the   
current system         
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais
\----------------------/
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                     
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                         
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                              
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                          
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                             
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais
\----------------------/ 
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                 
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                 
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais
\----------------------/          
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                              
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                    
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais\----------------------/                                                                                                                                           
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais
\----------------------/
/----------------------\||||||||Standard EnglishStandard Fran?aisEnglishFran?ais

Complete Entry    ^v=Move Highlight
 Esc=Exit Entry                                                                                                                            
^v=Move Highlight 
 <Enter>=Select Entry   
                                                                                   
<Standard English>            Select Language                                                                        
>Device Manager                                          
 >Boot Manager                                             
>Boot Maintenance Manager                                                                                         
 Continue                                                
 Reset  

If I understand correctly, the first and second stage boot loaders MB1/MB2 do not look at the standard serial ports; they only interact with the shared micro USB serial port. The only thing connected to this port is a workstation with minicom so I can observe the boot process. This issue exists even without minicom connected.
I thought the only requirement was to disable the getty in the operating system, which I have done. I have had this same configuration for a long time, the only difference being that I upgraded to the Jetpack 5.0.2 Ubuntu 20.04 based OS. Is it possible that in this Jetpack release a console also needs to be disabled in the boot loader stages? I’ve read through the documentation and looked at the configuration files, and thus far only see mention of the shared micro USB UART.

Any input will be appreciated

Hi bruce4243,

From your serial console log, there’s an assertion which indicates that an invalid state occurred in UEFI.

.ASSERT [TerminalDxe] /dvs/git/dirty/git-master_linux/out/nvidia/bootloader/uefi/Jetson_RELEASE/edk2/MdeModulePkg/Universal/Console/TerminalDxe/TerminalConIn.c(2078):

It looks like some messages are being sent to the board through the UART and are interrupting the boot process.

Could you try setting the auto boot timeout to 0s to check whether that helps?
Boot Maintenance Manager → Auto Boot Time-out → Change value to 0 → Press F10 to Save

Hello,

I was able to get into the Xavier AGX Boot Maintenance Manager and set the timeout to 0 seconds. This seems to have had a positive effect; however, the boot process does sometimes still see traffic on the serial ports and forces a reboot to occur.

Here is an example from the console output with the timeout set to 0 seconds, and the same serial devices connected that cause it to reboot during a power on. The full log is attached to this issue.

Jetson UEFI firmware (version 1.0-d7fb19b built on 2022-08-10T20:18:13-07:00)
Press ESCAPE for boot options 
**  WARNING: Test Key is used.  **
ASSERT [TerminalDxe] /dvs/git/dirty/git-master_linux/out/nvidia/bootloader/uefi/Jetson_RELEASE/edk2/MdeModulePkg/Universal/Console/TerminalDxe/TerminalConIn.c(2078): ((BOOLEAN)(0==1))

Resetting the system in 5 seconds.
ÿäÿâShutdown state requested 1
Rebooting system ...
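To quantify how often this happens across power cycles, a captured console log can be scanned for the firmware banner and the assertion. Below is a minimal sketch in Python, assuming the log is a plain-text capture of the debug UART (for example from minicom); the function name `summarize_boot_log` and both regular expressions are my own and are based only on the lines shown above.

```python
import re

# Banner printed once each time the UEFI firmware starts.
BANNER_RE = re.compile(
    r"Jetson UEFI firmware \(version (?P<version>\S+) built on (?P<built>[^)]+)\)"
)

# Assertion line of the form: ASSERT [Module] /path/to/File.c(1234): expression
ASSERT_RE = re.compile(
    r"ASSERT \[(?P<module>[^\]]+)\] (?P<path>\S+)\((?P<line>\d+)\):"
)


def summarize_boot_log(text):
    """Count UEFI starts and list (module, file name, line) for each ASSERT."""
    uefi_starts = sum(1 for _ in BANNER_RE.finditer(text))
    asserts = [
        (m["module"], m["path"].rsplit("/", 1)[-1], int(m["line"]))
        for m in ASSERT_RE.finditer(text)
    ]
    return {"uefi_starts": uefi_starts, "asserts": asserts}
```

A log file can be read with `open(path, errors="replace").read()` so that stray non-text bytes from the UART do not abort the scan. Comparing the number of UEFI starts with the number of assertions gives a rough failure rate per boot attempt.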

With the new timeout set to 0 seconds I have not had the boot fail like it was before, but it does sometimes see traffic that causes it to reboot itself. It has only happened once or twice, but it does occur. This raises the question: is it possible to actually disable the serial console altogether, not via a timeout, but so that the boot process is never interrupted at all? For the system I am building I need to ensure that the AGX boots every time, regardless of the traffic on the serial port. It cannot get stuck in the boot loop and possibly fail.
I have looked over the GitHub page with the boot loader source code, but find it does not offer a lot of top level documentation on how the console is treated.
Can you provide some documentation on what the serial console is used for, and how, during all boot stages? Also, if possible, a way to completely disable the console during boot time. For example, on the TX2, which used U-Boot, I simply rebuilt U-Boot without console support enabled and this same issue went away.

thank you for your assistance
20230104_xavier_agx_boot_fail.log (20.9 KB)

If the reboot issue caused by UEFI still exists after setting the auto boot timeout to 0, you could try rebuilding UEFI to skip the following assertion as a workaround.
https://github.com/NVIDIA/edk2/blob/main-edk2-stable202208/MdeModulePkg/Universal/Console/TerminalDxe/TerminalConIn.c#L2078

All debug messages are output to the combined UART.
You could refer to the following link for more information:
Tegra Combined UART and the tcu_muxer Utility (nvidia.com)

You could refer to the following thread to try disabling the combined UART, but we have not verified the use case of disabling the combined UART.
Repurpose Debug UART for 'normal' comms on Xavier NX - #10
