We are stress testing several AGX Orin units by power cycling them repeatedly without warning (abrupt power loss, no clean shutdown). We realize this is not a friendly thing to do, but our application requires it. After a few hundred cycles (more than 200 but fewer than 1000), the machines eventually stop booting.
On a unit that no longer boots, the serial console shows the following output, ending in an assertion failure:
[2023-02-12 13:35:10.201831] ESC to enter Setup.
[2023-02-12 13:35:10.201849] F11 to enter Boot Manager Menu.
[2023-02-12 13:35:10.201867] Enter to continue boot.
[2023-02-12 13:35:10.201884] **********************************
[2023-02-12 13:35:10.217694] ** WARNING: Test Key is used. **
[2023-02-12 13:35:10.217801] **********************************
[2023-02-12 13:35:10.217824] ** WARNING: Test Key is used. **
[2023-02-12 13:35:15.209196] ......PROGRESS CODE: V03051007 I0
[2023-02-12 13:35:15.513211] <FF><E4>
[2023-02-12 13:35:15.529519] ASSERT [VariableStandaloneMm] /dvs/git/dirty/git-master_linux/out/nvidia/optee.t234/uefi/StandaloneMmOptee_RELEASE/edk2/MdeModulePkg/Universal/Variable/RuntimeDxe/Variable.c(3255): !(((INTN)(RETURN_STATUS)(Status)) < 0)
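For anyone trying to reproduce this, here is a minimal sketch of how the failing cycle could be flagged automatically in captured console logs. It is not the harness we actually run. It assumes each log line carries a bracketed timestamp prefix like the output above and that the failure always appears as an `ASSERT [module] file(line):` line. The names `UefiAssert` and `find_uefi_assert` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class UefiAssert(NamedTuple):
    """Fields pulled out of one ASSERT line (hypothetical helper type)."""
    timestamp: str
    module: str
    source: str
    line: int


# Assumed line shape: "[<timestamp>] ASSERT [<module>] <path>(<line>): <expr>"
ASSERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+ASSERT \[(?P<module>[^\]]+)\]\s+"
    r"(?P<source>\S+)\((?P<line>\d+)\):"
)


def find_uefi_assert(log_text: str) -> Optional[UefiAssert]:
    """Return the first ASSERT found in a captured console log, or None."""
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            return UefiAssert(
                timestamp=match["ts"],
                module=match["module"],
                source=match["source"],
                line=int(match["line"]),
            )
    return None
```

Running this over the log captured for each power cycle would show which cycle first hit the assertion and whether it is always the same module and source line.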
We have managed to press ESC, enter Setup, and look through the menus, but we have not found any way to get the system to boot from there.
Can anyone help us identify the error and avoid this failure mode in the future?