How to detect at boot time when a boot slot becomes unbootable

Hello,

I am facing an issue where, after many reboots (hundreds to thousands), the system swaps to the other slot (so ROOTFS_AB=1), and nvbootctrl shows that the former slot is now unbootable.

However, I would like to find the underlying reason for this phenomenon in order to narrow it down. Is it possible to see some traces from the early boot stages, e.g. mb1/mb2? Is the reason stored somewhere that can be read back at a later time?

Is it possible to provide mb1/mb2 binaries with more logs enabled? Or maybe via some option in the DTS?

This is for 35.6.2

Thanks

hello sebastien.schertenleib,

may I know what your test steps are?
for instance,
did you stay on slot-A for the reboot stress test (may I also know the detailed steps/commands), and see slot-A crash and then fall back to slot-B after 1000 reboot cycles?

BTW, please also see Topic 315628 for the patches to fix UEFI assertion issues.

Hi,

Yes, I am doing either cold or warm reboots, staying on, let’s say, slot A. After the system has been up for ~1 min, a script asks for a reboot or an external device cuts the power. After a while, which can be hundreds or thousands of reboot cycles, the system suddenly boots on slot B. If I carry on, it will, after a while, go back to slot A.
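
To catch the swap on the boot where it happens, the test loop could record the slot at every boot and compare it with the previous one. The following is only a sketch: `record_slot`, the state file and the log file are hypothetical names, and the current slot is passed in as an argument so the function does not depend on any particular tool.

```shell
# Hypothetical helper: remember the slot seen at each boot and flag a change.
# Usage: record_slot <current_slot> <state_file> <log_file>
# Prints "first" when there is no previous record, "same" when the slot is
# unchanged, and "changed" when it differs from the previous boot.
record_slot() {
    current="$1"
    state_file="$2"
    log_file="$3"
    result="first"

    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
        if [ "$previous" = "$current" ]; then
            result="same"
        else
            result="changed"
            # Keep a timestamped trace so the failing cycle can be located later.
            printf '%s slot changed: %s -> %s\n' \
                "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$previous" "$current" >> "$log_file"
        fi
    fi

    # Store the slot for the comparison on the next boot.
    printf '%s\n' "$current" > "$state_file"
    printf '%s\n' "$result"
}
```

On the target, the first argument could come from something like `nvbootctrl get-current-slot` (assuming that subcommand is available in this release), with the state and log files kept on storage that is not part of either rootfs slot. When the function prints `changed`, the test script could stop rebooting so the slot state and serial logs of that cycle can be inspected.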

I want to be able to understand where it fails in the boot chain (mb1, mb2…) and ideally what the reason for the change of slot was, since slot A is still usable after the phenomenon (by waiting for another boot slot swap or via nvbootctrl).

The logs from mb1 and mb2 do not seem to provide enough information. Is there a way to get custom binaries? We can contact NVIDIA via our local contact, in the same way we did for the FSKP, as I understand you might not want to upload them here.

Can you ask internally whether such a custom build might be possible?

In the meantime, I will check your link for those assertion issues.

Thanks.

hello sebastien.schertenleib,

this is a cold reboot, it’s the same as using the hardware reset button to restart the system.

since you’re cutting the power within a minute, it might be a timing issue where a background service has not completed before the system shuts down.

you may refer to the developer guide, Rootfs Selection.
for verification, please add a check of these two background services, l4t-rootfs-validation-config.service and nv-l4tbootloader-config.service, before cutting off the power.

Thanks for the heads-up. I will apply the proposed fix for those UEFI assertions and I will have a look at those services.

However, a warm reset should be fine, or are you suggesting that some of those services might not always shut down gracefully?

hello sebastien.schertenleib,

yes, we’ve experienced some file system issues when powering off shortly after power-on.

Hello,

Looking into nv-l4tbootloader-config.service and its associated script, I see that at the end it calls the following:

    # If verify_boot_status is set, call nvbootctrl to verify the boot status.
    if [ "${verify_boot_status}" = "1" ]; then
        echo "Info. Verifying boot status."
        nvbootctrl verify
    fi

This triggers the following:
Info: variable BootChainFwStatus is not found.
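
To check independently of nvbootctrl whether the variable exists, something like the sketch below could be used. It assumes UEFI variables are exposed through efivarfs at /sys/firmware/efi/efivars, where entries are named `<VariableName>-<VendorGUID>`; `find_efi_variable` is a hypothetical helper name, and no specific GUID is assumed.

```shell
# Hypothetical helper: list efivarfs entries whose name matches a UEFI
# variable, e.g. BootChainFwStatus. The directory is a parameter so the
# function can also be pointed at a saved copy of the efivars directory.
# Prints the matching entry names; returns 1 if none exist.
find_efi_variable() {
    name="$1"
    dir="${2:-/sys/firmware/efi/efivars}"
    found=1
    for entry in "$dir/$name"-*; do
        # An unmatched glob is left as a literal string, so test for existence.
        [ -e "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
        found=0
    done
    return "$found"
}
```

For example, `find_efi_variable BootChainFwStatus || echo "not present"` run right after boot, before nv-l4tbootloader-config.service has run, would show whether the variable was present on that boot.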

What do I need to do to have this variable set? Is it critical, and is it supported with 35.6.2?

Also, you mentioned file system issues when powering off shortly after power-on. Could this happen if the power is cut before the following is executed?

modprobe -r mtdblock

Thanks

hello sebastien.schertenleib,

you should check sudo systemctl status l4t-rootfs-validation-config.service before cutting off the power.
the system should wait until it shows status=0/SUCCESS before shutting down.
for instance,

    Process: 415 ExecStart=/opt/nvidia/l4t-rootfs-validation-config/l4t-rootfs-validation-config.sh (code=exited, status=0/SUCCESS)
   Main PID: 415 (code=exited, status=0/SUCCESS)
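
The check above could be automated in the stress-test script along these lines. This is a sketch under assumptions: it matches the human-readable `Main PID` line shown in the sample output, which may differ between systemd versions, and `status_reports_success` and `wait_for_boot_services` are hypothetical names.

```shell
# Returns 0 when `systemctl status` text read from stdin reports that the
# unit's main process exited with status 0, as in the sample output.
status_reports_success() {
    grep -Eq 'Main PID: [0-9]+ \(code=exited, status=0/SUCCESS\)'
}

# Hypothetical wrapper: poll both units once per second until each reports
# success or the timeout (in seconds, default 120) expires.
# Returns 0 only if every unit succeeded in time.
wait_for_boot_services() {
    timeout="${1:-120}"
    for unit in l4t-rootfs-validation-config.service \
                nv-l4tbootloader-config.service; do
        waited=0
        until systemctl status --no-pager "$unit" 2>/dev/null | status_reports_success; do
            [ "$waited" -ge "$timeout" ] && return 1
            sleep 1
            waited=$((waited + 1))
        done
    done
    return 0
}
```

The reboot or power-cut request would then only be issued after `wait_for_boot_services` returns 0, and a non-zero return could be logged as a test cycle where the services did not finish.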

Hello,

Thanks for the suggestion. I will prepare a new build with the fixes for the UEFI assertion and make sure I wait until status=0/SUCCESS. As this phenomenon is random, I will let the devkit run over the weekend.

Is this still an issue that needs support? Are there any results you can share?

Hello,

We are still investigating. It would be very useful to have mb1 and mb2 binaries with more verbose logging. Should we contact our NVIDIA representative directly for this request, or is it something that can be sorted out in this channel?

Thanks.