L4T 5.1 reboot loop after enabling watchdog with RootFS A/B

The watchdog should be enabled by the overlay dtbo and odmdata. We’ve verified that it works on the Xavier NX devkit with eMMC. We are looking into the difference between the two modules.

Hey @KevinFFF

You’re right: on the eMMC module the watchdog is enabled by default, despite the dtb saying the opposite.

Now let’s get back to the original topic.

With eMMC, I get 3 kernel panics followed by a reboot. The system then says it is switching the boot chain.
After that, as with the SD module, the system starts to reboot endlessly without switching to the other slot.

After 11 reboots I stopped watching… Surely that’s not how it is meant to work?

Sorry to hijack this thread, but I am facing the same issue as @seeky15 in Jetpack 5.1 on AGX Xavier. The root cause is therefore probably the same on both boards.

I too flash an A/B file system, then boot into slotB, corrupt slotA, and use nvbootctrl to boot slotA. The system goes into a kernel panic (as it should), and then after three retries, it keeps rebooting endlessly without switching over to the backup slot.

Could you provide the full serial console log for further checking?
Please capture it starting from the point where you corrupt the rootfs.

Here’s the log for AGX-Xavier.
AGX_Xavier_Serial_Console_Log.txt (428.6 KB)

I booted into slotA, corrupted slotB, then rebooted. The log shows what happens after the reboot command.

The same happens when I boot into slotB, corrupt slotA, then reboot. (That is, the same behavior is seen when going from slotA to slotB or from slotB to slotA).

Also, I tried corrupting the slot using the dd command, and by removing the entire rootfs. In all cases, I get the same outcome: the system retries three times, then goes into an endless rebooting cycle.

For reference (so that the issue can be reproduced at your end), the dd commands used were:

sudo dd if=/dev/zero of=/dev/mmcblk0p1 bs=1024k seek=10 count=40 (For corrupting slotA from slotB)
sudo dd if=/dev/zero of=/dev/mmcblk0p2 bs=1024k seek=10 count=40 (For corrupting slotB from slotA)
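
As a side note on these commands, the byte range that dd overwrites follows from bs, seek and count. The sketch below is only an illustration (the helper names `parse_size` and `dd_zeroed_range` are made up for this example, and it assumes the GNU dd convention that a `k` suffix means 1024 bytes). It shows that seek=10 with bs=1024k skips the first 10 MiB of the partition, not 10 KB, and then zeroes the next 40 MiB.

```python
def parse_size(text):
    """Parse a dd-style size such as '1024k' or a plain byte count.

    Assumes binary multipliers (k/K = 1024, M = 1024**2, G = 1024**3).
    """
    suffixes = {"k": 1024, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if text[-1] in suffixes:
        return int(text[:-1]) * suffixes[text[-1]]
    return int(text)


def dd_zeroed_range(bs, seek, count):
    """Return (start, end) byte offsets written in the output device.

    seek skips `seek` output blocks, then `count` blocks are written.
    """
    block = parse_size(bs)
    start = seek * block
    end = start + count * block
    return start, end


start, end = dd_zeroed_range("1024k", seek=10, count=40)
mib = 1024 ** 2
print(f"zeroed range: {start // mib} MiB to {end // mib} MiB")
```

With the values from the commands above, the first 10 MiB of the partition are left intact and the region from 10 MiB to 50 MiB is overwritten.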

Hi sanaurrehman,

Thanks for your log.
We’ve found the root cause of the “endless rebooting” issue. There’s a bug in UEFI; we have a fix, but it needs more verification and testing.
At the moment, your use case will not trigger a recovery back to slot A.

Hey @KevinFFF

I assume the same applies to Xavier NX with SD Card?

Let’s assume that you fix the bug in UEFI and the system correctly sets the non-working slot to UNBOOTABLE.
I’d then add a working rootfs to that slot to restore its functionality. How will we be able to set the slot to BOOTABLE again? Will this happen automatically when using “nvbootctrl set-active-boot-slot”?

Xavier NX with SD card has another watchdog issue, which can be fixed with this as a quick workaround. We are still looking for the cause. If you have applied the above workaround on your Xavier NX with SD card, the situation from my previous reply also applies.

After fixing the corrupted partition, you can use the following steps to set the slot back to BOOTABLE in the UEFI menu.

Press ESC to enter the UEFI menu, then choose Device Manager → NVIDIA Configuration
→ L4T Configuration → OS chain B status (the value is Unbootable if UEFI attempts the recovery kernel) → choose Normal
→ Save and exit, reboot, and UEFI will try Direct Boot.

@KevinFFF

Thanks, I was aware of the watchdog issue.

We’ve been told multiple times that you could do it in UEFI.
That is not a feasible approach for an OTA update, though.

We cannot expect our customers to disassemble the whole case to get to the serial header, let alone ask them to operate the UEFI BIOS. That’s simply not possible. The other option would be to send a corrupted system back to the company. That does not make much sense either.

For 5.0.2 @JerryChang supplied a way to reset the UNBOOTABLE flag in the UEFI variables.
This seems to be gone in 5.1 but is of paramount importance. Please figure out a way to reset the UNBOOTABLE flag with a simple shell script from the second partition. Otherwise the whole A/B redundancy is not of any use at all.

Users do not need to disassemble the case or use the serial console to access the UEFI menu. They can just use a keyboard and a monitor, just like the BIOS on a PC.

The steps above do a similar thing to reset the flag; the earlier method is more complicated because multiple parameters have to be considered. For JP5.1, we simplified the process and made it configurable in the UEFI menu.

Thanks for the update @KevinFFF . By when can we expect the solution to be verified? We are in a situation where we need to accelerate our development process. If it is a few days, then we can wait, but if more time is required for testing and verifying the solution, then we need to know so we can plan accordingly.

(Also, does this solution that you have mentioned work if we corrupt slot A while actually using slot A itself? Or will it work only for the use case where the corrupted slot is the unused slot?)

Also, like @seeky15 said, is there any way we can set the unbootable slot’s status back to bootable from the command line? Our use case is similar to that of @seeky15 , where using the BIOS is not really a suitable option to reset the status of a slot (we would like to do this internally without having to use any monitor or any external input). If there is no way to do this from the command line, will it be added as a feature in any future release of Jetpack?

I’ve never been able to access the UEFI without serial on the devkit or our custom board. Seems something else is not working there. But that’s another topic as we are not going to use it.

If you’re saying that there is no option besides UEFI, then we’ll have to look for other solutions. This is a big oversight.

We have been offering remote service for over a decade now. Asking the customer to “fix” a broken system is not an option. Over-the-air updates need to work without the user attaching a monitor/keyboard/mouse to the system.

Every time the system stops responding for any reason, you’d have to ask the customer to do this. Are you guys living behind the moon? Are your automotive customers entering UEFI when this happens? I doubt it…

If you clearly state that you will not offer any other way to set the slot bootable again I am quite sure our project will be stopped and the hardware we bought will be returned.

@WayneWWW @JerryChang
Tagging you guys as I am afraid there has been some misplanning in this Jetpack. Maybe someone did not think of the actual use case of OTA, but if you really plan to go this way, please confirm it so we can stop the project before we waste even more time.

@madisox Can you comment on how meta-tegra is re-enabling an UNBOOTABLE slot after 35.2.1? I doubt UEFI is an option for your users either?

I spent some time the last couple weeks fighting with edk2 and a/b redundancy for the use case of android a/b boot control, which required digging pretty deeply into the code that controls all of this. I might know what’s going on for this specific problem.

The retry count is stored in a volatile scratch register. When the unit is cold booted, the retry count gets initialized to 3. Note that this only happens on cold boot; any warm boot, including an RCM reflash from a reset, will not reset this scratch register. When the EFI launcher app is run and a non-recovery target is being booted, the retry count is decremented. If the unit boots successfully, the boot control app (nvbootctrl in the case of L4T) will set the retry count for the current slot back to 3 via a devmem write. If the count is already 0 when the EFI launcher app is run, then RootfsStatusSlot in efivars is set to unbootable (0xff) for the current slot, BootChainFwNext in efivars is set to the opposite slot, and the unit is reset.
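
To make the sequence above concrete, here is a small toy model of the retry logic. It is only a sketch of the behavior as described in this post, not the edk2-nvidia code: the class `Unit`, its fields, and the single shared counter are simplifications made up for illustration (the post describes the count as being reset per slot).

```python
MAX_RETRIES = 3  # value the scratch register gets on cold boot


class Unit:
    """Toy model of the retry counter and the efivars it drives."""

    def __init__(self):
        self.slot = "A"
        self.retry = MAX_RETRIES  # stands in for the volatile scratch register
        self.rootfs_status = {"A": "normal", "B": "normal"}
        self.boot_chain_fw_next = None  # stands in for BootChainFwNext

    def launcher(self, boot_ok):
        """One pass through the EFI launcher for a non-recovery target."""
        if self.retry == 0:
            # Retries exhausted: mark the slot unbootable and request the other slot.
            self.rootfs_status[self.slot] = "unbootable"
            self.boot_chain_fw_next = "B" if self.slot == "A" else "A"
            return "switch-requested"
        self.retry -= 1
        if boot_ok:
            # A successful boot lets the boot control app restore the count.
            self.retry = MAX_RETRIES
            return "booted"
        return "panic"


unit = Unit()
results = [unit.launcher(boot_ok=False) for _ in range(4)]
print(results)
print(unit.rootfs_status, unit.boot_chain_fw_next)
```

In this model a corrupted slot A produces three panics, and on the fourth pass the slot is marked unbootable and slot B is requested, which matches the "three retries" reported earlier in the thread.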

Now, the BootChainDXE module in edk2-nvidia is supposed to check BootChainFwNext and change the boot slot accordingly. However, there’s a bug in this module. If the BootChainFwStatus efivar exists at all, the slot change will fail. This efivar gets set for a few different reasons within the module, mainly for errors. But it is also set to 0x0 after a successful slot change. So, after one slot change, no slot change can happen again unless the var is deleted. And if the slot change cannot happen and the rootfs status is unbootable, the unit will reboot endlessly. I have a patch to fix this specific issue; it applies to the edk2-nvidia repo at the r35.2.1 tag.
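
The effect of that check can be sketched with a toy model. This is an illustration of the behavior described above, not the real driver: the `Firmware` class, the `patched` flag, and the exact handling of the variables after an update are assumptions made for the example.

```python
from dataclasses import dataclass, field

STATUS_SUCCESS = 0x0  # value the post says is stored after a good slot change


@dataclass
class Firmware:
    current: str = "A"
    efivars: dict = field(default_factory=dict)
    patched: bool = False  # True models the relaxed status check

    def request_switch(self, slot):
        self.efivars["BootChainFwNext"] = slot

    def boot_chain_update(self):
        """Return True if the pending slot change was carried out."""
        target = self.efivars.get("BootChainFwNext")
        if target is None:
            return False
        status = self.efivars.get("BootChainFwStatus")
        stale_success = self.patched and status == STATUS_SUCCESS
        if status is not None and not stale_success:
            # Unpatched: any existing status blocks the change.
            # Patched: only a non-success status blocks it.
            return False
        self.current = target
        del self.efivars["BootChainFwNext"]
        self.efivars["BootChainFwStatus"] = STATUS_SUCCESS
        return True


def two_switches(patched):
    """Switch A -> B, then try to switch back B -> A."""
    fw = Firmware(patched=patched)
    fw.request_switch("B")
    first = fw.boot_chain_update()
    fw.request_switch("A")
    second = fw.boot_chain_update()
    return first, second, fw.current


print("unpatched:", two_switches(patched=False))
print("patched:  ", two_switches(patched=True))
```

In the unpatched model the first switch works and the second is refused because the leftover success status is still present; with the relaxed check both switches go through.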

From d750afa2125f5b837195ed89412ed253c8e10c4e Mon Sep 17 00:00:00 2001
From: Aaron Kling <webgeek1234@gmail.com>
Date: Sun, 12 Mar 2023 21:45:41 -0500
Subject: [PATCH] BootChainDxe: Don't fail boot chain updates for status simply
 existing

---
 Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c b/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
index cfaf687..d7eaf0f 100644
--- a/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
+++ b/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
@@ -336,8 +336,8 @@ BootChainExecuteUpdate (
         BCStatus = STATUS_ERROR_BOOT_CHAIN_FAILED;
         goto SetStatusAndBootOs;
       }
-    } else {
-      // Status is already ERROR or SUCCESS, finish the update and boot OS
+    } else if (BCStatus != STATUS_SUCCESS) {
+      // Status is already ERROR, finish the update and boot OS
       goto FinishUpdateAndBootOs;
     }
   }
-- 
2.39.2

Dear Kevin,

I have tested the fallback mechanism on the new Jetpack 5.1.1. Unfortunately, UEFI still goes into an endless reboot loop. Can you please investigate this problem in the latest Jetpack release, 5.1.1? When will the problem be solved? @KevinFFF

Dear @sanaurrehman,

Can you please elaborate on why you skipped the first 10K (by using seek=10) while corrupting mmcblk0p1, instead of corrupting mmcblk0p1 from the beginning? What do these first 10k bytes contain?

It should be fixed in the upcoming JP5.1.2 release.


Hi @Max_Dichler ,

There were two reasons why I did this:

  1. In a real system, the nature of data corruption is random. The data could be corrupted from any point onwards, not necessarily at the beginning of any partition. I wanted to simulate that scenario by choosing a random starting point for corrupting the partition.
  2. Initially, the rootfs feature was not working as expected. Therefore, as an experiment (to validate whether my test procedure was right or not), I was told by someone (who has experience in Linux-based systems) to skip the first few bytes and see whether that made any difference. The number 10k does not have any significance. It was chosen randomly just to skip the first few bytes of the partition. (I don’t remember exactly, but I think it was something to do with the Linux journal.)

Thanks for the update.

Thanks for the info
