L4T 5.1 reboot loop after enabling watchdog with RootFS A/B

seeky15 · March 20, 2023, 8:27am

Hey team,

as @KevinFFF reponded here Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System the watchdog magically is not enabled anymore in 5.1.

Apart from that, everything should be working fine now. I’ve been assured, multiple times, for weeks.

Tried the following:

Patched the device tree to enable the watchdog
Flashed the system with rootfs A/B max retry count 3 and made sure that the watchdog is enabled now
Removed the whole filesystem content of the non-current slot with “rm -rf /*”
Switched the slot

Observed:
The system now reboots 120 seconds after the kernel panic. Endless, in a loop.

Expected:
After 3 tries it should boot into the other slot.

Come on! What is wrong with this software? Is there not a single part of the advertised A/B feature that’s actually working?

FYI @sanaurrehman @Max_Dichler

Max_Dichler · March 20, 2023, 8:47am

Hey @seeky15,

Thank you for referencing me!

I think solving the watchdog problem alone wouldn’t have solved the whole issue from the beginning, not? because the purpose of the watchdog is to restart the system after the kernel panic and we can actually do that manually by pressing the restart button without a watchdog, which will trigger the system to go into this restarting endless loop. I can confirm that this endless loop problem is happening also on my side.

Regards,
Max

KevinFFF · March 21, 2023, 7:20am

Hi all,

Watchdog disabled issue seems happened on Xavier NX devkit with SD module.
We still work on it and find the root cause, but watchdog and rootfs_A/B fail-over mechanism should work on Xavier NX devkit with eMMC with Jetpack 5.1(R35.2.1).

seeky15 · March 21, 2023, 7:33am

What you say makes no sense. Did you actually test that?

The device tree for the production module tegra194-p3668-0001-p3509-0000.dtb also has the watchdog disabled…

watchdog {
	wdt-boot-timeout = <0x20>;
	wdt-suspend-timeout = <0x78>;
	status = "disabled";
	phandle = <0x41b>;
};

Running the same test with the production module. Will report back.

KevinFFF · March 21, 2023, 7:37am

The watchdog should be enabled by overlay dtbo and odmdata. We’ve verified that it would work on Xavier NX devkit with eMMC. We are finding the difference between them.

seeky15 · March 21, 2023, 9:08am

Hey @KevinFFF

you’re right, on the emmc module the watchdog is enabled by default, despite the dtb saying the opposite.

Now let’s get back to the original topic.

With emmc, I get 3 kernel panics followed by a reboot. The system says it is switching the boot chain then.
As with the sd module the system after that starts to reboot endlessly without switching to the other slot.

After 11 reboots I stopped watching…That’s sure not how it is meant to be?

sanaurrehman · March 21, 2023, 10:04am

Sorry to hijack this thread, but I am facing the same issue as @seeky15 in Jetpack 5.1 on AGX Xavier. Therefore, it is probable that the root cause for this issue is the same in both boards.

I too flash A/B file system, then go to slotB, corrupt slotA, and then use nvbootctrl to boot slotA. The system goes to kernel panic (as it should), and then after three retries, it keeps rebooting endlessly without switching over to the backup slot.

KevinFFF · March 22, 2023, 3:06am

Could you help to provide the full serial console log for further check?
Please capture them from you corrupt the rootfs.

sanaurrehman · March 22, 2023, 5:37am

Here’s the log for AGX-Xavier.
AGX_Xavier_Serial_Console_Log.txt (428.6 KB)

I booted into slotA, corrupted slotB, then rebooted. The log shows what happens after the reboot command.

The same happens when I boot into slotB, corrupt slotA, then reboot. (That is, the same behavior is seen when going from slotA to slotB or from slotB to slotA).

Also, I tried corrupting the slot using dd command, and by removing entire rootfs. In all cases, I get the same outcome: The system retries three times, then goes into an endless rebooting cycle.

For reference (so that issue may be reproduced at your end), the dd commands used were:

sudo dd if=/dev/zero of=/dev/mmcblk0p1 bs=1024k seek=10 count=40 (For corrupting slotA from slotB)
sudo dd if=/dev/zero of=/dev/mmcblk0p2 bs=1024k seek=10 count=40 (For corrupting slotB from slotA)

KevinFFF · March 23, 2023, 3:11pm

Hi sanaurrehman,

Thanks for your log.
We’ve known the root cause of the “endless rebooting” issue. There’s a bug in UEFI, we’ve gotten a solution and it needs more verification and testing.
In this moment, your use case would not trigger it recover back to slot A.

seeky15 · March 23, 2023, 3:16pm

Hey @KevinFFF

I assume the same applies to Xavier NX with SD Card?

Let’s assume that you fix the bug in UEFI and the system will correctly set the not working slot to UNBOOTABLE.
I’d add a working rootfs to that slot afterwards then to recover it’s functionality. How will we be able to set the slot to BOOTABLE again? Will this automactially happen when using “nvbootctlr set-active-boot-slot”?

KevinFFF · March 23, 2023, 3:27pm

Xavier NX with SD Card has another watchdog issue, which could be fixed by this as quick workaround. We are still finding the cause. If you have above workaround on your Xavier NX with SD card, the situation from my previous reply could also be applied.

After fixing the corrupted partition, you could refer to the following steps to set the slot back to BOOTABLE in UEFI menu.

Press ESC to enter UEFI Menu, then choose Device Manager → NVIDIA Configuration 
→ L4T Configuration → OS chain B status → (The value is Unbootable if UEFI attemps recovery kernel) choose Normal 
→ Save and exit, reboot, UEFI will try Direct Boot.

seeky15 · March 23, 2023, 3:32pm

@KevinFFF

Thanks, I was aware of the watchdog issue.

We’ve been told multiple times that you could do it in UEFI.
That is not a feasible approach for an OTA update tough.

We can not expect our customers to disassemble the whole case to get to the serial header, let alone ask them to operate the uefi bios. That’s simply not possible. The other option would be to send a corrupted system back to the company. That’s does not make much sense either.

For 5.0.2 @JerryChang supplied a way to reset the UNBOOTABLE flag in the UEFI variables.
This seems to be gone in 5.1 but is of paramount importance. Please figure out a way to reset the UNBOOTABLE flag with a simple shell script from the second partition. Otherwise the whole A/B redundancy is not of any use at all.

KevinFFF · March 23, 2023, 3:48pm

User do not need to dissemble the case or use serial console to access UEFI menu. They could just use a keyboard and a monitor, just like the bios on the PC.

Above steps do the similar thing to reset the flag, that one should be more complicated due to multiple parameters should be considered. For JP5.1, we simplify the process and make it configurable in UEFI menu.

sanaurrehman · March 24, 2023, 4:34am

Thanks for the update @KevinFFF . By when can we expect the solution to be verified? As we are in a situation where we need to accelerate our development process. If it is a few days, then we can wait, but if more time is required for testing and verifying the solution, then we need to know so we can plan accordingly.

(Also, does this solution that you have mentioned work if we corrupt slot A while actually using slot A itself? Or will it work only for the use case where the corrupted slot is the unused slot?)

Also, like @seeky15 said, is there any way we can set the unbootable slot’s status back to bootable using command line? Our use case is similar to @seeky15 , where using the BIOS is not really a suitable option to reset the status of a slot (We would like to do this internally without having to use any monitor or any external input). If there is no way to do this using command line, will it be added as a feature in any future release of Jetpack?

seeky15 · March 24, 2023, 5:12am

I’ve never been able to access the UEFI without serial on the devkit or our custom board. Seems something else is not working there. But that’s another topic as we are not going to use it.

If you’re saying that there is no option besides UEFI, then we’ll have to look for other solutions. This is a big oversight.

We have been offering remote service for over a decade now. There is no option that we will ask the customer to “fix” a system which is broken. Over the AIR updates need to work without the user attaching a monitor/keyboard/mouse to a system.

Everytime the system is not responding for any reason you’d have to ask the customer to do this. Are you guys living behind the moon? Are your automotive customers entering uefi when this happens? I doubt…

If you clearly state that you will not offer any other way to set the slot bootable again I am quite sure our project will be stopped and the hardware we bought will be returned.

@WayneWWW @JerryChang
Tagging you guys as I am afraid there has been some missplanning in this Jetpack, maybe someone did not think of the actual use case of OTA, but if you really plan to go this way, please confirm it so we can stop the project before we waste even more time.

@madisox Can you comment on how meta-tegra is re enabling a UNBOOTABLE slot after 35.2.1? I doubt uefi is an option for your users either?

Steel01 · March 24, 2023, 7:16am

I spent some time the last couple weeks fighting with edk2 and a/b redundancy for the use case of android a/b boot control, which required digging pretty deeply into the code that controls all of this. I might know what’s going on for this specific problem.

The retry count is stored in a volatile scratch register. When the unit is cold booted, the retry count gets initialized to 3. Note that this only happens on cold boot, any warm boot including an rcm reflash from a reset will not reset this scratch register. When the efi launcher app is run and a non-recovery target is being booted, the retry count is decremented. If the unit boots successfully, the boot control app (nvbootctrl in the case of l4t) will set the retry count for the current slot back to 3 via a devmem write. If the count is already 0 when the efi launcher app is run, then RootfsStatusSlot in efivars is set to unbootable or 0xff for the current slot and also BootChainFwNext in efivars is set to the opposite slot, then it resets the unit.

Now, the BootChainDXE module in edk2-nvidia is supposed to check BootChainFwNext and appropriately change the boot slot. However, there’s a bug in this module. If the BootChainFwStatus efivar exists at all, the slot change will fail. This efivar gets set for a few different reasons within the module, mainly for errors. But it is also set to 0x0 after a successful slot change. So, after one slot change, no slot change can happen again unless the var is deleted. And if the slot change cannot happen, plus the rootfs status is unbootable, the unit will reboot endlessly. I have a patch to fix this specific issue, applies to the edk2-nvidia repo on the r35.2.1 tag.

From d750afa2125f5b837195ed89412ed253c8e10c4e Mon Sep 17 00:00:00 2001
From: Aaron Kling <webgeek1234@gmail.com>
Date: Sun, 12 Mar 2023 21:45:41 -0500
Subject: [PATCH] BootChainDxe: Don't fail boot chain updates for status simply
 existing

---
 Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c b/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
index cfaf687..d7eaf0f 100644
--- a/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
+++ b/Silicon/NVIDIA/Drivers/BootChainDxe/BootChainDxe.c
@@ -336,8 +336,8 @@ BootChainExecuteUpdate (
         BCStatus = STATUS_ERROR_BOOT_CHAIN_FAILED;
         goto SetStatusAndBootOs;
       }
-    } else {
-      // Status is already ERROR or SUCCESS, finish the update and boot OS
+    } else if (BCStatus != STATUS_SUCCESS) {
+      // Status is already ERROR, finish the update and boot OS
       goto FinishUpdateAndBootOs;
     }
   }
-- 
2.39.2

Max_Dichler · July 4, 2023, 2:14pm

Dear Kevin,

I have tested the fallback mechanism on the new Jetpack 5.1.1. Unfortunately, UEFI goes into endless rebooting behavior. Can you please investigate this problem in this latest Jetpack release 5.1.1? When will the problem be solved? @KevinFFF

Max_Dichler · July 4, 2023, 2:17pm

Dear @sanaurrehman,

Can you please elaborate on why you skipped the first 10K by using (seek=10) while corrupting mmcblk0p1? instead of corrupting mmcblk0p1 from the beginning. What do these first 10k bytes contain?

KevinFFF · July 5, 2023, 3:23am

It should be fixed in the upcoming JP5.1.2 release.

Topic		Replies	Views
Rootfs A/B redundancy fail-over mechanism in Jetpack5.1 Jetson Xavier NX kb	13	4757	March 18, 2024
A/B ROOTFS Redundancy: Bootloader does not boot from backup slot when the working slot is intentionally corrupted Jetson AGX Xavier security , nvbugs	15	1956	March 24, 2023
Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System Jetson Xavier NX boot , nvbugs	21	2363	March 20, 2023
Need Help in Understanding Failover in RootFS A/B redundancy Jetson AGX Xavier security	13	1632	October 7, 2024
Mark bootloader bootable Jetson Xavier NX security	27	2266	November 29, 2022
Setting bootable and unbootable AB rootFS slots for L4T r35.1 Jetson AGX Xavier security	37	3441	April 25, 2023
Redundant A/B rootfs not switching with set-active-boot-slot but working with set-SR-BR Jetson AGX Xavier ota	33	204	October 2, 2024
L4T34.1.1 UEFI boot error on Xavier NX Jetson Xavier NX boot	19	1878	August 3, 2022
Jetpack 5.1 needs clarification from NVIDIA! Jetson Xavier NX boot , nvbugs	5	1060	April 25, 2023
Jetson Xavier Nx Not Booting Jetson Xavier NX boot	56	14021	October 18, 2021

L4T 5.1 reboot loop after enabling watchdog with RootFS A/B

Related topics