Failed Bootloader Watchdog Recovery

joseph.swantek · September 15, 2020, 2:58am

Platform: Xavier NX (Dev Kit) w/ L4T R32.4.3

We are going to deploy Xavier NX devices in remote locations where they must gracefully handle any boot failure, which is why I am investigating the A/B boot solution. While experimenting with failure scenarios, I came across one where the system halted indefinitely and required a manual power cycle to recover.

I have enabled A/B partitions with redundancy by making the following change to smd_info.cfg before flashing:

# slot info order is important!
# <priority>    <suffix>     <retry_count>  <boot_successful>
#15                  _a          7               1

#
# Config 2: Enable redundancy support (by removing comments ##)
#
< REDUNDANCY_USER 1 >

# slot info order is important!
# <priority>    <suffix>     <retry_count>  <boot_successful>
15                  _a          7               1
14                  _b          7               1
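
Before flashing, an edit like this can be sanity-checked with a small script. This is only a sketch: the function name `check_smd_redundancy` is made up for illustration, and it assumes the file follows the layout of the excerpt above (an uncommented `< REDUNDANCY_USER 1 >` line plus uncommented `_a` and `_b` slot rows).

```shell
#!/bin/sh
# Hypothetical helper: report whether an smd_info.cfg-style file has
# A/B redundancy enabled. Assumes the layout shown in the excerpt above.
check_smd_redundancy() {
    cfg="$1"
    # Look for the uncommented redundancy switch, e.g. "< REDUNDANCY_USER 1 >".
    grep -Eq '^[[:space:]]*<[[:space:]]*REDUNDANCY_USER[[:space:]]+1[[:space:]]*>' "$cfg" || {
        echo "redundancy: disabled"
        return 1
    }
    # Count uncommented slot rows: <priority> <suffix> <retry_count> <boot_successful>
    slots=$(grep -Ec '^[[:space:]]*[0-9]+[[:space:]]+_[ab][[:space:]]+[0-9]+[[:space:]]+[01][[:space:]]*$' "$cfg" || true)
    echo "redundancy: enabled, active slot rows: $slots"
    # Succeed only when both slots are present.
    [ "$slots" -eq 2 ]
}
```

It could be called as `check_smd_redundancy "$LDK_DIR/bootloader/smd_info.cfg"`; that path is an assumption about a standard Linux_for_Tegra tree and may differ in other setups.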

Then, after booting successfully on slot 0 (A), I erased the kernel on slot 1 (B) by formatting its partition and set the next boot to use slot 1 (B). The subsequent boot reported "Security fuse not burned, ignore validation failure":

[0007.291] I> A/B: bin_type (37) slot 1
[0007.292] I> Loading kernel_b from partition
[0007.292] I> Loading partition kernel_b at 0xa43f0000 from device(0x6)
[0012.732] I> Validate kernel ...
[0012.733] I> T19x: Authenticate kernel (bin_type: 37), max size 0x5000000
[0012.733] E> Stage2Signature validation failed with SHA2!!
[0012.734] C> OEM authentication of kernel header failed!
[0012.734] W> Failed to validate kernel binary (err=1077936152)
[0012.734] W> Security fuse not burned, ignore validation failure 
[0012.739] I> No kernel-dtb binary path
[0012.743] I> A/B: bin_type (38) slot 1
[0012.746] I> Loading kernel-dtb_b from partition
[0012.751] I> Loading partition kernel-dtb_b at 0x91000000 from device(0x6)
[0012.801] I> Validate kernel-dtb ...
[0012.802] I> T19x: Authenticate kernel-dtb (bin_type: 38), max size 0x400000

Then the system halted (as it attempted to boot from the bad kernel)

[0015.024] panic (caller 0xa0601238): die
[0015.028] HALT: spinning forever...

I then waited several minutes before manually power cycling the device (unplug and plug back in). It again booted to the same halted result. I power cycled again. Same result. Power cycled again. On the 3rd power-up the system switched back to slot 0 (A) and booted successfully.
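
For anyone wanting to reproduce this, the failure injection (erasing slot B's kernel, then selecting slot 1) could be scripted roughly as follows. The exact commands were not given in the post, so treat this as a sketch under assumptions: the partition path `/dev/disk/by-partlabel/kernel_b` and the use of `dd` to zero the start of the partition are guesses, and the helper names are hypothetical. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the failure-injection steps described above. Assumptions (not from
# the original post): the slot B kernel partition is exposed as
# /dev/disk/by-partlabel/kernel_b, and zero-filling its start stands in for
# "formatting". Commands are only printed unless DRY_RUN=0 is set explicitly.
KERNEL_B_DEV="${KERNEL_B_DEV:-/dev/disk/by-partlabel/kernel_b}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

inject_kernel_b_failure() {
    # Overwrite the first 1 MiB of the slot B kernel partition so that it no
    # longer holds a valid boot image.
    run dd if=/dev/zero of="$KERNEL_B_DEV" bs=1M count=1
    # Ask the bootloader to try slot 1 (B) on the next boot.
    run nvbootctrl set-active-boot-slot 1
}
```

Running `inject_kernel_b_failure` with the defaults lists the two commands; setting `DRY_RUN=0` (as root, on a test device only) would actually execute them.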

The boot log after recovery:

[0007.292] I> A/B: bin_type (37) slot 0
[0007.292] I> Loading kernel from partition
[0007.293] I> Loading partition kernel at 0xa43f0000 from device(0x6)
[0012.734] I> Validate kernel ...
[0012.735] I> T19x: Authenticate kernel (bin_type: 37), max size 0x5000000

nvbootctrl dump-slots-info output after recovery:

root@nx-sample-token:~# nvbootctrl dump-slots-info
magic:0x43424e00,             version: 3             features: 3             num_slots: 2
slot: 0,             priority: 15,             suffix: _a,             retry_count: 7,             boot_successful: 1
slot: 1,             priority: 14,             suffix: _b,             retry_count: 7,             boot_successful: 1
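
When checking many remote devices, output like this can be condensed into one line per slot. The following is a minimal sketch; `summarize_slots` is a hypothetical helper, and it assumes the field layout stays exactly as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: summarise `nvbootctrl dump-slots-info` output read from
# stdin. The field layout is taken from the output shown above.
summarize_slots() {
    awk '
        /^slot:/ {
            gsub(/,/, "")   # drop the commas so fields become "key: value" pairs
            for (i = 1; i < NF; i += 2) kv[$i] = $(i + 1)
            state = (kv["boot_successful:"] == "1") ? "ok" : "NOT marked successful"
            printf "slot %s (%s): priority=%s retry_count=%s %s\n",
                   kv["slot:"], kv["suffix:"], kv["priority:"], kv["retry_count:"], state
        }
    '
}
```

Typical use on the device would be `nvbootctrl dump-slots-info | summarize_slots`.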

These are my questions

1. I understand that if I burn the security fuse and properly sign my images, this kernel image would not pass the validation check. What would happen in that case? I assume I would not see the warning "ignore validation failure" and that the system would reboot right away instead of continuing to boot?
2. I also assumed that the watchdog enabled by Cboot would cause the system to reboot automatically once it halted. I am fairly sure the Cboot watchdog is enabled, as I have not made any changes to the bootloader config. Why didn’t the watchdog trigger a reboot? (Also, which watchdog is enabled in Cboot: the Tegra watchdog or the PMIC watchdog? And is the “denver” WDT the same as the Tegra watchdog?)
3. This is related to question (2) above. Looking at the “NX Software Features” in the L4T documentation, it says for the “PMIC Watchdog” that “reboot from hang” is “disabled in ODMDATA by default”. How do we enable this? Do we want to? I looked but could not find documentation for ODMDATA and what its bits represent.

Here is my ODMDATA for reference

$ LDK_DIR=/workspace/Linux_for_Tegra
$ source jetson-xavier-nx-devkit.conf
$ echo $ODMDATA
0xB8190000

4. Why did it fall back to slot 0 after the 3rd power cycle? I expected it to take 7 (retry_count) cycles before recovery. Also, why didn’t it mark the boot_successful flag as 0?

WayneWWW · September 23, 2020, 8:11am

Hi,

Please try adding this patch to the cboot source; it will shorten the time you have to wait for the WDT to trigger.
Without this patch, you’ll have to wait 8~9 minutes for the WDT to trigger. A manual reset will not make the system perform the A/B switch.

diff --git a/cboot/app/kernel_boot/kernel_boot.c b/cboot/app/kernel_boot/kernel_boot.c
index d04188c..79bd566 100644
--- a/cboot/app/kernel_boot/kernel_boot.c
+++ b/cboot/app/kernel_boot/kernel_boot.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2016-2019, NVIDIA Corporation.	All Rights Reserved.
+ * Copyright (c) 2016-2020, NVIDIA Corporation.	All Rights Reserved.
  *
  * NVIDIA Corporation and its licensors retain all intellectual property and
  * proprietary rights in and to this software and related documentation.  Any
@@ -179,10 +179,19 @@
 #endif
 
 	err = tegrabl_load_kernel_and_dtb(kernel, &kernel_entry_point,
-									  &kernel_dtb, &callbacks, NULL, 0);
+						  &kernel_dtb, &callbacks, NULL, 0);
+
+	/*
+	 * Update smd if a/b retry counter changed
+	 * The slot priorities are rotated here too,
+	 * in case kernel or kernel-dtb load failed.
+	*/
+	tegrabl_a_b_update_smd();
+
 	if (err != TEGRABL_NO_ERROR) {
 		TEGRABL_SET_HIGHEST_MODULE(err);
-		pr_error("kernel boot failed\n");
+		pr_error("kernel boot failed, will reset.\n");
+		tegrabl_reset();
 		return err;
 	}
 
@@ -200,9 +209,6 @@
 	tegrabl_profiler_record("kernel_boot exit", 0, DETAILED);
 #endif
 
-	/* Update smd if a/b retry counter changed */
-	tegrabl_a_b_update_smd();
-
 	pr_info("Kernel EP: %p, DTB: %p\n", kernel_entry_point, kernel_dtb);
 
 	platform_uninit();

WayneWWW · September 23, 2020, 8:29am

As for the 3 retries: according to our internal team, only a bootloader failure takes 7 attempts before the slot switches. For kernel and kernel-dtb failures, it takes only 3.
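
To make the 3-versus-7 behaviour concrete, here is a toy model of the fallback. It is only an illustration, not cboot's actual algorithm: the limits come from the reply above, while the function name and the assumption that each failed boot consumes exactly one attempt are made up for the sketch.

```shell
#!/bin/sh
# Toy model (NOT cboot code): list boot attempts until the fallback slot is used.
# Limits as stated in the reply above: 3 attempts for kernel/kernel-dtb
# failures, 7 for bootloader failures. One failed boot = one attempt (assumption).
attempts_until_fallback() {
    stage="$1"    # "kernel", "kernel-dtb" or "bootloader"
    case "$stage" in
        kernel|kernel-dtb) limit=3 ;;
        bootloader)        limit=7 ;;
        *) echo "unknown stage: $stage" >&2; return 2 ;;
    esac
    attempt=0
    # Every attempt on slot B fails until the limit is exhausted.
    while [ "$attempt" -lt "$limit" ]; do
        attempt=$((attempt + 1))
        echo "attempt $attempt: slot B failed at $stage"
    done
    echo "attempt $((attempt + 1)): falling back to slot A"
}
```

Under these assumptions, `attempts_until_fallback kernel` shows three failed attempts on slot B followed by a fourth boot on slot A, which matches the sequence observed in the original post.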