Platform: Xavier NX (Dev Kit) w/ L4T R32.4.3
We are going to deploy Xavier NX devices in remote locations where they need to gracefully handle any boot failures, which is why I am investigating the A/B boot solution. While experimenting with failure scenarios, I came across one where the system halted indefinitely and required a manual power cycle to recover.
I have enabled A/B partitions with redundancy by making the following change to smd_info.cfg before flashing:
# slot info order is important!
# <priority> <suffix> <retry_count> <boot_successful>
#15 _a 7 1
#
# Config 2: Enable redundancy support (by removing comments ##)
#
< REDUNDANCY_USER 1 >
# slot info order is important!
# <priority> <suffix> <retry_count> <boot_successful>
15 _a 7 1
14 _b 7 1
Then, after booting successfully on slot 0 (A), I forced an erase of the kernel on slot 1 (B) by formatting the partition, and set the next boot to use slot 1 (B). The subsequent boot log showed the kernel failing validation, with the failure ignored because the security fuse is not burned:
[0007.291] I> A/B: bin_type (37) slot 1
[0007.292] I> Loading kernel_b from partition
[0007.292] I> Loading partition kernel_b at 0xa43f0000 from device(0x6)
[0012.732] I> Validate kernel ...
[0012.733] I> T19x: Authenticate kernel (bin_type: 37), max size 0x5000000
[0012.733] E> Stage2Signature validation failed with SHA2!!
[0012.734] C> OEM authentication of kernel header failed!
[0012.734] W> Failed to validate kernel binary (err=1077936152)
[0012.734] W> Security fuse not burned, ignore validation failure
[0012.739] I> No kernel-dtb binary path
[0012.743] I> A/B: bin_type (38) slot 1
[0012.746] I> Loading kernel-dtb_b from partition
[0012.751] I> Loading partition kernel-dtb_b at 0x91000000 from device(0x6)
[0012.801] I> Validate kernel-dtb ...
[0012.802] I> T19x: Authenticate kernel-dtb (bin_type: 38), max size 0x400000
Then the system halted (as it attempted to boot from the bad kernel):
[0015.024] panic (caller 0xa0601238): die
[0015.028] HALT: spinning forever...
I then waited several minutes before manually power cycling the device (unplugging it and plugging it back in). It booted to the same halted state. I power cycled again, with the same result. I power cycled once more, and on this 3rd power-up the system switched back to slot 0 (A) and booted successfully.
The boot log and nvbootctrl dump-slots-info output after recovery:
[0007.292] I> A/B: bin_type (37) slot 0
[0007.292] I> Loading kernel from partition
[0007.293] I> Loading partition kernel at 0xa43f0000 from device(0x6)
[0012.734] I> Validate kernel ...
[0012.735] I> T19x: Authenticate kernel (bin_type: 37), max size 0x5000000
root@nx-sample-token:~# nvbootctrl dump-slots-info
magic:0x43424e00, version: 3 features: 3 num_slots: 2
slot: 0, priority: 15, suffix: _a, retry_count: 7, boot_successful: 1
slot: 1, priority: 14, suffix: _b, retry_count: 7, boot_successful: 1
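To make it easier to compare slot state across power cycles, here is a minimal sketch that summarizes the output above. The function name check_slots is hypothetical (it is not part of L4T), and it assumes the exact "slot: N, priority: P, suffix: S, retry_count: R, boot_successful: B" line format shown in the dump; it only flags slots whose values differ from what was flashed via smd_info.cfg.

```shell
# Hypothetical helper (not part of L4T): reads `nvbootctrl dump-slots-info`
# output on stdin and reports slots that deviate from the flashed defaults.
# Assumes the line format shown in the dump above.
check_slots() {
    max_retry="${1:-7}"   # retry_count configured in smd_info.cfg
    awk -F'[:,] *' -v max="$max_retry" '
        /^slot:/ {
            # After splitting on ":" or "," (plus spaces):
            # $2=slot number, $6=suffix, $8=retry_count, $10=boot_successful
            status = ($8 + 0 == max + 0 && $10 + 0 == 1) ? "default" : "changed"
            printf "slot %s (%s): retry_count=%s boot_successful=%s [%s]\n", $2, $6, $8, $10, status
        }'
}

# On the device (as root): nvbootctrl dump-slots-info | check_slots 7
```

With the dump above, both slots would be reported as [default], which is the surprising part: nothing in the slot metadata reflects the failed boots on slot 1.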
These are my questions:
1. I understand that if I burn the security fuse and properly sign my images, the kernel image would not have passed the validation check. What would happen in that case? I assume I would not see the "ignore validation failure" warning, and that the system would reboot right away instead of continuing to boot?
2. I also assumed that the Cboot-enabled watchdog would have caused the system to reboot automatically when it halted. I am fairly sure that the Cboot watchdog is enabled, as I have not made any changes to the bootloader config. Why didn't the watchdog trigger a reboot? (Also, which watchdog is enabled in Cboot: the Tegra watchdog or the PMIC watchdog? And is the "denver" WDT the same as the Tegra watchdog?)
3. This is related to question 2 above. Looking at the "NX Software Features" in the L4T documentation, it says for the "PMIC Watchdog" that "reboot from hang" is "disabled in ODMDATA by default". How do we enable this? Do we want to? I looked but could not find documentation for ODMDATA and what all the bits represent.
Here is my ODMDATA for reference:
$ LDK_DIR=/workspace/Linux_for_Tegra
$ source jetson-xavier-nx-devkit.conf
$ echo $ODMDATA
0xB8190000
4. Why did it fall back to slot 0 after the 3rd power cycle? I expected it to take 7 (retry_count) cycles before recovery. Also, why didn't it mark the boot_successful flag as 0?
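For reference on question 3, here is a minimal sketch that lists which bit positions are set in the ODMDATA value above. The function name odmdata_bits is hypothetical, and the sketch deliberately does not interpret the bits, since their meaning (including which one controls the PMIC watchdog) is exactly what I could not find documented.

```shell
# Hypothetical helper: print the bit positions (31..0) that are set in a
# 32-bit ODMDATA value. It does NOT interpret the bits; their meaning has to
# come from NVIDIA's documentation.
odmdata_bits() {
    value=$(( $1 ))      # accepts hex input such as 0xB8190000
    bit=31
    bits=""
    while [ "$bit" -ge 0 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            bits="$bits $bit"
        fi
        bit=$(( bit - 1 ))
    done
    echo "${bits# }"     # strip the leading space
}

odmdata_bits 0xB8190000   # prints: 31 29 28 27 20 19 16
```

Knowing which positions are set should at least make it possible to check any bit assignments that get pointed out against my current value.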