ROOTFS_AB enabled, but the device does not reboot when rootfs A fails to boot

With 35.3.1 on a Jetson Orin NX 16GB, ROOTFS_AB works well.
With 36.3.0, the device fails to reboot after I remove /usr/sbin/init and /usr/bin/systemd to break the rootfs. The boot process stops, and the device never reboots to retry booting rootfs_A.

The command list is:

sudo tar xpf ${SAMPLE_FS_PACKAGE} -C Linux_for_Tegra/rootfs/
cd Linux_for_Tegra/
sudo ./tools/l4t_flash_prerequisites.sh
sudo ./apply_binaries.sh
sudo ./tools/l4t_create_default_user.sh -u myname -p passwd -n myname --autologin

sudo ROOTFS_AB=1 ./tools/kernel_flash/l4t_initrd_flash.sh \
  --external-device nvme0n1 \
  -S 56GiB \
  -c tools/kernel_flash/flash_l4t_nvme_rootfs_ab.xml \
  -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs \
  --network usb0 jetson-orin-nano-devkit internal


When I run
sudo ./apply_binaries.sh
I get this error:

The flash process succeeds, and the device boots successfully to rootfs A.
To test the ROOTFS_AB function, I remove /lib to corrupt the rootfs_A file system, then reboot the device.
It hits a kernel panic at boot (because the file system is corrupted), as described in "Rootfs A/B redundancy fail-over mechanism in Jetpack5.1".
But the device stays in the panic and does not reboot automatically, so it cannot perform the 3 retries and switch to rootfs_B.

I followed this post:

but there is no watchdog in /proc/device-tree

hello liu.junnan,

since the latest public release is JetPack 6.1/L4T 36.4,
is it possible to move forward to that release and verify there?

I plan to use jetson-inference to develop an AI program on Jetson, but jetson-inference does not have a branch for L4T 36.4.

sudo dmesg | grep watchdog
I found these messages:

systemd[1]: Using hardware watchdog 'NVIDIA Tegra186 WDT', version 0, device /dev/watchdog
systemd[1]: Set hardware watchdog to 2min.

Does this mean the watchdog is enabled? The device still does not reboot automatically.

In /proc/device-tree/bus@0/watchdog@2190000, I found that the watchdog status is disabled. Should I rebuild the dtb to enable the watchdog?

@JerryChang

After modifying the watchdog@2190000 status from disabled → okay in tegra234-p3768-0000+p3767-0000-nv.dtb,
I checked that /proc/device-tree/bus@0/watchdog@2190000/status is 'okay' now.
But the device still does not reboot after the panic. The bug still exists!
Need help! Thanks.

The watchdog node in the dtb is:

                watchdog@2190000 {
                        compatible = "nvidia,tegra-wdt-t234";
                        reg = <0x00 0x2190000 0x00 0x10000 0x00 0x2090000 0x00 0x10000 0x00 0x2080000 0x00 0x10000>;
                        interrupts = <0x00 0x07 0x04 0x00 0x08 0x04>;
                        nvidia,watchdog-index = <0x00>;
                        nvidia,timer-index = <0x07>;
                        nvidia,enable-on-init;
                        nvidia,extend-watchdog-suspend;
                        timeout-sec = <0x78>;
                        nvidia,disable-debug-reset;
                        status = "okay";
                };

and /proc/device-tree/bus@0/watchdog@2190000/timeout-sec is
00 00 00 78
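
For readers who want to check these two properties on a running board, here is a minimal sketch. It assumes the properties are exposed under /proc/device-tree in the usual way: strings are NUL-terminated and 32-bit cells are stored big-endian. The helper names `dt_string` and `dt_u32` are hypothetical, not part of L4T. The bytes 00 00 00 78 decode to 0x78 = 120 seconds, which agrees with `timeout-sec = <0x78>` in the node above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of L4T) for reading device-tree properties.

# Print a string property such as "status" without its trailing NUL byte.
dt_string() {
    tr -d '\000' < "$1"
    echo
}

# Print a single 32-bit cell (stored big-endian) as a decimal number.
dt_u32() {
    hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    printf '%d\n' "0x$hex"
}

# Node path taken from the post above; adjust it for other boards.
node=/proc/device-tree/bus@0/watchdog@2190000
if [ -d "$node" ]; then
    echo "status:      $(dt_string "$node/status")"
    echo "timeout-sec: $(dt_u32 "$node/timeout-sec")"
fi
```

On a host without that node the script prints nothing, so it is safe to run anywhere.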

hello liu.junnan,

we’ve tested system redundancy on r36.3/Orin-Nano and confirmed it works normally.

here are the flash commands and some test steps for your reference.
$ sudo ROOTFS_AB=1 ROOTFS_RETRY_COUNT_MAX=3 ./tools/kernel_flash/l4t_initrd_flash.sh --showlogs -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" --no-flash --network usb0 jetson-orin-nano-devkit internal
$ sudo ROOTFS_AB=1 ROOTFS_RETRY_COUNT_MAX=3 ./tools/kernel_flash/l4t_initrd_flash.sh --showlogs --no-flash --external-device nvme0n1p1 -c ./tools/kernel_flash/flash_l4t_t234_nvme_rootfs_ab.xml --external-only --append --network usb0 jetson-orin-nano-devkit external
$ sudo ./tools/kernel_flash/l4t_initrd_flash.sh --showlogs --network usb0 --flash-only

we erased the APP_A partition for testing.
$ ls -al /dev/disk/by-partlabel
$ sudo dd if=/dev/zero of=/dev/nvme0n1p1 bs=1M count=1
$ sudo reboot
it’ll switch to APP_B after 3 trials.
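
Because `dd` to the wrong partition is destructive, it may help to resolve a partition label to its device node before erasing anything. The following is only a sketch: `partlabel_dev` is a hypothetical helper, and the label passed to it should be whichever name the `ls -al /dev/disk/by-partlabel` listing shows for slot A on your board (that name is not confirmed in this thread).

```shell
#!/bin/sh
# Hypothetical helper: resolve a partition label to its device node.
# The optional second argument overrides the udev-managed label directory.
partlabel_dev() {
    dir=${2:-/dev/disk/by-partlabel}
    readlink -f "$dir/$1"
}

# Example (label name is an assumption, check the ls output first):
#   partlabel_dev APP
```

If the printed device is not the one you intend to erase, stop before running `dd`.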

Thank you, I’ll try this immediately and reply.

hi, @JerryChang
I followed the steps and tested the ROOTFS_AB function.
When I erase the APP_A partition for testing:

sudo dd if=/dev/zero of=/dev/nvme0n1p1 bs=1M count=1
sudo reboot

it works.

But when I test with:

sudo rm /lib -rf

the device still does not reboot after the panic. The bug still exists!

Can you try the test with sudo rm /lib -rf?
Because I can’t determine the crash reason, I need to test everything I can think of.

please set up a serial console to gather the UART logs; we need to check the complete logs for reference.

Here is the panic log:
panic.log (35.3 KB)

Hi, @JerryChang
Here is a more detailed panic log:
panic.log (70.9 KB)

hello liu.junnan,

let me recap the error logs..
could you please also confirm the PARTUUID?
for instance,
has it already switched to slot-B, or is it still on slot-A?

[    9.905103] EXT4-fs (nvme0n1p1): recovery complete
[    9.905113] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    9.910048] Rootfs mounted over PARTUUID=7aa20bb2-07e3-453b-a9f1-90199df2f0f3
[    9.917483] Switching from initrd to actual rootfs

anyway, did you leave the device there for a while?
the WDT should time out and then trigger a software reset.

hi @JerryChang
It is still on slot-A.
I let the device sit for at least 15 minutes, and it did not reboot.
How can I check the PARTUUID? Please give me a hand, thanks.
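
One possible way to check the PARTUUID from a booted system is sketched below. It assumes the root device is passed on the kernel command line as `root=PARTUUID=...`, as the "Rootfs mounted over PARTUUID=" log line suggests; `root_arg` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the root= argument from a kernel
# command line. Without an argument it reads the running kernel's command line.
root_arg() {
    cmdline=${1:-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            root=*) echo "${word#root=}" ;;
        esac
    done
}

# Example on the target:
#   root_arg
#   lsblk -o NAME,PARTLABEL,PARTUUID
```

Comparing the value printed by `root_arg` with the PARTUUID column of `lsblk` shows which partition the kernel was told to mount. On L4T, `sudo nvbootctrl -t rootfs dump-slots-info` should also report the current rootfs slot, though its exact output on r36.3 is not shown in this thread.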

please try applying the kernel patch below for testing.

---

diff --git a/drivers/clocksource/timer-tegra186.c b/drivers/clocksource/timer-tegra186.c
index 3d662a4..99f95c6 100644
--- a/drivers/clocksource/timer-tegra186.c
+++ b/drivers/clocksource/timer-tegra186.c
@@ -57,6 +57,8 @@
 #define WDTUR 0x00c
 #define  WDTUR_UNLOCK_PATTERN 0x0000c45a
 
+#define WDT_DEFAULT_TIMEOUT 120
+
 struct tegra186_timer_soc {
 	unsigned int num_timers;
 	unsigned int num_wdts;
@@ -75,6 +77,7 @@
 	void __iomem *regs;
 	unsigned int index;
 	bool locked;
+	bool irq_enabled;
 
 	struct tegra186_tmr *tmr;
 };
@@ -175,7 +178,8 @@
 		value |= WDTCR_PERIOD(1);
 
 		/* enable local interrupt for WDT petting */
-		value |= WDTCR_LOCAL_INT_ENABLE;
+		if (wdt->irq_enabled)
+			value |= WDTCR_LOCAL_INT_ENABLE;
 
 		/* enable local FIQ and remote interrupt for debug dump */
 		if (0)
@@ -216,8 +220,17 @@
 static int tegra186_wdt_ping(struct watchdog_device *wdd)
 {
 	struct tegra186_wdt *wdt = to_tegra186_wdt(wdd);
+	unsigned int value;
 
 	tegra186_wdt_disable(wdt);
+
+	/* Disable WDT interrupt once userspace takes over. */
+	if (wdt->irq_enabled) {
+		value &= ~WDTCR_LOCAL_INT_ENABLE;
+		wdt_writel(wdt, value, WDTCR);
+		wdt->irq_enabled = false;
+	}
+
 	tegra186_wdt_enable(wdt);
 
 	return 0;
@@ -307,6 +320,8 @@
 	if (value & WDTCR_LOCAL_INT_ENABLE)
 		wdt->locked = true;
 
+	wdt->irq_enabled = true;
+
 	source = value & WDTCR_TIMER_SOURCE_MASK;
 
 	wdt->tmr = tegra186_tmr_create(tegra, source);
@@ -331,6 +346,13 @@
 		return ERR_PTR(err);
 	}
 
+	/*
+	 * Start the watchdog to recover the system if it crashes before
+	 * userspace initialize the WDT.
+	 */
+	tegra186_wdt_set_timeout(&wdt->base, WDT_DEFAULT_TIMEOUT);
+	tegra186_wdt_start(&wdt->base);
+
 	return wdt;
 }
 
@@ -411,7 +433,7 @@
 {
 	struct tegra186_timer *tegra = data;
 
-	if (watchdog_active(&tegra->wdt->base)) {
+	if (tegra->wdt->irq_enabled) {
 		tegra186_wdt_disable(tegra->wdt);
 		tegra186_wdt_enable(tegra->wdt);
 	}

I’ll try this

Hi, @JerryChang
When I apply the patch, I get this:

patching file drivers/clocksource/timer-tegra186.c
Hunk #3 FAILED at 178.
Hunk #4 succeeded at 220 (offset 1 line).
Hunk #5 succeeded at 320 (offset 1 line).
Hunk #6 succeeded at 346 (offset 1 line).
Hunk #7 FAILED at 432.
2 out of 7 hunks FAILED -- saving rejects to file drivers/clocksource/timer-tegra186.c.rej
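
When a patch only partly applies like this, a dry run shows the failing hunks without touching the source tree or writing a .rej file. The sketch below filters GNU patch output for failed hunk numbers; `failed_hunks` is a hypothetical helper and `wdt.patch` is a placeholder file name.

```shell
#!/bin/sh
# Hypothetical helper: list the hunk numbers that GNU patch reported as FAILED.
# Reads patch output on stdin and prints one hunk number per line.
failed_hunks() {
    sed -n 's/^Hunk #\([0-9][0-9]*\) FAILED.*/\1/p'
}

# Example, run from the kernel source root (wdt.patch is a placeholder name):
#   patch -p1 --dry-run < wdt.patch | failed_hunks
```

For the output quoted above, this would list hunks 3 and 7, which are the ones to compare by hand against the local timer-tegra186.c.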

Did you use the kernel source from:

https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/Kernel/KernelCustomization.html#building-the-jetson-linux-kernel
To Manually Download and Expand the Kernel Sources
In your browser, go to https://developer.nvidia.com/embedded/jetson-linux-archive.
Locate and download the Jetson Linux source files for your release.
Extract the .tbz2 file:

please tell me which tag you are using for r36.3.0, and maybe you can try the steps same time as

hello liu.junnan,

there’s a dependency; please try again with the following kernel patches.
they should apply to l4t-r36.3 directly.
0001-Revert-NVIDIA-SAUCE-clocksource-drivers-timer-tegra1.patch (2.1 KB)
0002-clocksource-timer-tegra186-Enable-WDT-at-probe.patch (3.2 KB)

I have tried this and it has no effect.

Hi, @JerryChang
During a normal boot, I found an error from the watchdog:

[   10.004686] systemd[1]: Using hardware watchdog 'NVIDIA Tegra186 WDT', version 0, device /dev/watchdog
[   10.004704] systemd[1]: Set hardware watchdog to 2min.
...
...
...
[   11.308104] tegra_wdt_t18x 2190000.watchdog: can't request region for resource [mem 0x02190000-0x0219fffe]
[   11.308110] tegra_wdt_t18x 2190000.watchdog: Cannot request memregion/iomap res_wdt
[   11.308111] tegra_wdt_t18x: probe of 2190000.watchdog failed with error -16

Does the watchdog kernel module load after systemd?

Also, the panic log has no watchdog messages. Does the watchdog kernel module not load before the file system starts?
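
To compare the two attached logs, a small filter for watchdog-related lines can help. This is only a sketch, and `wdt_lines` is a hypothetical helper. For context, error -16 is EBUSY: the memory region at 0x02190000 was already claimed when tegra_wdt_t18x probed. Since systemd had already opened a /dev/watchdog named 'NVIDIA Tegra186 WDT' about a second earlier, one plausible reading (an assumption, not confirmed in this thread) is that another driver had registered the same hardware first.

```shell
#!/bin/sh
# Hypothetical helper: print watchdog-related lines from a saved console log.
wdt_lines() {
    grep -iE 'watchdog|wdt' "$1"
}

# Example with the attached logs:
#   wdt_lines normal.log
#   wdt_lines panic.log
```

If `wdt_lines panic.log` prints nothing, no watchdog driver logged anything before the panic in that boot.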

normal.log (43.3 KB)
panic.log (35.5 KB)