Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System

Hey @KevinFFF ,

we have a custom board too, but first of all I am seeing the issue on the Xavier NX Devkit.

Here is the log for 5.0.2:
kernelPanic5.0.2 (83.1 KB)

Here is the log for 5.1:
kernelPanic5.1 (69.6 KB)

Two people confirm that the failover mechanism does not seem to work here:

Could you manually switch the slot with the following command?

$ sudo nvbootctrl set-active-boot-slot <SLOT>

Or do you mean that a kernel panic does not trigger a reset on Jetpack 5.1?

Switching the boot slot works as expected.
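
For anyone following along, the slot state can be inspected before and after a test. This is a minimal sketch, assuming the `-t bootloader|rootfs` option and the `dump-slots-info` subcommand of `nvbootctrl` as shipped with Jetpack 5.x; `show_slots` is only a hypothetical helper name, not something from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the A/B slot state for both slot types.
# Assumes nvbootctrl from Jetpack 5.x is on the PATH.
show_slots() {
    for target in bootloader rootfs; do
        echo "== $target slots =="
        sudo nvbootctrl -t "$target" dump-slots-info
    done
}
```

Calling `show_slots` once after flashing and again after a failed boot should show whether the retry count of the rootfs slot changes at all.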

The kernel panic will not trigger a reset on Jetpack 5.1.

The whole A/B system is useless, though, if no reset happens when something goes wrong.

Could you provide detailed steps so that we can reproduce the rootfs A/B failover not working after a kernel panic?

  1. Flash the board with ROOTFS_AB=1 ROOTFS_RETRY_COUNT_MAX=3 commandline
  2. Boot the board. Delete the content of your rootfs “rm -rf /*”
  3. Reboot
  4. Kernel Panic will occur, but the system will not attempt to boot from slot B

The expected behavior is that it waits a bit, attempts to boot from slot A two more times, and then boots from slot B.
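
The expected sequence can be written down as a toy model. This is only a sketch of the behavior described in this thread, not NVIDIA's bootloader logic: `simulate_failover` and its arguments are hypothetical, and it assumes every failed attempt ends in a reset, which is exactly the part that does not happen on 5.1.

```shell
#!/bin/sh
# Toy model only, not the real bootloader.
# Usage: simulate_failover MAX_RETRIES BROKEN_SLOT
# Prints one line per failed boot attempt, then the slot that finally boots.
simulate_failover() {
    max_retries=$1
    broken_slot=$2
    slot=A
    retries_left=$max_retries
    while [ "$slot" = "$broken_slot" ]; do
        if [ "$retries_left" -gt 0 ]; then
            # Assumption: a reset follows every panic, so a retry can happen.
            echo "attempt slot $slot: kernel panic, reset"
            retries_left=$((retries_left - 1))
        else
            # Retry budget used up: switch to the other slot.
            if [ "$slot" = A ]; then slot=B; else slot=A; fi
            retries_left=$max_retries
        fi
    done
    echo "booted slot $slot"
}
```

With `simulate_failover 3 A` the model prints three failed attempts on slot A and then reports slot B. Without a reset after the first panic, the real system never gets past the first line.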

Is it stuck somewhere? By default there are 3 attempts; it will move to the other rootfs slot if they fail.

It stays at the kernel panic since 5.1

On the previous versions it still worked.

hello seeky15,

I meant…
doesn’t the board attempt to boot from slot A two more times and then eventually boot from slot B?

No, it has a kernel panic and stays at that line forever.

Previously, in 5.0.2, it took about 2 minutes before it attempted a reboot. That does not happen anymore.
Am I not communicating clearly enough?

I thought

The kernel panic will not trigger a reset on Jetpack 5.1.

would be enough for you guys to reproduce it? At least 4 people have reported the same in the last few days…

Hi seeky15,

Are you using Xavier NX with eMMC or SD slot module?

Hey @KevinFFF , it’s a Devkit with an SD Card. Exactly as it is shipped.

This issue looks more like the watchdog issue.

In your JP5.0.2 log, the watchdog was registered successfully:

[    0.394377] tegra_wdt_t18x 30c0000.watchdog: Expiry count is deprecated
[    0.394651] tegra_wdt_t18x 30c0000.watchdog: Tegra WDT init timeout = 120 sec
[    0.394703] tegra_wdt_t18x 30c0000.watchdog: Registered successfully

But in the JP5.1 log there are no watchdog-related messages.

Could you check the result of the following command on your Xavier NX devkit with JP5.1:

$ cat /proc/device-tree/watchdog@30c0000/status

You could also use the following command to trigger a kernel panic and see whether the watchdog resets the board after 120 s:

# echo c > /proc/sysrq-trigger

Yup, that’s an issue:

cat /proc/device-tree/watchdog\@30c0000/status
disabled

Not trying the kernel panic as it obviously won’t work with disabled watchdog…
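
The check can be wrapped in a small function so it is easy to repeat on other boards or images. A minimal sketch: `wdt_status` is a hypothetical name, the default node path is the one from the command quoted above, and treating a missing `status` property as enabled follows the usual devicetree convention.

```shell
#!/bin/sh
# Print the devicetree "status" of the watchdog node.
# Optional $1 overrides the node directory (default: the path used in this thread).
wdt_status() {
    node=${1:-/proc/device-tree/watchdog@30c0000}
    if [ ! -d "$node" ]; then
        echo "node not found: $node"
        return 1
    fi
    if [ -r "$node/status" ]; then
        # The property is a NUL-terminated string; strip the NUL for printing.
        tr -d '\000' < "$node/status"
        echo
    else
        # A node without a status property is treated as enabled.
        echo "okay"
    fi
}
```

On the 5.1 devkit described here, `wdt_status` would be expected to print `disabled`.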

                                watchdog {
                                        wdt-boot-timeout = <0x20>;
                                        wdt-suspend-timeout = <0x78>;
                                        status = "disabled";
                                        phandle = <0x41b>;
                                };

I unpacked L4T 35.2.1 from here: https://developer.nvidia.com/downloads/jetson-linux-r3521-aarch64tbz2
and converted the file /kernel/dtb/tegra194-p3668-all-p3509-0000.dtb to dts…
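
For reference, the conversion can be done with the device tree compiler. A minimal sketch, assuming `dtc` from the `device-tree-compiler` package is installed; `dtb_to_dts` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Decompile a flattened device tree blob into readable source.
# Usage: dtb_to_dts INPUT.dtb OUTPUT.dts
dtb_to_dts() {
    dtc -I dtb -O dts -o "$2" "$1"
}
```

For example, `dtb_to_dts tegra194-p3668-all-p3509-0000.dtb nx.dts` followed by `grep -n -A 5 watchdog nx.dts` lists the watchdog nodes and their `status` values. `dtc` usually prints warnings when decompiling vendor blobs; those can be ignored for this purpose.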

The same there. You disabled the watchdog for some reason…

Please enable the status of tegra_wdt:watchdog@30c0000 in tegra194-soc-base.dtsi and see whether the watchdog works.
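
If rebuilding the device tree from the .dtsi sources is not an option, the same property can in principle be changed directly in a compiled DTB. This is a sketch under assumptions, not a fix confirmed in this thread: it uses `fdtput` and `fdtget` from the `device-tree-compiler` package, `enable_wdt` is a hypothetical name, and the node path `/watchdog@30c0000` is inferred from the /proc/device-tree path above. Getting the patched DTB onto the board is not covered here.

```shell
#!/bin/sh
# Set status = "okay" on the watchdog node of a compiled DTB, in place.
# Usage: enable_wdt FILE.dtb [NODE_PATH]
# NODE_PATH defaults to /watchdog@30c0000 (an assumption, see the text above).
enable_wdt() {
    dtb=$1
    node=${2:-/watchdog@30c0000}
    # Keep a backup of the original blob before touching it.
    cp "$dtb" "$dtb.orig" || return 1
    fdtput -t s "$dtb" "$node" status okay || return 1
    # Print the new value as confirmation.
    fdtget -t s "$dtb" "$node" status
}
```

`enable_wdt tegra194-p3668-all-p3509-0000.dtb` should print `okay` and leave the untouched blob next to it with an `.orig` suffix.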

Hey @KevinFFF

since the compiled DTB is used, that won’t have any effect unless I compile the device tree myself.

Why does the customer have to do this? You have developers who could analyze why the watchdog is not working.

We’ve checked with the developers: the watchdog does not work because its status is disabled in the device tree. I’ve also verified that the watchdog works after enabling it on the devkit.
If you don’t want to modify anything on your side, why ask the question here, and so many questions…

Hey, you guys are the ones telling us which features are supported in 5.1

If you say it works but disable the feature in the device tree, then that is called a bug.
Fix it, we bought your hardware, we expect working software.

Quote from Update and Redundancy — Jetson Linux Developer Guide documentation
The slot that the running system booted from is called the **current slot**. The other slot is called the **unused slot**. The system exchanges the roles of the current and unused slot in the course of an update, or when the software fails repeatedly during or immediately after a boot.

After 5.0.2 I thought the software could not become any worse…

Besides, are you really telling me that you could not find out with a test that A/B failover is not working in 5.1 for 2 weeks, but on the other hand you’re able to test your fix within 28 minutes?

Hi,
We are checking further. As a quick solution, please modify the device tree to enable it:
Jetpack 5.1 Kernel Panic does not lead to reboot with A/B System - #18 by KevinFFF

Hey @seeky15,

Thank you for referencing me!

I think solving the watchdog problem alone wouldn’t have solved the whole issue, would it? The purpose of the watchdog is to restart the system after the kernel panic, and we can already do that manually by pressing the restart button, which sends the system into this endless restart loop. I can confirm that this endless loop also happens on my side.
Regards,
Max

Hey KevinFFF,

Thanks for the important information about the disabled watchdog. Now I understand why the system didn’t restart automatically after the kernel panic. Unfortunately, it still didn’t switch to the backup rootfs partition and went into an endless loop.

Regards,
Max