Boot failure on L4T 35.4.1

Hey,

We have updated our OS to the latest L4T. So far I had not experienced any issues, but today, after rebooting several times, my board got stuck and is not restarted by the watchdog. The boot log is attached. What could cause this behavior? Is this a known issue?

BootFailure.txt (87.7 KB)

A restart does not solve the issue. It looks like the system is broken.

Didn’t see any suspicious fatal log.

I notice you are using a DP monitor. What is the status on DP? Is the GUI still there?

The GUI was black.

I went into the UEFI and saw that both slots are marked as unbootable. How can that happen?
After making them bootable again I could boot.
The only thing I did that could be related is that I changed the watchdog timeout to 30 seconds. When the system reboots it tells me that watchdog0 did not stop. Could it be that the watchdog increments the failed boot count when the system takes longer than 30 seconds to boot?

I feel it is possible.

Could you test more times to confirm it is really related to WDT timeout? I would like to test on devkit too.

Will test it and let you know if it goes away if I increase the time.


After watching what happens I can see that I missed the kernel panics due to the short watchdog timeout.

Somehow after a while the system stops being able to read the extlinux file. It tries 3 times the active slot, then 3 times the other slot and goes into recovery.

There is always a kernel panic when switching from initrd to rootfs. As soon as I make the slots bootable everything works again:
BootFailure2.txt (81.4 KB)

OpenAndReadUntrustedFileToBuffer: Failed to open boot\extlinux\extlinux.conf: Not Found

I doubt the watchdog timeout is the issue as there are kernel panics.
Is there anything in the log that you recognize?

Do you have the log from before this situation happened?

Somehow after a while the system stops being able to read the extlinux file. It tries 3 times the active slot,

And could we narrow it down further with less I/O connected? For example, with no DP connected this time.

Unfortunately the log scrolled off the screen.

Now I changed the bootable flags again.
This time, unfortunately, the system kept telling me that it can’t read the extlinux file.
I unplugged all peripherals, still kernel panics.

I’d suspect some issue with the SSD. I can’t imagine anything else that could come and go like this.
I took out the SSD and plugged it in again and it works again.

Either the SSD connection is bad or the system is buffering something so that it keeps its error state while I unplug the power for a few seconds. For removing the SSD it was off longer of course…

@WayneWWW

I’ve performed some tests.

The kernel panics do not seem to be related to the actual issue.

I tried on the devkit (With our image, so it’s modified compared to the standard, but this does not seem to be a behavior that I have changed):

  • Watchdog 30 seconds
  • Watchdog 120 seconds

When I log in directly after boot and type “reboot”, the system switches the slot after three reboots and goes into recovery after six.

When I log in directly after boot and type “sleep 120; reboot”, the system behaves normally.
This happens with watchdog on 30 and 120.

Which program is responsible for resetting the watchdog? I have the feeling that I am rebooting the system before the watchdog is reset. I cannot observe any reboots caused by the kernel watchdog; the system reboots normally and is not interrupted while doing so.

Are you saying that you hit the slot switch directly after modifying the WDT timeout, with nothing else changed?

I changed the watchdog timeout in the device tree and reflashed.
So I performed the whole test once with a board flashed with 30 seconds wdt and once with a board flashed with 120 seconds.

Then I started reboot tests. In both cases, if I reboot the device directly after it has started, each reboot is counted as an unsuccessful boot.

Sorry about being unclear.

Do you mean that you reboot the device very quickly, right after boot-up?

I guess if you put in a longer delay, it won’t hit the issue.

Yes, that’s what I did.

When the delay is longer it does not happen.
What do I need to do so that fast consecutive reboots are working too?

Hi,

I remember there was an old post from @KevinFFF about the mechanism.

As you already know, we have a mechanism in the rootfs that clears the scratch register on each reboot. If the mechanism fails to clear the register 3 times, the slot ends up in a “cannot boot” state and the system may enter recovery boot.

The problem here seems to be that your reboot is so fast that the mechanism does not clear the register in time.
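
To make the counting easier to follow, here is a toy model of the behavior reported in this thread. It is only a sketch and not the actual bootloader logic: the function name, the variable names and the limit of 3 tries per slot are assumptions taken from the observations above (slot switch after three quick reboots, recovery after six).

```shell
#!/bin/sh
# Toy model only. MAX_TRIES=3 matches what was observed in this thread;
# all names here are hypothetical and not part of L4T.
MAX_TRIES=3

# simulate_reboots COUNT CLEARED
#   COUNT   number of consecutive reboots to simulate
#   CLEARED 1 if the rootfs mechanism cleared the register before each
#           reboot, 0 if every reboot came too early
# Prints where the next boot would land: A, B or recovery.
simulate_reboots() {
    reboots="$1"
    cleared="$2"
    slot=A
    failures=0
    n=0
    while [ "$n" -lt "$reboots" ] && [ "$slot" != recovery ]; do
        n=$((n + 1))
        if [ "$cleared" = 1 ]; then
            failures=0                  # boot was acknowledged in time
        else
            failures=$((failures + 1))  # boot counted as unsuccessful
        fi
        if [ "$failures" -ge "$MAX_TRIES" ]; then
            failures=0
            case "$slot" in
                A) slot=B ;;            # active slot exhausted, try the other
                B) slot=recovery ;;     # both slots exhausted
            esac
        fi
    done
    echo "$slot"
}
```

In this model, `simulate_reboots 3 0` lands in slot B and `simulate_reboots 6 0` lands in recovery, while any number of reboots with the register cleared stays in slot A.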

Hi @seeky15,

It seems you hit the same issue as the following topic.
Orin won’t boot after successive reboots - Jetson & Embedded Systems / Jetson AGX Orin - NVIDIA Developer Forums

Please check that the status of nv-l4t-bootloader-config.service is SUCCESS before you execute each reboot command.

$ sudo systemctl status nv-l4t-bootloader-config
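
For automated reboot tests, the manual check could be scripted roughly as follows. This is a sketch under assumptions, not an official procedure: it assumes the service's main process sets a non-zero ExecMainExitTimestampMonotonic once it has exited and that systemd reports Result=success for a clean run. The function names, the 120 second default timeout and the 1 second polling interval are made up for illustration.

```shell
#!/bin/sh
# Sketch: only reboot once nv-l4t-bootloader-config.service has finished.
# The service name comes from this thread; everything else is an assumption.
SERVICE="nv-l4t-bootloader-config.service"

# Succeeds only if the unit's main process has exited (exit timestamp is
# non-zero) and systemd recorded Result=success for it.
bootloader_config_done() {
    exited=$(systemctl show -p ExecMainExitTimestampMonotonic --value "$SERVICE")
    result=$(systemctl show -p Result --value "$SERVICE")
    [ -n "$exited" ] && [ "$exited" != "0" ] && [ "$result" = "success" ]
}

# Polls once per second; returns 1 if the unit has not finished within
# the given number of seconds (default 120).
wait_for_bootloader_config() {
    timeout="${1:-120}"
    waited=0
    until bootloader_config_done; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

A reboot test could then call `wait_for_bootloader_config 120 && sudo reboot` instead of a fixed `sleep 120; reboot`, so that the reboot is only issued after the service has completed.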
