Jetson TX2 - redundant copy of RFS partition?

Hi,

I’m working on a task to ensure the device boots every time in a production environment. I’ve enabled bootloader redundancy so the device boots from the redundant partition if it fails to boot from the primary boot partition. However, what happens if my rootfs gets corrupted and fails to load?
Is it possible to split the APP partition (28G currently) into two 14G redundant partitions, so that if the primary root partition gets corrupted or for some other reason fails to load correctly, the device loads from the redundant partition?
Any other recommendations? Thanks

This probably won’t help much, but basically you might be talking about having the rootfs run in RAID1 where two partitions are mirrored and a partition failure will use the remaining working partition. Or…are you suggesting having an unmodified and unused partition which only takes over if the first partition dies…and where the first and second partitions are purposely not kept in sync?

There is an issue with RAID on rootfs because you’d need to use an initrd. I have not yet found a way to use an initrd on recent releases:
https://devtalk.nvidia.com/default/topic/1041772/r28-2-initrd-and-device-tree-/

For separate non-RAID rootfs partitions it might not be too bad, but this would have some weaknesses. Currently U-Boot has some environment variables which are macros. Those macros have a default device search order, so for example if you have an SD card with a valid “/boot/extlinux/extlinux.conf”, then this will be loaded instead of the eMMC version. That search-order macro can be edited quite easily from a serial console at the U-Boot prompt. The main weakness is that if the extlinux.conf survives, but the rest of the partition has something fatal, then the bad partition would still be booted: the test only checks for the existence of extlinux.conf and not for other failures. For example, a bad kernel will get booted every time if the extlinux.conf points at it, whereas a completely failed partition would correctly fail over to the other partition since extlinux.conf would not be found.
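
To make that weakness concrete, here is a small Python model of the search order. It is not U-Boot code, only a sketch of the rule "the first device with an extlinux.conf wins". The device names and the `has_extlinux` callback are made up for illustration.

```python
from typing import Callable, Iterable, Optional


def pick_boot_device(search_order: Iterable[str],
                     has_extlinux: Callable[[str], bool]) -> Optional[str]:
    """Return the first device on which extlinux.conf is found, else None.

    This models only the existence test described above; nothing here
    checks whether the kernel or the rest of the rootfs is usable.
    """
    for device in search_order:
        if has_extlinux(device):
            return device
    return None


# Hypothetical state: the first partition is damaged (say /etc is gone),
# but its extlinux.conf survived, so it is still the one selected.
extlinux_present = {"emmc-part1": True, "emmc-part2": True}
print(pick_boot_device(["emmc-part1", "emmc-part2"],
                       lambda dev: extlinux_present[dev]))
```

Failover to the second entry only happens when the file itself cannot be found on the first one.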

You might want to specify more exactly what kind of failover situation you are interested in recovering from.

Yes, you can split it into two partitions. Cboot passes the cmdline (bootargs) to U-Boot, and from it you can derive which set of files/partitions you’re booting. It’s relatively easy to extend the U-Boot environment with a command which walks through the bootargs from cboot and determines which partition to search for extlinux.conf. extlinux.conf can refer to U-Boot environment variables which are set with setenv; this can be useful for passing the right rootfs device on the kernel cmdline (e.g. if you have the same rootfs image on both partitions).
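
The "derive the partition from the bootargs" step could look roughly like this. It is a Python sketch of the parsing logic only; the real thing would be a U-Boot environment command. The parameter name `boot.slot_suffix` and its values are assumptions (the thread does not say which parameter cboot sets), and the slot-to-partition mapping simply reuses the two partitions discussed in this thread.

```python
from typing import Dict, Optional

# Assumed parameter name; check what cboot really passes on your release.
SLOT_KEY = "boot.slot_suffix"

# Illustrative mapping from slot value to rootfs partition.
ROOT_BY_SLOT = {"": "/dev/mmcblk0p1", "_b": "/dev/mmcblk0p2"}


def parse_cmdline(cmdline: str) -> Dict[str, str]:
    """Split a kernel command line into key/value pairs.

    Flags without '=' map to an empty string. Only the first '=' splits,
    so 'root=UUID=abc' yields key 'root' and value 'UUID=abc'.
    """
    params: Dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def root_device(cmdline: str) -> Optional[str]:
    """Return the rootfs device for the booted slot, or None if unknown."""
    slot = parse_cmdline(cmdline).get(SLOT_KEY)
    if slot is None:
        return None
    return ROOT_BY_SLOT.get(slot)


print(root_device("console=ttyS0,115200 boot.slot_suffix=_b quiet"))
```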
NVIDIA’s update software makes sure that when a given set boots, it marks that set as working. If, for example, the system crashes *) on boot from a given set of boot software (and, in this case, rootfs partition) enough times (hardcoded to 7), it will switch to the other set. This should be true during an update; I don’t know at the moment whether it’s still true for regular boots after the first successful boot following an update.

edit: *) A note regarding crashing: this will work if the kernel reboots itself after a crash. Normally the watchdog should make sure of that, but NVIDIA’s kernels are configured to reboot on panic, and AFAIK the watchdog doesn’t work/shouldn’t be used.
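
The switch-over rule described above, as a toy Python model. The slot labels "A"/"B" are hypothetical, and whether the real threshold triggers at exactly 7 failed boots or only after the 7th is an assumption here; only the number 7 comes from the post.

```python
MAX_FAILED_BOOTS = 7  # the hardcoded value quoted above


def select_slot(active: str, failed_boots: int,
                max_failed: int = MAX_FAILED_BOOTS) -> str:
    """Return the slot to boot next.

    Stays on the active slot until the failed-boot count reaches the
    limit, then flips to the other slot. Slot names are illustrative.
    """
    if failed_boots >= max_failed:
        return "B" if active == "A" else "A"
    return active


print(select_slot("A", 6))
print(select_slot("A", 7))
```

Note this only helps if each failed attempt actually ends in a reboot, which is the point of the note on crashing.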

Thanks for your response and all the information, linuxdev and p.figiel.

Regarding the failover situation, I do not know exactly… maybe corruption of eMMC blocks/files, or physical damage in a harsh environment. Right now I’m just trying to evaluate the options I have.

Here are the 2 tracks I’m evaluating -
a) Redundant APP partition
I did split the APP partition into 2 (they showed up as mmcblk0p1 and mmcblk0p2) with the system.img loaded in both. By default, it booted from the first partition, mmcblk0p1. I added a block in extlinux.conf to boot from mmcblk0p2. When I changed the default to switch between mmcblk0p1 and mmcblk0p2, it correctly booted to the RFS from the expected partition. However, as linuxdev mentioned, if I emulate some fatal issue in the file system, say a missing ‘/etc’, U-Boot doesn’t know about it, boots from the bad partition, and gets stuck with no way to recover. I don’t think switching to an alternate RFS partition would happen on the fly without manual intervention.

b) Configure an external SSD to act as an alternate boot/storage device. I’m working on this now. However, the question is: if it fails to boot from eMMC, would it automatically boot from the alternate boot device (the SSD in this case)? What if the bootstrap is fine but there is a failure in booting the RFS? Probably no way to recover then…?

Regarding the watchdog, if your point is about the known issue in L4T 28.2.1 - 20037708 (watchdog and redundant boots), there is no mention of it being resolved in the next/latest release, 31.0.2. I asked about it in another related thread and am waiting for a response. In any case, it won’t help in this scenario, where the system is stuck loading the RFS, but it may still help in many other crash scenarios on a fully booted system.

R31.0.2 is only for Xavier, and not for a TX2.

Right now the macros within the boot environment only check for finding extlinux.conf. If the file system is corrupt such that the file cannot be found, then it goes on to the next device to test for finding extlinux.conf. Unfortunately there are a lot of ways to trash the rootfs without trashing extlinux.conf.

On the other hand, you could perhaps test for other files as well. This would have the same weakness in terms of not really determining if boot succeeded, but instead determining only if those files exist.
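
As a rough illustration of "test for other files as well", here is a Python sketch that checks a mounted rootfs for a list of files. It runs from a working Linux system, not from U-Boot, and the file list is only an example I picked, not a vetted set. Like the extlinux.conf test, it proves existence only, not that the system will boot.

```python
import os
from typing import Iterable, List

# Example list only; choose files that matter for your own image.
REQUIRED_FILES = (
    "boot/extlinux/extlinux.conf",
    "boot/Image",
    "etc/fstab",
    "sbin/init",
)


def missing_files(rootfs_mount: str,
                  required: Iterable[str] = REQUIRED_FILES) -> List[str]:
    """Return the required paths that are absent under rootfs_mount.

    lexists() is used so that a symlink counts as present even when its
    absolute target cannot be resolved from outside the mounted rootfs.
    """
    return [rel for rel in required
            if not os.path.lexists(os.path.join(rootfs_mount, rel))]
```

Usage would be something like `missing_files("/mnt/rootfs_b")`, where the mount point is a hypothetical path; an empty list means every listed file was found.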

You could use a risky scheme such as having the booted system remove the first rootfs’s extlinux.conf (I would actually have it gzip instead of rm), and only upon reaching a certain success at booting place the extlinux.conf back where it should be. If extlinux.conf is not found, then it would go to the alternate partition. Unless you are in some sort of specialty situation this would be rather fragile.
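
A minimal Python sketch of that gzip-and-restore idea, to show how little there is to it and where it can go wrong. The function names are made up, and when to call them (early in boot versus after some "boot succeeded" milestone) is left to you. A power cut between the two calls leaves the partition without an extlinux.conf, which is exactly the fragility mentioned above.

```python
import gzip
import os
import shutil


def disarm(conf_path: str) -> str:
    """Compress conf_path to conf_path + '.gz' and remove the original.

    Until rearm() runs, a boot loader looking for conf_path will not
    find it and should move on to the next device in its search order.
    """
    gz_path = conf_path + ".gz"
    with open(conf_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(conf_path)
    return gz_path


def rearm(conf_path: str) -> None:
    """Restore conf_path from conf_path + '.gz' once boot is judged good.

    The file is written to a temporary name and renamed into place so a
    half-written extlinux.conf is never visible under the real name.
    """
    gz_path = conf_path + ".gz"
    tmp_path = conf_path + ".tmp"
    with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())
    os.replace(tmp_path, conf_path)
    os.remove(gz_path)
```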

Unless you can get initrd to work you can’t boot to a RAID1 rootfs. If you were able to do this, then failure of a partition would automatically do the right thing and run in degraded mode. On the other hand, any write of files which makes for an invalid boot would cause both mirrors to be invalid and it still wouldn’t do what you want.

I don’t know of any easy solution.