Is Jetson software protected against an incorrect shutdown?

Hi,
I’m investigating if we can use a Jetson Orin module for a new product.
My biggest fear is that a customer does not shut down the Jetson correctly, or that there is a power loss, and that because of this the Linux system gets corrupted.

Is the system protected against this so that it will always start up correctly?

Jetsons are full computers, not microcontrollers. This means most of anything applicable to Linux is applicable to a Jetson. The filesystem type is ext4, which is a journaled filesystem. ext4 only protects against corruption, not against data loss. That protection is limited by the size of the journal, so if enough data is unwritten at the moment of power loss, then corruption will also occur. There is basically no such thing as a full computer with a read/write filesystem which won’t suffer from an incorrect shutdown. You can find more details here:
https://forums.developer.nvidia.com/t/topic/252664/5

If this is used as a kiosk, then sometimes people will use an overlayfs filesystem. All this does is make the filesystem read-only and emulate writes with a RAM layer on top of it. If power is lost, then it reverts to the original filesystem and all RAM edits are gone. This is suitable for some appliances, but not so much for normal computer use.
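
Just to illustrate the mechanism (the directory names below are made up for the example, and Ubuntu-based L4T also has prepackaged tooling for doing this to the rootfs), an overlay mount looks roughly like this:

# RAM-backed write layer over a read-only lower directory (paths are placeholders):
sudo mount -t tmpfs tmpfs /media/root-rw
sudo mkdir -p /media/root-rw/upper /media/root-rw/work
sudo mount -t overlay overlay \
  -o lowerdir=/media/root-ro,upperdir=/media/root-rw/upper,workdir=/media/root-rw/work \
  /media/root-merged

Everything written under /media/root-merged lands in the tmpfs and disappears at power off; the lower directory is never modified.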

Consider what you do with a Jetson to be just like a desktop PC. If you wouldn’t yank the power cord on that, then don’t do so on a Jetson.

You never know what customers will do. And a sudden power loss can always happen.

I guess I can live with data corruption. But the computer should always be able to start up.
Will that be the case?
Maybe I can run some filesystem check after startup to check for corruption and repair it if possible.

No. The hardware won’t fail, but the software in any writable partition may corrupt. If this were a desktop PC and you were to yank the power in the middle of operation, then I’d expect it might need the operating system repaired or reinstalled.

One solution is that some manufacturers will add an uninterruptible power supply, and when power is lost, an automatic emergency shutdown runs before the power supply battery wears out. This would require that any power button not actually cut power to the Jetson itself, but instead only cut the input power feeding the battery backup. Jetsons use only a tiny amount of power compared to a desktop PC, and so an uninterruptible power supply tends to be far smaller and more compact than anything that would work on a PC. In some cases something like a super capacitor would suffice in place of a battery.
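
As a very rough sketch of what the emergency-shutdown trigger could look like (assuming the UPS exposes a “power good” signal wired to a GPIO line, and that the libgpiod v1 command line tools are installed; the chip and line numbers are placeholders you’d adjust for your carrier board):

#!/bin/sh
# Wait for the power-good line to drop, then shut down cleanly.
gpiomon --falling-edge --num-events=1 gpiochip0 42
logger "UPS input power lost, performing clean shutdown"
shutdown -h now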

There is no case of a filesystem corruption which does not have the possibility of failing to boot. Improper shutdown is a proverbial “accident looking for a place to happen”. Often the Jetson will survive this and boot OK at the cost of losing whatever cached/buffered writes were not flushed to disk, but as soon as either (A) that content is used in boot, or (B) the amount of content exceeds the journal size, all guarantees are gone.

The problem is that this product will not have a monitor, keyboard and mouse connected (in normal use). So for the user this may not feel like a computer, but just some device. Like a printer for instance.

Is there a way to make the system more tolerant of an unexpected power switch-off?

Can we, for instance, simply not use the filesystem anymore after boot? And if we need file storage for the application, use a more robust filesystem on some other storage device?

The URL I added in post #2 talks about getting around such an issue. If the end user may do this incorrectly, then your shutdown hardware needs to force-trigger a proper shutdown. This is only possible if power remains on long enough to do so (on a Jetson, which doesn’t use much power, this is probably easier than with a desktop PC). There really isn’t much of a substitute.

Tradeoffs:

  • Run synchronously without cache/buffer. This will destroy solid state memory rather quickly, and performance will be far, far worse (a sample fstab entry for this follows this list).
  • Run a much larger ext4 filesystem journal. This is not foolproof, since exceeding the journal will still cause corruption. Any data that had not yet reached the disk when power was lost can still be gone, even if the system boots without corruption. Journals prevent corruption; they don’t stop data loss.
  • Run a read-only filesystem and send changes to RAM only. The filesystem will never need to recover a journal, but anything written to the RAM overlay of OverlayFS is 100% lost when power is cut.
  • Make changes on a remote device over a network, and don’t change the Jetson itself. This has so many disadvantages I won’t even mention them.
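
For the first bullet, a synchronous mount is just a mount option. A hypothetical fstab entry (the device name is an assumption, and again, expect severe performance loss and flash wear):

/dev/nvme0n1p1   /data   ext4   defaults,sync,dirsync   0   2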

Just to gain perspective, consider that the journal itself is written synchronously. As far as the journal itself is concerned, it is safe to cut power randomly. The journal is only updated as content on the filesystem is actually committed to disk. This is why a journal can be used to reverse the metadata changes of a particular incomplete write operation: these are exactly known changes in a tree structure. Now consider that if there is no journal entry for data which was changed but not written, then there is no knowledge of how to reverse that content and excise just the incomplete changes. Because it is a tree structure, when the journal is not large enough to hold the history of a change, a change to any location on the disk can corrupt any or all of the rest of the disk once the “unknown” (not in the journal) issue is cut out. In that case it is possible that changing a README file in your home directory ends up with an fsck operation which cuts out a piece of the kernel itself, or init might be lost and the entire system cannot boot. There is no limit to what or how much of the system is lost when blindly “fixing” corruption without a journal history.

If we had a “fix” step to repair content after a corruption, then it would only “seem” to fix it. The filesystem structure would indeed be repaired, but whatever is lost can be astonishingly important and might not even be visible until the system has further destroyed itself.

You are limited to either eliminating the failure which cuts up the disk, or backing up to the outside world so you can restore from backup. Filesystem checks do not fix the damage; they simply reduce it to the point where a successful boot might be possible. A system which flushes buffer and cache to disk every 10 seconds or so will always be at risk of losing the last 10 seconds of transactions. There is no operating system on earth which can change this.
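
The flush interval itself is a kernel tunable, so you can narrow the window of lost writes at the cost of more frequent writes to flash. For example (the values shown are the usual kernel defaults, in hundredths of a second):

sysctl vm.dirty_writeback_centisecs    # how often the flusher thread wakes up (typically 500 = 5 s)
sysctl vm.dirty_expire_centisecs       # how old dirty data may get before it must be written (typically 3000 = 30 s)
sudo sysctl -w vm.dirty_writeback_centisecs=100   # example: flush roughly every second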

Thanks for the detailed answer.

I have no problem that data get lost if there is an unexpected power down. My problem is that the system should not get corrupted. It should be able to start up always.

See it like a printer. A printer should never get corrupted because of a power failure. I cannot ask a customer to install new software because of a power failure; that would be a very bad user experience. If it happens in the middle of a print, the print gets lost. No problem with that.

Maybe I should go for the read only file system option.
Maybe I can add an extra SSD that is not used by the OS, and use it for storage of data for my application only. Normally nobody writes to that, so no power failure problems. And when I write something the user will probably be aware of it and not shut down the power. And if he does, I can always recover because the OS still works.

That makes me think: is it possible to use different disk partitions for that? Thus one write protected partition for the OS and one r/w partition for my data (which can be reformatted if needed)?

One must distinguish between synchronous disk writes and buffered/cached writes (which are asynchronous). The journal can be tuned to have a larger size, but this has its own tradeoffs. Once you exceed the journal’s records (which are used to reverse the changes from an incomplete write), there is no such thing as a system which won’t corrupt.
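
If you want to see what you currently have before tuning anything, the journal size can be read from the superblock (device name assumed; the exact field names vary a bit between e2fsprogs versions):

sudo dumpe2fs -h /dev/mmcblk0p1 | grep -i journal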

Some microcontrollers, which are astonishingly slow compared to what you are thinking of, run entirely synchronous. If the controller does not have a need to write, e.g., its programming is in a read-only ROM, this works out well and the system is immune to corruption.

Several things happen if you are unable to live with a small journal, and need an actual guarantee:

  • Wear on solid state memory (eMMC or SD card, for example) will destroy that memory in a very short time, even with wear leveling.
  • Performance will drop by orders of magnitude. This is not a small or insignificant drop. We’re talking about performance dropping back to something from the 1970s for any kind of storage which won’t have wear issues (e.g., an old style spinning-platter disk does not have wear leveling issues, but the performance of these disks without cache/buffer is far slower than you would expect…cache/buffer is an enormous speed boost).
  • A kiosk style application reads from solid state memory, but only writes to a RAM buffer; that RAM buffer overlays the read-only solid state memory to give the illusion of read-write, and is limited by how much RAM you have. When power is lost, there will never be corruption, but you will lose 100% of anything written during that boot.

Note that a custom ext4 tuning can increase the journal size. This implies less space for storage. The journal itself does contribute to wear of solid state systems. Anything which writes contributes to this, but millions of writes to cache/buffer before a flush may result in only a single physical write. The larger that synchronous journal gets, especially in relation to the total size of the eMMC or SD card, the faster that memory will fail. The existing defaults don’t have much of an issue with causing failure, but a larger journal might.
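
As a sketch of what that tuning looks like (this must be done on an unmounted partition, e.g., on an SSD attached to another machine or from a recovery environment; the device name and size are assumptions):

sudo tune2fs -O ^has_journal /dev/nvme0n1p1   # drop the existing journal
sudo tune2fs -J size=400 /dev/nvme0n1p1       # recreate it at roughly 400 MiB
sudo e2fsck -f /dev/nvme0n1p1                 # verify before putting it back in service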

Some of the Jetson hardware is designed with an A/B redundancy whereby there is a backup partition. If one partition fails, then it will go to the other partition. Whether or not this allows repair of the original failed partition is a question you have to ask at each failure. Certainly this will involve a human intervening for repairs.
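
If your L4T release has A/B rootfs redundancy enabled, the slot state can be inspected from a running system with something like the following (consult the L4T documentation for your specific release, since the available options differ between versions):

sudo nvbootctrl get-current-slot
sudo nvbootctrl dump-slots-info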

Even on a system where you don’t think anything is being written, there are in fact some small writes that often occur. Consider that lock files, which don’t contain anything, do in fact write to the content of a directory. Named pipes tend to have a filesystem entry even though any “content” is going through a driver and not the disk. If you are certain that nothing other than the o/s lock files and temp files are being written, then you have a very good chance that even a small journal will prevent corruption. Still, this is not a guarantee unless the filesystem is truly mounted read-only.

If your data storage and all significant writes go to an external SSD, and if that SSD is not the operating system partition, then this will have a lot of advantages. Speed is one of them; less wear of the eMMC is another. However, the SSD itself must have a journal and buffer/cache if it is to operate “normally” like any other partition which can be written to. You will lose data on the SSD from power loss. If that data exceeds the journal, then the SSD will corrupt. If the mount options to a corrupt SSD are not set up correctly to tolerate error, then boot will fail. Still, it is easy to set up such that the SSD won’t fail boot if it corrupts. Then you could fix the SSD corruption at the risk of significant loss of anything on that partition.

An option to improve this situation requires knowing something about mount options. If you have something like an eMMC rootfs partition for everything, and there is content in a directory, then mounting an SSD partition onto that eMMC directory will cause the original content to be hidden. The “hiding” goes away upon unmount. This means that if you were to copy “/home” to an SSD partition, and then mount that SSD partition onto “/home”, the content on the eMMC is protected and only reappears if the SSD fails to mount. Any updates to the SSD would not go to the eMMC, though. If for some reason the SSD partition mounted on “/home” fails to mount, then the system reverts to the eMMC version of the content, and at that point the eMMC begins receiving the writes instead of the SSD.
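
A sketch of seeding the SSD copy and then mounting it over the eMMC content (the partition name and temporary mount point are assumptions for illustration):

sudo mkdir -p /mnt/ssd_home
sudo mount /dev/nvme0n1p1 /mnt/ssd_home
sudo rsync -aAX /home/ /mnt/ssd_home/
sudo umount /mnt/ssd_home
sudo mount /dev/nvme0n1p1 /home    # the eMMC /home is now hidden underneath, not erased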

Common disk arrangements on any *NIX, for reliability purposes, might include these items:

  • A separate partition for:
    • /var
    • /tmp
    • /home
  • Use of rsync on occasion to update the eMMC /home from the SSD /home (optional).

What the above would do is to make even temporary files and logs write to the SSD. You have to be careful with “/var” though, since it holds the dpkg/apt package database. Installing new packages won’t happen often, but you’d likely want to update both copies every time you change packages, even if it is just an update of existing packages and nothing new.
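
One hedged sketch of refreshing the hidden eMMC copy after package changes (this assumes “/var” lives on the SSD; a bind mount of “/” exposes the eMMC content that sits underneath the mount points):

sudo mkdir -p /mnt/emmc_root
sudo mount --bind / /mnt/emmc_root
sudo rsync -aAX --delete /var/ /mnt/emmc_root/var/
sudo umount /mnt/emmc_root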

The /var can be set up more finely grained, and so you could for example not use the SSD partition for all of /var, but instead do something like mount the SSD partition on /var/log.

Of course if you have three separate SSD partitions for /var/log, /tmp, and /home, then you cannot share space between those three partitions. You have to have a large enough partition for each individual mount point, and if you choose too little or too much, then you’re going to have a lot of work ahead of you to tweak that. LVM (logical volume manager) can help deal with this, but then your boot options will get a lot more complicated.
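
A minimal LVM sketch, just to show the moving parts (the device name and sizes are made up; you’d still add fstab entries for each logical volume):

sudo pvcreate /dev/nvme0n1p1
sudo vgcreate ssdpool /dev/nvme0n1p1
sudo lvcreate -L 20G -n home ssdpool
sudo lvcreate -L 5G  -n log  ssdpool
sudo lvcreate -L 5G  -n tmp  ssdpool
sudo mkfs.ext4 /dev/ssdpool/home   # repeat for the other volumes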

You won’t find any magic bullets which makes the system perfectly safe without either an extreme performance hit or reduction in solid state memory life.

Incidentally, you would not use the “nofail” option for the rootfs “/” or the “/boot”. However, if you have correctly set up a backup “/home” and then copied it to SSD, then the mirror partition could be mounted with the “nofail” option. The average user might not realize that “/home” had failed and reverted to eMMC. However, here is an example entry for “/etc/fstab” to mount an SSD partition with the ability to continue boot even if it fails (I’m pretending the SSD partition is “/dev/nvme0n1p1”, but you have to adjust for whatever the actual SSD partition is):

/dev/nvme0n1p1   /home  ext4 defaults,nofail  0   2

In the above I use “2” because the rootfs “/” uses “1”. The “1” means the first partition to error check, and the “2” means the second partition to error check. Recovering a journal on the rootfs first, followed by the device which mounts on the rootfs, is the logical order.

The “defaults” is there to create a “normal” mount; it is actually an alias for several options. The option list is comma-delimited, so appending “,nofail” (without spaces) adds that option. Normally, if “/home” failed to mount, boot would halt and offer some sort of rescue environment or suggest fixing the device before boot can continue. With “nofail” it won’t do this; logs will show the failure, and boot will continue without the mount of “/home” on the SSD. You of course can’t do this with “/” or “/boot”, but you can do it with a lot of partitions (the suggestions above for “/var/log” and “/tmp” are candidates).
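
Hypothetical additional entries following the same pattern (the partition names are assumptions):

/dev/nvme0n1p2   /var/log   ext4   defaults,nofail   0   2
/dev/nvme0n1p3   /tmp       ext4   defaults,nofail   0   2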

Some people choose to use a network device for “/var” or “/var/log”, with nofail, and so logs can have a network device attached for debugging, but detached for normal boot.
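
For example, something like this (the server name and export path are made up, and the NFS client tools must be installed):

logserver:/export/jetson-logs   /var/log   nfs   defaults,nofail,_netdev   0   0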

One final note: A partition UUID can be used in place of a device name, e.g., instead of “/dev/nvme0n1p1” you could name one exact and specific partition by its UUID. However, if you were to replace the disk with a new one, this would only mount if the UUID were cloned. Sometimes this is what you want, because it gives you a chance to rsync your “/home” to a new SSD and then set the UUID in /etc/fstab for the new device.
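
To find the UUID and reference it in /etc/fstab (the UUID shown here is made up):

sudo blkid /dev/nvme0n1p1
UUID=1234abcd-5678-90ef-1234-567890abcdef   /home   ext4   defaults,nofail   0   2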
