Mount failure (maybe after power cycle)

I successfully installed JetPack 4.4 on an SD card with my Jetson Xavier NX production module and installed all the SDK components.

I left the unit running for a couple days and may have cut and reapplied the power once or twice during that time.

When I returned to use the device, I got the following error (see image):

Could this have been caused by the power cycles? How can I avoid it happening again? Is there a way of recovering from it?

I have now reflashed the device and it is working again, but I certainly need to work out how to make the device more robust for customers.

Yes, this is likely the cause. The ext4 filesystem has a journal to play back and remove uncommitted changes after a power cut, but it has a limited size. If there is enough data being written at the time of power loss, such that the data size exceeds the journal limits, then there is nothing the journal can do and the filesystem becomes corrupt.

You have to avoid cutting power without a proper shutdown. You might be able to take shortcuts, e.g., forcing a remount to read-only and then cutting power, but normally treat this like your desktop PC…don’t yank the power cord to turn it off (imagine if you always pulled the power plug on your PC to turn it off).

I’ve found it difficult to drop into a repair shell on a Jetson the way you can with a PC. I’d have to sit down and see if there is a way to drop into a root shell, but I don’t think you can with a Jetson. The one way I know for sure is to clone the rootfs, run fsck on the clone on the host PC while it is attached to a loopback device, and then reflash the Jetson using the repaired clone instead of the default generated image. This of course takes a lot of time, and the host PC needs a lot of spare disk space.

Of course, if you have an SD card model instead of an eMMC model, then life is much easier because you don’t really need to clone. You could just run fsck directly on the SD card from the host PC and not even need loopback. What model do you have?
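As a rough sketch of the SD card case, the check from a host PC might look like the following. The device name /dev/sdX1 used in the usage note is a placeholder and not a real path; identify the actual rootfs partition first (for example with lsblk). The DRY_RUN switch and the function names are additions for illustration, so the commands can be previewed before anything touches the card.

```shell
#!/bin/sh
# Sketch: repair an ext4 rootfs partition on an SD card from a host PC.
# Assumption: the card's rootfs partition is the device passed as $1.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Unmount the partition (the host may have auto-mounted it), then check it.
fsck_sd_rootfs() {
    part="$1"
    run sudo umount "$part" || true
    # -f: check even if the filesystem is marked clean
    # -y: answer "yes" to every repair prompt
    run sudo e2fsck -f -y "$part"
}
```

Running `DRY_RUN=1 fsck_sd_rootfs /dev/sdX1` only prints the two commands. Without DRY_RUN the function really unmounts and repairs the named partition, so double-check the device name first.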

Hi @linuxdev, thank you for your answer. I’m using the Photon carrier board from Connect Tech, which has a Xavier NX production module and an SD card. I’ve had to use the SD card because JetPack plus the SDK components used 99% of the eMMC. I shall ensure that I do a controlled shutdown going forward in my tests. However, it’s not ideal or always practical for customers to do so when using the NX in the field. Is there a way I can configure it to be more robust so that the system can self-recover if something gets corrupted (or avoid corruption in the first place)?

I have not worked with it, but there are ways to have a backup image to boot to. Even so, it doesn’t mean you’ll be able to repair everything in the corrupt partition the way you would want to. There are bound to be lost files and data, and that is just the nature of losing power in the middle of a write.

A desktop PC would also fail under the same circumstances. However, someone worried about clean shutdown with a PC would probably use a UPS. The PC user would also be advised to make regular backups. Those two concepts apply to the Jetson as well, although it isn’t so easy to keep regular backups of a Jetson. You could operate this on battery and have a power-failure detector which tells the Jetson to do a clean shutdown before battery power is depleted. The Jetson runs a full Linux operating system, and you can’t really ask it to survive anything a full PC would not survive.

  • Can you say more about the cause and/or predictability of when there might be an unexpected abrupt power loss?
  • How much warning do you have that power is about to be lost?
  • Can you say more about whether there are possibilities of backups?
  • Is the SD card just data storage, or has part of the operating system itself been moved to the SD card?

Any detail about your current situation might help.

If I used a NVMe M.2 SSD instead of an SD card, might that fare better on sudden power loss?

No, the different storage types have the same underlying problems. Every bulk storage device with any significant performance out there has some sort of buffering, and writing to the device goes through that buffering. When a command to write to disk is complete, the write is rarely actually complete…buffer flushing is still needed.

On an old tech hard drive you could disable caching/buffering and write synchronously, but performance would be absolutely terrible.

On solid state memory, if you were to disable caching/buffering and set up for synchronous access, then it would also have terrible performance (although it would likely be better than old tech spinning hard drives). Unfortunately, this would also destroy the SSD or NVMe very quickly. You could expect under those circumstances to not only operate with bad performance, but also to be replacing the device quite early in its shortened life.

If you really wanted to get away with yanking power, then you’d have to provide some sort of internal battery which gave it time to shut down even when the device has been physically removed.

Some people compromise when the situation is controlled (especially with kiosk style appliances), and might use OverlayFS with an underlying read-only filesystem. What this does is read from the disk itself, but writes go to a special RAM layer which looks like the disk. Once something has been written, reads of it come from the RAM layer instead of the disk. The end application has no knowledge of this and it just reads or writes the correct thing automatically. Upon reboot the system starts with a fresh copy of the original disk, and loses all changes. If you need changes to remain after power loss or reboot, as in normal long term operation, then this won’t do the job.
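To make the OverlayFS idea concrete, here is a minimal sketch which overlays a RAM-backed writable layer on top of one read-only directory. All paths (/opt/app-ro, /mnt/ram, /opt/app) and the helper name overlay_opts are hypothetical. Overlaying the root filesystem itself needs extra setup early in boot, which is not shown here.

```shell
#!/bin/sh
# Sketch: build the option string for "mount -t overlay".
#   $1 = read-only lower directory (the data on disk)
#   $2 = upper directory (receives all writes; placed in RAM here)
#   $3 = work directory (scratch space required by overlayfs)
overlay_opts() {
    lower="$1"; upper="$2"; work="$3"
    echo "lowerdir=${lower},upperdir=${upper},workdir=${work}"
}

# Usage (as root, hypothetical paths):
#   mount -t tmpfs tmpfs /mnt/ram              # RAM layer; contents vanish at power-off
#   mkdir -p /mnt/ram/upper /mnt/ram/work      # upperdir and workdir must share a filesystem
#   mount -t overlay overlay \
#       -o "$(overlay_opts /opt/app-ro /mnt/ram/upper /mnt/ram/work)" /opt/app
```

With this in place, anything written under /opt/app lands in the tmpfs, and the underlying directory on disk is never written through the overlay.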

The desire to be able to simply cut power and have data remain, while the life of the device is not harmed and performance is good is a bit of the search for the holy grail. Doing this without backup power to complete the write requires new tech which does not yet exist.

hi @linuxdev Thank you very much for your detailed response. I am going to change the wiring so that, during normal use, the switch always supplies power and it will just act as a way of triggering a controlled shut down or wake from idle.

However, I will still need to ensure my system is robust enough to handle uncontrolled shutdowns. I’m thinking maybe I could use a relay and a capacitor, so that the relay activates on power loss and instantly begins a controlled shutdown, and the capacitor provides enough power to allow the NX to get into a safe state before power is gone. Would you know roughly how much time the NX needs to get to a safe shutdown state on receiving a “sudo shutdown -h now” command? 5 seconds? 10 seconds?

This is a good approach, but I could not tell you about how much time is required. It depends on circumstances.

Normally I would use the command line “sudo shutdown -h now”, and this performs a full and normal shutdown. This method of shutdown can be “nice” to applications, but how long the shutdown takes will depend on what is running. In cases where a UPS (providing about 1 to 5 minutes of run power) is used, this would pretty much always be all that you need.

Normally I would consider something known as “Magic SysRq” to be a feature for development or for emergencies, and this provides an ability to perform an emergency sync (of the disk) and remount of the filesystems to read-only. This would not necessarily be as kind to a user space application in the sense of allowing the application to flush its own buffer, but it would pretty much guarantee that whatever is buffered for hard drive write gets written and the filesystem protected in a shorter time than normal shutdown. This could possibly work as a reliable method of protecting the system with 5 seconds of power.

Regarding Magic SysRq, normally a system which has a local keyboard attached can enter certain keystrokes, and these are bound to a debug set of code listening for those keystrokes. These keystrokes tend to work even if much of the system is otherwise failed. For example, if I have a system lock up on me, I might run the following key bindings (“ALT” and “SYSRQ” are actual keys):

# Call sync twice:
ALT-SYSRQ-s
ALT-SYSRQ-s
# Remount read-only ("u" for "umount" and then read-only mount):
ALT-SYSRQ-u
# Force immediate reboot (which is ok because we have flushed buffers and switched to read-only):
ALT-SYSRQ-b

You won’t have a keyboard, but you can also “echo” combinations into “/proc/sysrq-trigger” to achieve the same effect. The history is that people using kernel debuggers over a serial UART do not have access directly to a keyboard, but need to stop and start the kernel or enter different modes (check out KGDB and KGDBOC), and sysrq-trigger is used for this. Stopping a kernel or starting it for a debugger is not particularly different from stopping or starting parts of the kernel due to a bug or other failure.
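A minimal sketch of the “echo” approach, using the same sequence as the key bindings above but without the final reboot. The function names and the SYSRQ_TRIGGER variable are additions for illustration; the variable exists only so the script can be tried against an ordinary file before pointing it at the real kernel interface.

```shell
#!/bin/sh
# Sketch: emergency sync and remount read-only via /proc/sysrq-trigger.
# Must be run as root when writing to the real trigger file.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Send a single Magic SysRq command letter to the kernel.
sysrq_send() {
    printf '%s\n' "$1" > "$SYSRQ_TRIGGER"
}

# Sync twice, then remount all filesystems read-only.
emergency_readonly() {
    sysrq_send s   # emergency sync
    sysrq_send s   # sync again
    sysrq_send u   # remount read-only
}
```

Calling `emergency_readonly` as root performs the sequence; `sysrq_send b` afterwards would force the reboot. The write to the trigger file may return before the kernel has finished the work, so when measuring how long this takes it is probably safer to watch the kernel log (dmesg) than to time the echo itself.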

Note that for security purposes one has to be user root (or sudo) to work on this, and that different parts of Magic SysRq can be enabled or disabled through a mask. To see current settings:
cat /proc/sys/kernel/sysrq

If the value is “1”, then all Magic SysRq functions supported by that CPU will be allowed (a desktop x86 PC might have different options than an ARM64 CPU). If the value is not “1”, then you can enable everything by adding this line to “/etc/sysctl.conf” (not normally required on a Jetson):
kernel.sysrq=1
(“1” is a special value which enables all functions. “0” disables Magic SysRq, and any larger value is a bitmask which enables only the functions whose bits are set.)

Different software releases might or might not default to “1” in “cat /proc/sys/kernel/sysrq”, and so if you worry about one release behaving differently, then you could enable this in “/etc/sysctl.conf” even if it is not technically required.
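A small sketch for checking the current setting from a script. The function names and the SYSRQ_FILE override are hypothetical and exist only for illustration; the interpretation of the value (0 is off, 1 is everything, anything larger is a bitmask) follows the kernel documentation.

```shell
#!/bin/sh
# Sketch: report what the current kernel.sysrq value permits.
SYSRQ_FILE="${SYSRQ_FILE:-/proc/sys/kernel/sysrq}"

# Translate a kernel.sysrq value into a short description.
sysrq_describe() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "bitmask $1: only selected functions enabled" ;;
    esac
}

# Read the live value and describe it.
sysrq_status() {
    sysrq_describe "$(cat "$SYSRQ_FILE")"
}

# To enable everything until the next reboot (as root):
#   sysctl -w kernel.sysrq=1
# To apply /etc/sysctl.conf after editing it (as root):
#   sysctl -p
```

Running `sysrq_status` on the Jetson shows whether the line in “/etc/sysctl.conf” is needed for a given release.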

Note that whenever you flush a buffer to solid state memory you are producing “wear”. The more often you sync (“flush”) the lower the lifetime is for your solid state memory. It certainly is not going to change the life to sync right before a shutdown since normal shutdown would do this anyway, and testing this a few times also is not going to wear this out any more than shutting down and restarting would. Just don’t overuse sync.

The reason for using sync twice is that no disk will guarantee that, just because the sync has begun, the sync will have actually completed before forcible loss of power. A while back @snarky mentioned that the second sync will not complete if the first sync is still in progress…the second sync returning implies the first sync actually finished its buffer write (versus simply starting the buffer flush). Following this by the forcible remount to read-only is your best bet. Once that read-only state is enabled you can cut power any time you want and there will be no filesystem corruption issues.

There are other ways to do this, but the Magic SysRq was designed to run in emergency situations, and you wouldn’t need to invent new code. This is probably the fastest way to go read-only. You could use an actual keyboard to input those commands, or even a bash script, and see how much time it takes to complete remount to read-only. Make sure to test when any application you have which writes to disk is actually writing…if nothing is writing at the time of the remount read-only, then you will always get a fast response to remount.
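For the timing test, a bash-style helper along these lines could be used. It is only a sketch: time_ms is a hypothetical name, and it assumes GNU date (for the %N nanosecond format), which Ubuntu ships.

```shell
#!/bin/sh
# Sketch: run a command and print how long it took, in milliseconds.
# Assumes GNU date, which supports %N (nanoseconds).
time_ms() {
    start=$(date +%s%N)
    "$@" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Usage while the application is busy writing:
#   time_ms sync
```

The plain sync command blocks until the flush is done, so `time_ms sync` gives a usable number for how long buffered data takes to reach the disk under a given write load. Repeating the measurement several times under the heaviest expected load should give a better feel for the worst case than a single run.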

Note: There are also Magic SysRq commands to kill processes other than init…this could cause programs which normally write to terminate, and then follow this with sync and there would be less worry about some process doing a lot of writing in the middle of calling sync since those processes would have ended.
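A hedged sketch of that variant follows. SysRq “e” sends SIGTERM to every process other than init, and that includes the shell running the script, so the sketch ignores TERM and HUP first. The function name, the SYSRQ_TRIGGER override and the TERM_GRACE delay (one second by default) are assumptions for illustration and not tested values.

```shell
#!/bin/sh
# Sketch: terminate writers first, then sync and remount read-only.
# Must be run as root when writing to the real trigger file.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
TERM_GRACE="${TERM_GRACE:-1}"   # seconds to let processes exit (a guess)

terminate_then_readonly() {
    # Ignore TERM and HUP, otherwise this shell is stopped by its own "e".
    trap '' TERM HUP
    printf 'e\n' > "$SYSRQ_TRIGGER"   # SIGTERM to all processes except init
    sleep "$TERM_GRACE"               # give writers a moment to exit
    printf 's\n' > "$SYSRQ_TRIGGER"   # sync
    printf 's\n' > "$SYSRQ_TRIGGER"   # sync again
    printf 'u\n' > "$SYSRQ_TRIGGER"   # remount read-only
}
```

This takes down everything in user space, including any SSH session, so it only makes sense as the very last step before power is removed.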


Wow, thank you so much for taking the time to write such a detailed and well thought out response. I’ll pass this to my developer and see if we can get this implemented :)