Jetson Nano corruption on power cycle

We are developing a product that utilizes the Jetson Nano, so while our custom board is in manufacturing, we are developing on the Nano dev platform.

Last night, I removed the barrel connector and plugged it back in to restart the unit. Given this is something a customer is likely to do (remove and add power quickly) we know this is a use case we have to cover.

The Nano never came back up. The green LED on the board would light, but there was no other activity (I didn’t have a serial monitor with me to test).

To recover the Nano, I had to use the sdkmanager and reflash on to the SDCard (I wasn’t sure how to just repair the boot loader which I think might have gotten corrupted)

All that said, now we are nervous about deploying these in the field. If someone cycles the power really fast, are they going to corrupt the unit? Is anyone else seeing this behavior? Is this just a problem with the dev sdcard version and the emmc version won’t do this?

All computers with buffered disk access do this. In this scenario the Nano is no different than your desktop PC. What would happen if you casually unplug and replug your desktop PC while it is running? You’d see a need to repair filesystems, and if the damage exceeds what the ext4 journal can handle, then probably you’d have to drop into a root shell to manually accept the risk of repair. Even if the damage does not exceed the journal (and thus the filesystem is not corrupt), then there would still be files missing based on what the journal corrected. If you wouldn’t do this with a PC you are in the middle of using, then you don’t want to do this with a Jetson.

Although you must use proper shutdown, there are ways to quickly flush the cache and buffers to disk and revert to read-only, but those are normally “manual” methods. Such methods are useful when you know something has already gone wrong, or when you know power is about to be lost no matter what and you don’t have a lot of time for a proper shutdown. So the question is this: Do you have even a few seconds of notice before power is pulled?
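As a rough illustration of the "manual" flush and read-only methods mentioned above, here is a minimal sketch that uses the Linux Magic SysRq trigger file. It assumes the kernel has SysRq enabled and that the script runs as root; the function name and the `trigger_path` parameter are made up for this example (the parameter exists only so the logic can be tried against an ordinary file).

```python
import os

# Standard Linux path for Magic SysRq commands; writing to it needs root
# and a kernel with SysRq support enabled (assumption for this sketch).
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def emergency_readonly(trigger_path=SYSRQ_TRIGGER):
    """Flush dirty buffers, then ask the kernel to remount filesystems read-only."""
    os.sync()  # user-space request to flush filesystem buffers
    keys = ("s", "u")  # 's' = emergency sync, 'u' = emergency remount read-only
    for key in keys:
        # Each command is written as a single character in its own write.
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
    return keys
```

This only helps if something calls it before the rails drop, so it still needs a few moments of warning.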

FYI, all forms of cycling power can do this. You’d also want to be sure to surge protect in some cases as well.

All of this is normal behavior, and is not a defect in any way. Despite being tiny, this is a full computer system. Read and write to the hard drive (or SD card…mass storage in general) is buffered and/or cached. It takes a moment to complete a write. Until that write occurs, the system is vulnerable to loss. One could set the disk up to run synchronously, and no buffering would occur, and then there would be no risk. However, performance would be absolutely terrible.
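To make the synchronous case concrete, here is a small sketch of a write that bypasses the usual deferred write-back by opening the file with `O_SYNC`. The helper name `append_sync` is hypothetical. Even with `O_SYNC`, the storage device may have its own internal cache, so treat this as narrowing the window of loss rather than closing it.

```python
import os


def append_sync(path, data: bytes):
    """Append data using O_SYNC so each write() waits for the kernel to push it out."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

Doing this for every write on the system is what makes performance so poor; doing it for a handful of critical files is usually tolerable.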

For solid state memory the reason for not running synchronously becomes a matter of “life and death” for the memory…wear leveling in solid state memory is necessary since actual writes do some damage. The normal method of having long life in solid state memory is to use buffering which only writes the buffered content to actual persistent memory when mandatory (e.g., there isn’t enough buffer left when someone needs to write something else, so it flushes to make more buffer available for the new operation).

When most people think of appliances which can just have the power cable yanked they are thinking of lower performance (but perhaps more real-time) devices set up to only need to run in RAM after reading from the disk. Those devices are not writing much (if any). Those devices do not run a full Linux operating system.

This answer doesn’t make sense though, because if I then swapped to a new SDCard with a freshly burned OS then it should boot with no problem, but that is not what the system was left in. Somehow, something in memory that exists outside the SDCard became corrupted, which I can only imagine was the boot loader, or some sort of programmable fuse?

Pertaining to buffered I/O and Linux, I have had systems running Linux whose power has been removed abruptly for more than 6 years. NEVER has a single one done this. I have 400+ units in the field using a Raspberry Pi Compute Module, and not a single one has ever corrupted to the point that it broke the boot process.

Nothing belonging to the configured kernel is rewritten when you boot up. There is no reason the bootloader, device tree, or img files would be modified, so there would be nothing to flush. The worst-case scenario is likely logs.

When we deploy a system, we change all the log writing to memory and disable almost all disk writing procedures. When we write to disk, we flush immediately since we do it so rarely, but just to be safe for the user (and all writes are user controlled).
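For readers wondering what "flush immediately" can look like in practice, here is a sketch of one common pattern: write to a temporary file, `fsync` it, rename it over the target, then `fsync` the directory. This is not necessarily what the poster's product does; the function name and the `.tmp` suffix are assumptions for illustration.

```python
import os


def atomic_write(path, data: bytes):
    """Replace path with data so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp name, same directory as target
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # push Python's buffer to the kernel
        os.fsync(tmp.fileno())  # ask the kernel to push it to storage
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    # Sync the directory so the rename itself is recorded on disk (Linux).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If power is lost partway through, the previous version of the file should still be intact, since the rename is the only step that touches it.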

Further, beyond everything I’ve mentioned, the Nano module is designed to handle a sudden power off. If you look at the schematics and the large caps being used, power loss is detected and the system has time to shut down (as specified in the specs for using the module, in the design reference guide).

Whatever happened here, I suspect something allowed the bootloader to get corrupted, and that is my worry. As I said, I’ve cycled Raspberry Pis the same way for years and this has never happened.

Some of the boot content is in the QSPI ROM, which is added during flash. This is not written to during normal running operation, and so this would not have corrupted unless there was some sort of severe power issue. However, if the two SD cards are not both intended to work based on the parameters used during flash of the Jetson itself, then the QSPI ROM content would not be valid for the other SD card. I suspect the failure to boot is not related to corruption during shutdown so much as it is a difference (very slight and not really very easy to find) in what is needed in QSPI boot ROM for the two different SD card cases. If the two SD cards were exact clones, then I can’t see the QSPI boot ROM mattering, and you would be correct about something corrupting.

I can absolutely guarantee that abrupt power removal and the results with a journal based filesystem depends on how much data is outstanding at the moment of power removal. In some cases even a bit of capacitance on the power rail gives a fraction of a second longer to flush buffers, so even that might help. You will never be able to safely yank the plug on a system which buffers/caches to the disk without a risk of losing something. If that something lost was a user document, then you’ll never notice it, but if the lost content is enough to exceed the journal replay capacity, you are guaranteed to lose something. Imagine if you were writing some needed boot content via update at that moment…then you might fail to boot in that case, whereas an office document wouldn’t ever be noticed.

In all cases where buffered/cached disk writes do not complete at the moment of power loss (and in which case there is no special hardware involved, e.g., a hard drive with a built-in backup capacitor), something is lost. Jetsons work exactly like any desktop PC, and the effect has nothing to do with being a Jetson.

Note that many simple microcontrollers don’t buffer/cache. Those cases are quite different from a full computer (and Jetsons are full computers).

When a system seems to be “not writing”, it probably is…just not anything you know about. To actually not be writing requires the filesystem to be mounted read-only if you want a guarantee.
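One way to check whether a "quiet" system is really not writing is to compare two snapshots of `/proc/diskstats`. The sketch below assumes the standard Linux layout, where the third field is the device name and the tenth field is sectors written; the function names are invented for this example.

```python
def parse_sectors_written(diskstats_text):
    """Map device name -> sectors written, from /proc/diskstats-style text."""
    counts = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        # fields[2] is the device name, fields[9] is sectors written
        counts[fields[2]] = int(fields[9])
    return counts


def devices_still_writing(before_text, after_text):
    """Return devices whose sectors-written counter grew between snapshots."""
    before = parse_sectors_written(before_text)
    after = parse_sectors_written(after_text)
    return sorted(
        name for name, sectors in after.items()
        if sectors > before.get(name, 0)
    )
```

In use, you would read `/proc/diskstats`, wait a minute or so while the system is "idle", read it again, and pass both strings in. Any device listed is still being written to.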

You are correct that changing log writing to memory would be a big benefit to sudden power loss. It would also be quite useful to do what you mention, “disable almost all disk writing procedures”. However, did you actually remount the rootfs read-only? If not, then you are missing a step. If the filesystem refuses to remount read-only, then you’ve just proven that something is still writing.
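A quick way to confirm whether the rootfs really is read-only is to look at the mount options for `/` in `/proc/mounts`. This is a minimal sketch that parses that text; the function names are made up here, and it assumes the usual whitespace-separated format (device, mount point, type, options, ...).

```python
def root_mount_options(mounts_text):
    """Return the option list for the "/" mount, or None if it is not listed."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            # Later entries for "/" shadow earlier ones, so keep the last.
            options = fields[3].split(",")
    return options


def root_is_read_only(mounts_text):
    """True only if the effective "/" mount carries the 'ro' option."""
    options = root_mount_options(mounts_text)
    return options is not None and "ro" in options
```

On a live system you would pass in `open("/proc/mounts").read()`. If it reports read-write after you attempted `mount -o remount,ro /`, something still has a file open for writing.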

FYI, unless you’ve added some sort of short emergency backup power, then the Nano is not designed to handle a sudden power off…the hardware will work fine, but the Linux operating system is what you need to worry about. Your use of large caps could in fact work, but unless your system has some sort of detect of power off, combined with a trigger of emergency write/flush, or better yet, all of that plus remounting read-only, then it is only “highly likely to work”, and not “guaranteed”.
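The "detect power off, then trigger an emergency flush" idea can be sketched as a simple watcher loop. Everything here is hypothetical: `power_good` stands in for however your board exposes a power-fail signal (a GPIO, for instance), and `on_power_loss` stands in for whatever flush and remount routine you choose. A real design would more likely use an interrupt than polling, since polling adds latency.

```python
import time


def watch_power(power_good, on_power_loss, poll_interval=0.01, max_polls=None):
    """Poll power_good(); on the first False reading, run on_power_loss() once.

    Returns True if a power loss was handled, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if not power_good():
            on_power_loss()  # e.g. sync and remount read-only
            return True
        polls += 1
        time.sleep(poll_interval)
    return False
```

The time between the signal and the rails collapsing is set by your capacitance, so the handler needs to be short enough to finish inside that window.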

Earlier I mentioned some special case hard drives. There are RAID controller solutions where a similar capacitor is involved, but those systems detect the power loss and run special operations for emergency flush. There is even an ability to use the energy from the spinning platters of a regular mechanical hard drive to work long enough to flush. You won’t find that on an SSD or SD card or eMMC, although enough capacitance will work.

I really want to emphasize the need to remount read-only upon emergency if you want a guaranteed safe shutdown. Capacitors go a long ways towards making that possible, but they are not a complete solution without software or firmware help.

What I would find interesting is to know if there is a way to get a checksum of the QSPI ROM, and compare between what it started with and what it ended up with. Unfortunately, I don’t know if it is possible to dump that ROM for checksum. Someone from NVIDIA might be able to comment on that.
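If a dump of the QSPI content can be obtained at all (I don't know of a supported way), comparing before and after is just a matter of hashing two files. Here is a sketch that computes a SHA-256 over any file in chunks; the function name is made up, and the dump files themselves are an assumption.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The same function works for comparing a cloned SD card image against the original image file.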

Here are the events that happened:

  1. Started Nano with SDCard A, written with etcher, from let’s call it gold.img
  2. Nano boots, gets into ubuntu.
  3. Pull barrel connector, reinsert barrel connector around 2 seconds later.
  4. Green LED lights on board, nothing happens.
  5. Burn new SDCard, same gold.img. Insert, nothing happens.
  6. Try default starting image you download from the Jetson site, nothing.
  7. Try cards in second Nano, they boot fine.
  8. Install sdkmanager into Ubuntu VM on my Mac, flash in recovery mode, system starts up
  9. Insert gold.img card again, works fine.

In this case, something had to corrupt the QSPI boot ROM.

I absolutely agree that in the long run, you don’t want to leave anything writing to disk on a system that can be unplugged without warning. In this system, we cannot be guaranteed right now that something isn’t writing, but, when we deploy, we carefully cull out anything that would do any sort of updating (and any sort of disk writing).

People do embed linux for lots of projects that could be shut off without warning, but in those cases, like us, we prepare for it by setting up the operating system to not be reliant on needing to flush buffers (or, write data needlessly to the SDCard when we do not need any of the logs or anything)

And I do understand the difference in the architecture. I have deployed realtime OS with ARM Cortex M4 and M7 systems, so I get the difference.

Had this been a problem with something happening solely on the SDCard, I would be okay with it. Because I’d understand where that corruption came from. (And, how to mitigate it in the future).

The fact that this killed the QSPI boot ROM, I am trying to determine what could have caused that. Like, is there any sort of auto update that could run? This device had no internet access when this occurred. But, it was running a new SDCard which had a new OS on it compared to the previous SDCard that was in it. Given it booted the first time, is there some process that then broke the QSPI as it was updating something automatically? And it wasn’t done when power was removed?

Since I reflashed the unit to get it working again, that info is lost (the older content) but, if it happens again we can probably revisit.

Pertaining to the power down, I see that the circuitry is set up to assert SHUTDOWN_REQ* if power is lost (which then deasserts POWER_EN). This is figure 5-5 in the product design guide. While I agree that if the amount of stuff to be flushed was too big, nothing could be done, the caps in the schematic are pretty large. I guess I don’t know what happens on the module when this occurs. Does it interrupt the OS and start it shutting down? Or does it know that it is a power loss situation and try to flush things ASAP before all power is gone?

I’m answering these in sequence, so parts will be before you mention items #8 and #9.

Item 3 could conceivably try to restart too soon from a system in some incomplete state. I don’t know, but perhaps waiting till all capacitors have discharged could make a difference to the power bring-up sequence.

Setting up SD cards tends to be error prone for subtle reasons. If you were able to boot both SD cards prior to the issue, then this is good evidence that something may have gone wrong on the Jetson itself. However, if this other SD card was not tested prior to this, then I would not consider this good evidence of failure. In the case that this is good evidence of failure, then I will suggest the QSPI memory has some sort of software corruption, and flash would be needed to the Nano itself, and not just the SD card. There is a slim chance that some actual hardware failure is involved, but if the Jetson can be placed in recovery mode, then I doubt it is a hardware failure.

Considering that you are using pre-designed SD card images, and not using JetPack/SDK Manager, then there is no chance that a corrupt QSPI memory would be corrected. You have to actually flash the Nano itself to set up QSPI memory.

Are you able to put the Nano in recovery mode?

Please note that a VM is not officially supported, although it can be made to work. The most common issue with a VM is that USB repeatedly disconnects and reconnects during a flash, and the parent o/s tends to not pass through the USB upon reconnect. However, if it worked for you, then there shouldn’t be a problem with using a VM.

So yes, you have found evidence that the QSPI memory had changed, but only if the new SD card worked prior to flash. The reason I say this is that each image has a dependency on what software is in the QSPI memory. If the new and old SD cards depended on different software releases, and thus different QSPI releases, then it might have failed even if nothing was corrupt. There’s the important question: Did the replacement SD card work prior to the incident? If so, then you have “smoking gun” evidence of QSPI corrupting.

I’ve not heard of the QSPI corrupting during a sudden power loss. If this is the case here, then it is a rare case. “It shouldn’t happen” probably isn’t much consolation. This is where I would ask NVIDIA if there is a way to take the checksum of the QSPI for later comparison to know if it has changed (if this is possible, then one would need to do this in recovery mode on a Jetson which does not have any of the fuses burned).

As far as lost content goes, if you mean the SD card itself, then this can easily be cloned into a file and the existing content examined or repaired. Should a partition (which is not QSPI) become unmountable due to some sort of filesystem corruption, then this has some steps which would be able to recover a lot of the content.

I don’t know what goes on internally with the Tegra SoC upon SHUTDOWN_REQ. I couldn’t say if this tries to force flush to any mass storage device, but if it doesn’t, then you are back to needing an alternative trigger to flush (remounting read-only would work, the trick is to trigger this prior to the software failing or stopping due to loss of power).

Perhaps NVIDIA could answer this: What happens with persistent memory flush when power loss hits? Is it up to the o/s to trigger flush, or does the SoC itself send out any commands to flush?

Yes, I was able to boot with the same card before the failure. The order was

  1. burn gold.img
  2. boot nano with gold (boots up no problem)
  3. power cycle
  4. boot nano with gold, no activity

Same SDCard in both booting attempts.

So I took the gold card and

  1. put gold SDCard in another Nano, boots up
  2. burn another SDCard with same gold.img, try in original non-working nano, does not boot

So at this point, I feel the QSPI being corrupted makes the most sense.

Here is where I am a little uncertain of what happens in the Nano. The original Nano was running the generic IMG from the Jetson website before I tried the gold img; let’s call that jetson.img. The gold.img was a new version of the OS that we put together because we needed some additional drivers in the device tree, so moving backwards in the timeline:

  1. burn jetson.img
  2. boot nano and run through setup of ubuntu
  3. work this way for about a month, never any issues

Then the steps at the start of this message happened (as I wanted to run a newer image)

Does the QSPI get updated automatically if you try and run a newer OS? Could that update have been happening during the power cycle?

And second question, it would be nice to know what happens when the SHUTDOWN_REQ occurs. Is that something we can trace in the device tree? To see if there is any driver being loaded that handles whatever interrupt that pushes out? Not sure how to get Nvidia to see these questions.

I don’t know the particular details, but I suppose if there is boot content being updated, then boot could be broken by loss of power during such an update. However, so far as I know (and perhaps NVIDIA can verify this), QSPI is not updated via packages, and QSPI only updates in recovery mode flash.

There are things which can (very rarely) alter what is in persistent solid state memory, e.g., literally a cosmic ray might flip a bit once every month (more often at higher altitude), or some fluke corner case might alter something in the QSPI upon a power event, but this is all fairly rare. If the QSPI is more sensitive than most to a power event, then I suppose you could call that specific component “degraded” in comparison to specs.

I couldn’t tell you what happens with the SHUTDOWN_REQ relative to SD card flush, but probably someone from NVIDIA could. You can be sure the messages are seen, but there’s a rather heavy workload in forums, so it might take a bit of time for someone to get to it. Harder questions requiring research might take a bit longer as well.