Xavier AGX eMMC after abrupt power removal?

Hello,

We are wondering if the Xavier eMMC could suffer a problem if power is removed suddenly.

Our design allows an operator to forcibly remove the battery if they try hard enough. We’ve already seen several instances of abrupt battery removal because our testing team can be aggressive with batteries and we have test-cases we exercise.

We expect our in-house Xavier AGX carrier-board design to provide a (very) few milliseconds of power after a battery is suddenly removed. We also ran an interrupt line to the Xavier for notification purposes. We have a separate, large RTC battery, but there’s only a tiny amount of power there.

If the Xavier eMMC was being written to and we have a few (3-8) milliseconds to prevent further enqueuing, is that enough time for the eMMC queues to clear properly? Even if wear-leveling was in progress? Is there a recommended way to measure this? Is there an Nvidia-recommended “abrupt shutdown” mechanism we should be implementing? Is any damage to the eMMC possible?

We have been using Linux “fio” to measure the completion latency of transactions.

Thank you.
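
For the completion-latency measurement mentioned above, a rough host-side cross-check of the fio numbers could look like the sketch below. It is only an illustration: the function names, block size, sample count, and the 3 ms budget (the low end of the 3-8 ms window) are all assumptions. It also only measures what the host sees when `fsync` returns, so it cannot observe wear-leveling or other housekeeping inside the eMMC controller.

```python
import math
import os
import tempfile
import time


def measure_fsync_latency_ms(directory, block_size=4096, count=50):
    """Write `count` blocks, fsync after each one, return latencies in ms."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # returns once the kernel reports the data as flushed
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    BUDGET_MS = 3.0  # assumed budget: low end of the 3-8 ms window
    lat = measure_fsync_latency_ms(".", count=200)
    print(f"p50={percentile(lat, 50):.2f} ms "
          f"p99={percentile(lat, 99):.2f} ms max={max(lat):.2f} ms")
    print("worst case fits budget" if max(lat) <= BUDGET_MS
          else "worst case exceeds budget")
```

Run it from a directory on the eMMC-backed filesystem while the normal workload is active; the worst case matters more than the median for a power-fail budget.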

A related question, since we’re receiving an interrupt when the battery has been removed, is how to effectively disable the Carmel cores, the GPU, and the ISP very quickly (microseconds). We suspect we can route the interrupt to the SPE to have the SPE do realtime management when the power is collapsing. We are not yet using the SPE for anything in our design.

Is there any clear documentation for how to stop the various Xavier cores really quickly?

I only have some partial advice. If you are surge protected, then there should be no hardware damage, but Jetsons are subject to hardware failure with power surges.

In terms of writing to disk, if the amount of data outstanding and not yet flushed is small enough, then you would lose that data upon reboot as the journal replays. However, the filesystem would not be corrupted and would not need manual intervention. I can’t predict how valuable the content lost mid-write would be.

If the amount of outstanding unwritten data is large enough, then the journal would not be able to back out the incomplete changes, and the filesystem could become corrupt and require an fsck operation (probably run manually). This is far more likely if your system goes down under heavy writing than during “normal”, “average”, or “typical” operation.

To guarantee no loss you would have to either (A) operate synchronously (which would vastly decrease performance and wear out the solid state memory much sooner), or (B) flush and remount the filesystem read-only. The latter probably can’t be done in a few milliseconds, but it is pretty much a guarantee that nothing is lost and no harm is done.

This thread contains a part on using Magic SysRq to set up the filesystem read-only:
https://devtalk.nvidia.com/default/topic/1067644/jetson-agx-xavier/jetson-xavier-cloning/post/5424573/#5424573

Magic SysRq is probably the fastest method there is for emergency remount as read-only. Note that if you were to skip the sync part, then you could freeze further changes and reboot sooner, but you’d be back to the part where you might need filesystem repair.
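
As a minimal sketch of the sync-then-remount sequence described above, the snippet below writes the SysRq keys `s` (sync) and `u` (remount read-only) to `/proc/sysrq-trigger`. The function name and the `sync_first` parameter are made up for illustration. It assumes root privileges and a kernel built with Magic SysRq support. As far as I know the kernel only schedules these actions, so the write returning does not prove they have finished.

```python
import os

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def emergency_readonly(trigger_path=SYSRQ_TRIGGER, sync_first=True):
    """Send SysRq 's' (sync) then 'u' (remount read-only); return keys sent."""
    keys = ["s", "u"] if sync_first else ["u"]
    for key in keys:
        # One key per write: open, write a single character, close.
        fd = os.open(trigger_path, os.O_WRONLY)
        try:
            os.write(fd, key.encode("ascii"))
        finally:
            os.close(fd)
    return keys


if __name__ == "__main__":
    # Needs root. This really does remount every filesystem read-only.
    emergency_readonly()
```

Passing `sync_first=False` corresponds to skipping the sync part: faster, but with the filesystem-repair risk noted above.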

If you are surge protected, then there should be no hardware damage, but Jetsons are subject to hardware failure with power surges.

Would you elaborate on this point about power surges? We’ve had a low, but non-zero, number of externally-attached pieces of Xavier SoCs “go bad” during extensive reliability testing.

For the eMMC, there is some amount of wear-leveling happening:

Our concern would be what happens if the eMMC abruptly loses power while updating the wear-level tables which map the flash blocks to the “physical blocks” exposed to the Linux filesystem. It’s possible that there is always enough power left in the system during power collapse for the eMMC to finish writing. Or maybe not.

Hi, in any situation, the custom design should guarantee the power-down sequence (as listed in the OEM DG). This means implementing capacitors like those on the dev kit carrier board so that the power-down sequence can fully complete. As you can see in the chapter “Dv/Dt Circuit Consideration and Power Loss Detection” in the OEM DG, it is important to maintain the necessary supply for the sequence. Beyond that, sudden power loss during a write should also be avoided to prevent data corruption.
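
To get a feel for how much capacitance such a hold-up requirement implies, here is a back-of-the-envelope sketch. It uses the stored-energy relation E = ½·C·(V_start² − V_min²) and divides by a constant load power. Every number in the example (4700 µF, 12 V sagging to 9 V, 30 W) is a hypothetical placeholder, not a value from the OEM DG, and the estimate ignores regulator efficiency and capacitor ESR.

```python
def holdup_time_ms(capacitance_f, v_start, v_min, load_w):
    """Hold-up time in ms for a constant-power load fed by a capacitor bank.

    Usable energy is 0.5 * C * (v_start**2 - v_min**2); dividing by the load
    power gives the time until the rail sags from v_start to v_min.
    """
    if v_min >= v_start:
        raise ValueError("v_min must be below v_start")
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    energy_j = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return energy_j / load_w * 1000.0


if __name__ == "__main__":
    # Hypothetical values only: 4700 uF bank, 12 V rail, 9 V cutoff, 30 W load.
    print(f"{holdup_time_ms(4700e-6, 12.0, 9.0, 30.0):.2f} ms")  # about 4.9 ms
```

Real margins are smaller than this ideal figure, so the result is best treated as an upper bound when sizing the capacitors.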

Software can fail from corruption after a sudden shutdown, but this does not harm hardware. Perhaps software corruption would require flashing again or some sort of rescue operation.

Power surges are spikes in voltage reaching the Jetson. These can actually damage the Jetson. Most likely a surge would damage the power delivery circuits on the carrier board, which would require hardware replacement or repair. A software reflash might also be needed, but replacing the carrier board in such a case might allow the module to just “keep running normally”.

Those surge protector strips you can get for your standard PC are a good idea if you are running your Jetson off of an AC adapter and need reliability. Lightning storms or other conditions could damage the hardware without such a surge strip.

However, manually yanking the power in the wrong way can result in voltage spikes. It depends on the nature of how the power is removed. If there is any kind of inductance in the power delivery to the Jetson, and the power cord is simply yanked, then in some cases the inductance will result in a large voltage spike. Followed by “bad things”…perhaps a fried carrier board.
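
The size of such a spike follows from V = L·di/dt. The sketch below just evaluates that relation for a current that drops linearly to zero. The inductance, current, and interruption time are hypothetical numbers chosen for illustration, not measurements of any Jetson setup.

```python
def inductive_spike_v(inductance_h, current_a, interrupt_time_s):
    """Magnitude of V = L * di/dt for a current falling linearly to zero."""
    if interrupt_time_s <= 0:
        raise ValueError("interrupt_time_s must be positive")
    return inductance_h * current_a / interrupt_time_s


if __name__ == "__main__":
    # Hypothetical: 1 uH of lead inductance, 3 A interrupted in 100 ns.
    print(f"{inductive_spike_v(1e-6, 3.0, 100e-9):.1f} V")  # about 30 V
```

The faster the current is interrupted, the larger the spike, which is why an abrupt disconnect is worse than a controlled shutdown.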

A good power delivery setup, such as a battery that is charged continuously, would eliminate many power-spike failures. If the battery and its regulator perform well, then spikes on the charging circuitry would be isolated, and yanking the power by pulling the wire would have minimal consequences. The reason is that the battery and regulator are non-inductive, and the AC power supply inductance is absorbed by the battery without being passed through.

For a more intuitive description, imagine if you have a desktop PC, and for testing, rather than holding the power button down to shut off, you yank the main power cord. Do that several times and your PC might no longer boot.

Those surge protector strips you can get for your standard PC are a good idea if you are running your Jetson off of an AC adapter and need reliability. Lightning storms or other conditions could damage the hardware without such a surge strip.

No AC needed. Our Jetsons fly. By themselves. Sometimes in rainstorms, too. :)

However, manually yanking the power in the wrong way can result in voltage spikes. It depends on the nature of how the power is removed. If there is any kind of inductance in the power delivery to the Jetson, and the power cord is simply yanked, then in some cases the inductance will result in a large voltage spike. Followed by “bad things”…perhaps a fried carrier board.

Yes, that makes sense. Thanks.

This actually adds a different problem: air moving past metal surfaces can generate static electricity. I used to have a small antenna on a roof, and during windy or lightning conditions, the connector was constantly arcing from the plug’s center conductor over to the shield wire. On the other hand, if this happened in your case, I’m sure it would fry something rather than just causing a bad shutdown.