Jetson AGX eMMC corruption across multiple units

I am seeing eMMC corruption and instability across multiple Jetson AGX units running from the internal eMMC (no SSD/NVMe attached). I am trying to understand whether this behavior is expected given typical eMMC endurance, or whether there is an underlying issue worth investigating.
The units have been used in long-running deployments with continuous application activity, logging, and background processes. Over time, several devices became unstable and now reboot or panic during write-heavy operations (package installation, sync, dpkg, etc.) or fail to boot at all.

Write volume measurement

I collected disk write statistics over a ~24-hour period using cumulative write counters.

Example (one device):

  • Start disk writes: 20.668 MB

  • End disk writes: 8644.556 MB

  • Total writes over ~24 hours: 8623.888 MB

This corresponds to approximately 8.6 GB of host writes per day.
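For anyone who wants to reproduce the measurement, here is a minimal sketch of the delta calculation. The figures above came from iotop; reading the block-layer counter from /proc/diskstats is an alternative I am assuming here, and the device name mmcblk0 and both function names are illustrative only.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text, device):
    """Return the cumulative 'sectors written' counter for one block device.

    diskstats_text is the content of /proc/diskstats; the tenth
    whitespace-separated field of each line is sectors written.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise ValueError(f"device {device!r} not found")


def daily_writes_mb(start_mb, end_mb, elapsed_hours=24.0):
    """Scale the difference of two cumulative counters to a 24-hour volume."""
    return (end_mb - start_mb) * 24.0 / elapsed_hours


if __name__ == "__main__":
    # Counter values from the example device above (MB).
    print(round(daily_writes_mb(20.668, 8644.556), 3))
    # Hypothetical device name; adjust to the actual eMMC node:
    # with open("/proc/diskstats") as f:
    #     mb = sectors_written(f.read(), "mmcblk0") * SECTOR_BYTES / 1e6
```

Sampling the counter at the start and end of the window and passing the real elapsed time keeps the result comparable even when the window is not exactly 24 hours.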

eMMC lifetime estimation

Using a simple endurance estimation model:

Expected lifetime (years) =
((Disk capacity × P/E cycles) / (Daily writes × Write amplification)) / 365

Assumptions:

  • Disk capacity: 32 GB

  • P/E cycles: 1000

  • Write amplification factor: variable (1–4)

Results:

Write Amplification   Estimated Lifetime (years)
1                     ~10.17
2                     ~5.08
3                     ~3.39
4                     ~2.54

Even with moderate write amplification (2–3×), the expected lifetime should still be multiple years, yet I am observing eMMC corruption and boot issues much earlier.
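The estimate can be reproduced with a few lines of Python. This is only a sketch of the simple model stated above, using the measured 8.623888 GB/day and the assumed 32 GB capacity and 1000 P/E cycles; the function name is my own.

```python
def lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amplification):
    """Simple endurance model: total NAND write budget / NAND writes per day."""
    total_budget_gb = capacity_gb * pe_cycles
    nand_writes_per_day_gb = daily_writes_gb * write_amplification
    return total_budget_gb / nand_writes_per_day_gb / 365


if __name__ == "__main__":
    for waf in (1, 2, 3, 4):
        years = lifetime_years(32, 1000, 8.623888, waf)
        print(f"WAF {waf}: ~{years:.2f} years")
```

The model assumes perfectly even wear across the full 32 GB and ignores any over-provisioning or SLC/MLC partitioning inside the device, so real lifetimes may differ in either direction.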

Observed behavior

Across affected units, symptoms include:

  • Sudden reboots or kernel panics during write-intensive operations

  • MMC controller errors reported in kernel logs (e.g., “RED error” events)

  • Failures occurring even after disabling CMDQ / blk-mq and reducing MMC features

  • Inconsistent ability to read eMMC health information (some units allow reading EXT_CSD, others reboot before tools can run)

On at least one unit, mmc-utils reports:

  • LIFE_TIME_EST_TYP_A: 0x01

  • LIFE_TIME_EST_TYP_B: 0x09

  • PRE_EOL_INFO: 0x01
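For reference, these raw values can be decoded as follows. This is a sketch assuming the standard JEDEC eMMC 5.x encoding of the EXT_CSD health fields; which memory regions Type A and Type B cover is vendor-specific, and the function and table names are my own.

```python
def decode_life_time_est(value):
    """Decode a DEVICE_LIFE_TIME_EST_TYP_A/B byte (assumed JEDEC eMMC 5.x encoding)."""
    if value == 0x00:
        return "not defined"
    if 0x01 <= value <= 0x0A:
        return f"{(value - 1) * 10}-{value * 10}% of estimated device life used"
    if value == 0x0B:
        return "exceeded maximum estimated device life"
    return "reserved"


# PRE_EOL_INFO values, again assuming the JEDEC eMMC 5.x definition.
PRE_EOL_INFO = {
    0x00: "not defined",
    0x01: "normal",
    0x02: "warning (80% of reserved blocks consumed)",
    0x03: "urgent",
}

if __name__ == "__main__":
    print("TYP_A:", decode_life_time_est(0x01))
    print("TYP_B:", decode_life_time_est(0x09))
    print("PRE_EOL_INFO:", PRE_EOL_INFO.get(0x01, "reserved"))
```

Under that encoding, Type A at 0x01 is in the 0-10% band, Type B at 0x09 is in the 80-90% band, and PRE_EOL_INFO at 0x01 means reserved blocks are still in the normal range.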

Other units show similar instability but different failure characteristics.

I would like clarification on:

  • Whether the internal eMMC used on Jetson AGX is intended to sustain this amount of writes over multi-year deployments.

  • Typical write amplification factors assumed by NVIDIA for AGX eMMC endurance calculations.

  • Whether these failure signatures are known behaviors of late-life eMMC on Tegra194.

  • Recommended mitigations or design guidance beyond “use external NVMe,” especially for deployed systems.

Hi smartcity.updater,

Are you using the devkit or a custom board with the AGX Xavier module to verify the internal eMMC?
Which JetPack version is in use?

Please share the full log as a file here when you hit the issue.
Can the unit be recovered by re-flashing?

We would also like to know the detailed steps you use to stress-test the eMMC.

Custom board.

Affected versions:
JetPack 4.6 (L4T 32.6.1)
JetPack 4.6.2 (L4T 32.7.2)
JetPack 5.1.1 (L4T 35.3.1)

Reflashing sometimes helps, but usually does not.
It probably depends on whether the eMMC health report shows that the device has not yet reached end of life (EOL).
For example, one unit reports:
PRE_EOL_INFO: 0x01
Life Time Est. Type A: 0x01
Life Time Est. Type B: 0x09

A unit will not reflash at all if the eMMC is completely corrupted.

Attaching different logs.

There is currently no dedicated stress test: the units run in normal operation with the usual software services, and I use iotop -oPa to monitor disk writes.

log1339.txt (236.1 KB)

log1.txt (66.3 KB)

log1461.txt (115.8 KB)