Critical PMIC failure on Jetson Orin NX 16GB and unable to recover

Hi,

I am unable to boot or reflash one of my Jetson Orin NX 16GB devices, which seems to be caused by a hardware failure. Could you shed some light on what could have caused this (a transient, ..), and on whether there is any way to recover this module?

Unable to boot

After we re-plugged one of our custom carrier boards, the Jetson Orin suddenly got stuck in a PMIC failure. A snippet of the log follows (reproducible on every boot), and the full log is attached after it.

[0000.762] I> Task: Update FSI SCR with thermal fuse data
[0000.767] I> Task: Enable WDT 5th expiry
[0000.771] I> Task: I2C register
[0000.774] I> Task: Set I2C bus freq
[0000.777] I> Task: Reset FSI
[0000.780] I> Task: Pinmux init
[0000.783] I> skipped mmio_addr = 0x9240008
[0000.787] I> skipped mmio_addr = 0x9240000
[0000.791] I> skipped mmio_addr = 0x9240010
[0000.795] I> skipped mmio_addr = 0x9240018
[0000.799] I> skipped mmio_addr = 0x9240020
[0000.803] I> skipped mmio_addr = 0x9240030
[0000.807] I> skipped mmio_addr = 0x9240028
[0000.811] I> skipped mmio_addr = 0x9240038
[0000.815] I> skipped mmio_addr = 0x9240040
[0000.818] I> skipped mmio_addr = 0x9240048
[0000.822] I> skipped mmio_addr = 0x9241000
[0000.826] I> skipped mmio_addr = 0x9241008
[0000.830] I> skipped mmio_addr = 0x9241010
[0000.834] I> skipped mmio_addr = 0x9241018
[0000.838] I> skipped mmio_addr = 0x9241020
[0000.842] I> skipped mmio_addr = 0x9241028
[0000.846] I> skipped mmio_addr = 0x9241030
[0000.850] I> skipped mmio_addr = 0x9241038
[0000.854] I> skipped mmio_addr = 0x9241040
[0000.858] I> skipped mmio_addr = 0x9242000
[0000.862] I> skipped mmio_addr = 0x9242008
[0000.866] I> Task: Prod config init
[0000.869] I> Task: Pad voltage init
[0000.872] I> Task: Prod init
[0000.875] I> Task: Program rst req config reg
[0000.879] I> Task: Common rail init
[0000.883] I> DONE: Thermal config
[0000.887] W> DEVICE_PROD: module = 13, instance = 4 not found in device prod.
[0000.895] E> I2C: Timeout while polling for bus clear. Last value 0x00000000.
[0000.903] E> I2C: Failed to clear bus for instance 4.
[0000.909] W> DEVICE_PROD: module = 13, instance = 4 not found in device prod.
[0000.916] E> I2C: Failed to clear bus for instance 4.
[0000.921] E> I2C_DEV: Failed to initialize instance 4.
[0000.925] E> pmic: Can’t get handle to i2c device @4
[0000.931] E> PMIC_CONFIG: Failed to initialize Rail: SOC rail config.
[0000.937] C> Task 0x2b failed (err: 0x57571c1c)
[0000.942] E> Top caller module: PMIC_CONFIG, error module: PMIC_CONFIG, reason: 0x1c, aux_info: 0x1c
[0000.951] C> Boot Info Table status dump :
011111110011100011111111111111111111111111100000000000000000000000000000000000011111

boot-failure.log (9.6 KB)
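For anyone triaging a similar log, here is a minimal sketch of how the warning, error and critical lines could be pulled out of the capture. It assumes every line follows the `[timestamp] LEVEL> message` layout seen above; the names `triage` and `split_err` are hypothetical helpers, not part of any NVIDIA tool. The byte split of the error word is only an inference from the log itself, where `0x57571c1c` sits next to `reason: 0x1c, aux_info: 0x1c`; I have not found this layout documented.

```python
import re

# Matches lines such as "[0000.895] E> I2C: Timeout while polling ..."
LINE_RE = re.compile(r"^\[(\d+\.\d+)\]\s+([IWEC])>\s+(.*)$")

def triage(log_text, levels=("E", "C")):
    """Return (timestamp, level, message) tuples for the given severities."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(2) in levels:
            hits.append((float(m.group(1)), m.group(2), m.group(3)))
    return hits

def split_err(code):
    """Split a 32-bit error word into four bytes, most significant first.

    Assumption: the bytes map to (top caller, error module, reason, aux_info),
    which is inferred from the log above rather than from documentation.
    """
    return [(code >> shift) & 0xFF for shift in (24, 16, 8, 0)]

# A few lines copied from the log above.
sample = """\
[0000.879] I> Task: Common rail init
[0000.887] W> DEVICE_PROD: module = 13, instance = 4 not found in device prod.
[0000.895] E> I2C: Timeout while polling for bus clear. Last value 0x00000000.
[0000.903] E> I2C: Failed to clear bus for instance 4.
[0000.937] C> Task 0x2b failed (err: 0x57571c1c)
"""

for ts, level, msg in triage(sample):
    print(f"{ts:8.3f} {level} {msg}")

print([hex(b) for b in split_err(0x57571C1C)])
```

On this log the first error is the I2C bus-clear timeout on instance 4, and everything after it (the PMIC handle failure and the failed task 0x2b) follows from that.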

Unable to recover

After this boot failure, the Jetson Orin NX can no longer enter Recovery Mode. When I pull FORCE_RECOVERY_N to GND and power the Jetson Orin NX, the module prints no logs (which normally indicates recovery mode), yet the USB device does not show up in lsusb on the host, and dmesg shows no events either.
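The host-side check described above could be scripted roughly as follows. This is a sketch under stated assumptions: a Linux host with `lsusb` installed, and a module in recovery mode enumerating under NVIDIA's USB vendor ID `0955`. The function names are hypothetical, and no product ID is checked because it differs between module variants.

```python
import re
import subprocess

NVIDIA_VENDOR_ID = "0955"  # assumption: NVIDIA's USB vendor ID

# Typical lsusb line: "Bus 001 Device 005: ID 1234:abcd Some description"
LSUSB_RE = re.compile(
    r"^Bus (\d+) Device (\d+): ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\s*(.*)$"
)

def find_nvidia_devices(lsusb_output):
    """Return ("vid:pid", description) for every NVIDIA device in lsusb text."""
    devices = []
    for line in lsusb_output.splitlines():
        m = LSUSB_RE.match(line.strip())
        if m and m.group(3).lower() == NVIDIA_VENDOR_ID:
            usb_id = f"{m.group(3).lower()}:{m.group(4).lower()}"
            devices.append((usb_id, m.group(5)))
    return devices

def poll_host():
    """Run lsusb on this host and return the NVIDIA devices it reports."""
    result = subprocess.run(
        ["lsusb"], capture_output=True, text=True, check=True
    )
    return find_nvidia_devices(result.stdout)
```

Calling `poll_host()` after powering the module with FORCE_RECOVERY_N held low should return at least one entry on a healthy module; on the failed module described here it would return an empty list.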

Is this a known issue? I see that a similar topic was closed before: https://forums.developer.nvidia.com/t/unable-to-get-into-force-recovery-mode-orin-nx-8gb/315051.

Any help in the matter would be greatly appreciated.

Sincerely,

Just to clarify: can this issue not be reproduced on the NVIDIA devkit?

Hi,

I can confirm that the issue reproduces identically on the Jetson Orin Nano devkit.

Which JetPack release are you using?

The latest one: JetPack 6.2.1 (Jetson Linux 36.4.4).

I don’t think this is a known software issue; it looks like a hardware problem.

If this issue is a hardware failure of the NVIDIA SoM, what mitigation or prevention measures would you recommend?

Additionally, are there any known causes for this behavior when the design guide has been followed correctly?

Did you use this module before, with the error appearing suddenly, or is it a brand-new SoM?

We successfully used this SoM for at least a month before this issue suddenly occurred.

Hi,

To clarify: we put another SoM in the exact same carrier board where the previous SoM suddenly failed, and that one still seems to work (so the problem is likely not related to our carrier board). Could you please elaborate on the issue and on whether a known solution exists to prevent or resolve it?