We are experiencing a critical issue on our devices equipped with the Xavier AGX 32GB, which seems to affect only the modules containing Hynix DRAM.
During the past week we have observed some of our recently built devices containing the NVIDIA Xavier AGX 32GB intermittently failing during boot, leaving some customers unable to use their device. We managed to reproduce the issue in our lab. Rebooting the device multiple times may clear the issue, but often it does not, and the device is unrecoverable. We have seen this issue both on devices running JP5 and on devices running JP4 with the patch provided by NVIDIA to support the Hynix memory.
Based on our investigations to date, it seems the failure happens only on the Xavier modules containing the Hynix DRAM. Swapping the Xavier module for one with the Micron DRAM solves the issue.
We have hooked up the devices to a UART cable to gather more debug information. On most occasions the affected units are not able to boot past the 2nd stage of the bootloader.
Further investigation exposed that the Xavier module is shutting down shortly after powering up our board.
Power rails and power sequence to the module were checked.
When measured with an oscilloscope, 12V to SYS_VIN_HV and 5V to SYS_VIN_MV levels look good during the boot process. No noticeable differences between Micron based and Hynix based modules.
Note that we are using the EFM8BB21F16I microcontroller to handle the power sequencing, with firmware as provided by Nvidia and pinout according to the devkit reference schematics.
After SYS_VIN_HV and SYS_VIN_MV reach their respective levels, VDDIN_PWR_BAD_N is deasserted, and soon after that MODULE_POWER_ON is asserted.
Then, after roughly 90 ms, on Hynix-based modules CARRIER_PWR_ON is asserted only briefly before being deasserted again, which triggers the power sequencing uC to deassert VIN_PWR_ON and stops the boot process. On Micron-based modules CARRIER_PWR_ON stays asserted and the boot process proceeds as expected.
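For anyone who wants to screen captures for the same symptom, here is a minimal sketch of how the brief CARRIER_PWR_ON pulse could be flagged automatically. It assumes the logic analyzer capture has been exported as a list of (time in microseconds, logic level) edges for a single signal. The function name, the 1 ms threshold and the sample edge lists are all hypothetical and only illustrate the pattern described above; they are not measured data.

```python
from typing import List, Tuple

# One edge = (time in microseconds, new logic level 0 or 1).
Edge = Tuple[int, int]


def find_short_pulses(edges: List[Edge], max_width_us: int) -> List[Tuple[int, int]]:
    """Return (start_us, width_us) for every high pulse shorter than max_width_us.

    A signal that goes high and never goes low again is not reported,
    because it has no measurable width within the capture.
    """
    pulses = []
    rise_time = None
    for time_us, level in edges:
        if level == 1 and rise_time is None:
            rise_time = time_us  # rising edge: remember when the pulse started
        elif level == 0 and rise_time is not None:
            width = time_us - rise_time  # falling edge: pulse is complete
            if width < max_width_us:
                pulses.append((rise_time, width))
            rise_time = None
    return pulses


# Hypothetical edge lists for CARRIER_PWR_ON, timed from MODULE_POWER_ON asserting.
# The 400 us pulse width is made up for illustration only.
failing_capture = [(0, 0), (90_000, 1), (90_400, 0)]
healthy_capture = [(0, 0), (90_000, 1)]

print(find_short_pulses(failing_capture, max_width_us=1_000))
print(find_short_pulses(healthy_capture, max_width_us=1_000))
```

A non-empty result marks a capture where the signal dropped again shortly after rising, which is the behaviour seen on the failing modules. The threshold should be chosen from real measurements.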
Questions we have:
Did anyone experience any similar issues wherein Xaviers with the Hynix DRAM display bootup problems, which are fixed by swapping to a Xavier with the Micron DRAM?
Is there any timing or trace impedance/capacitance related aspect on the Xaviers containing the Hynix memories which would make them more susceptible to tolerance ranges of synchronization or matching?
Is there a hard relationship between serial numbers of Xavier AGX modules which use Micron memory versus the ones using Hynix memory? For example, for some data we have received at the beginning of this year during the cutover, it seems that devices using Micron memory have serial numbers starting with 1423- and Hynix based modules starting with 1421-. So far, all units experiencing the described issue have a serial number that starts with 1421.
Assuming that:
the power rails providing the SYS_VIN_HV and SYS_VIN_MV are providing enough power and at the correct levels
VDDIN_PWR_BAD_N is correctly deasserted
MODULE_POWER_ON is correctly asserted
What could cause CARRIER_POWER_ON to be deasserted by the Xavier SoM?
Can the module still be flashed with the latest JetPack from SDK Manager? Hynix DRAM has been supported since around 32.7.2. How did you use it on rel-32.4.4?
From a hardware perspective, this looks more like an issue with your custom design, possibly caused by insufficient design margin. Could you probe and share a full power-on sequence as listed in the Design Guide, and compare it to that of the devkit? In addition, is there any device on your carrier that is powered before the module powers on? That could change the state of the shared interface pins during power-on and so might cause a boot failure.
We’re narrowing down this issue. It seems that something is going on around the AND gate on our board that is used to combine the OVERTEMP_N and VIN_PWR_BAD_N signals. We’ll keep you posted on the outcomes.
One question we would still greatly appreciate an answer to in order to aid our containment and validation strategy is whether there is a hard relationship between serial numbers of Xavier AGX modules which use Micron memory versus the ones using Hynix memory?
For example, for some data we have received at the beginning of this year during the cutover, it seems that devices using Micron memory have serial numbers starting with 1423- and Hynix based modules starting with 1421-. So far, all units experiencing the described issue have a serial number that starts with 1421.
The part number can’t tell you that. The 142- prefix is just our product serial number; the memory info is not included in it. You can share the part number here, and we can help check what the memory is.
Also shared via email, and sharing it here for completeness’ sake:
As you have confirmed earlier, the serial number is apparently not related to whether a module contains Hynix or Micron DRAM. However, there does seem to be something else in that range/batch that makes our board more susceptible. Our measurements currently lead us to suspect that the 1.8V level on the OVERTEMP output is slightly lower than on earlier modules, on the order of tens to 100 millivolts. That is not much and still well within specifications for 1.8V CMOS, but just enough to trigger the issue we observe.
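To make the margin argument concrete, below is a small sketch of the arithmetic involved. It compares a slightly reduced output high level against the minimum input high voltage (V_IH min) of the receiving gate. The 1800 mV nominal level comes from the description above. The V_IH value is a placeholder assumption and must be replaced with the figure from the actual AND gate datasheet at the supply voltage used on the board.

```python
def high_level_margin_mv(v_out_high_mv: int, v_ih_min_mv: int) -> int:
    """Margin between a driver's high level and the receiver's V_IH(min).

    Positive or zero: the receiver is guaranteed to read a logic high.
    Negative: a logic high is no longer guaranteed.
    """
    return v_out_high_mv - v_ih_min_mv


NOMINAL_HIGH_MV = 1800  # nominal 1.8V level on the OVERTEMP output
V_IH_MIN_MV = 1750      # PLACEHOLDER, not from any datasheet: use the real gate value

# Drops in the range mentioned above (tens to 100 mV below nominal).
for drop_mv in (0, 10, 50, 100):
    margin = high_level_margin_mv(NOMINAL_HIGH_MV - drop_mv, V_IH_MIN_MV)
    status = "guaranteed high" if margin >= 0 else "not guaranteed high"
    print(f"drop {drop_mv:>3} mV -> margin {margin:>4} mV ({status})")
```

With a receiver threshold close to the nominal level, a drop of only tens of millivolts is enough to use up the margin, which would match a failure that only shows on some modules.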
Without further ado, a number of the S/Ns of modules that we have found to trigger this issue on our devices: