We are experiencing a critical issue on our devices equipped with the Xavier AGX 32GB, which seems to affect only the modules containing Hynix DRAM.
During the past week we have observed some of our recently built devices containing the NVIDIA Xavier AGX 32GB intermittently failing during boot, leaving some customers unable to use their device. We managed to reproduce the issue in our lab. Rebooting the device multiple times may clear the issue, but often it does not and the device is unrecoverable. We have seen this issue both on devices running JP5 and on devices running JP4 with the patch provided by NVIDIA to support the Hynix memory.
Based on our investigations to date, the failure seems to happen only on Xavier modules containing Hynix DRAM. Swapping the Xavier module for one with Micron DRAM solves the issue.
We connected a UART cable to the devices to gather more debug information. On most occasions the affected units are not able to boot past the 2nd stage of the bootloader.
Further investigation showed that the Xavier module shuts down shortly after our board is powered up.
We checked the power rails and the power sequence to the module.
When measured with an oscilloscope, the 12V to SYS_VIN_HV and 5V to SYS_VIN_MV levels look good during the boot process. There are no noticeable differences between Micron-based and Hynix-based modules.
Note that we are using the EFM8BB21F16I microcontroller to handle the power sequencing, with firmware as provided by NVIDIA and pinout according to the devkit reference schematics.
After SYS_VIN_HV and SYS_VIN_MV reach their respective levels, VDDIN_PWR_BAD_N is deasserted, and soon after that MODULE_POWER_ON is asserted.
Then, after some time (roughly 90 ms), on Hynix-based modules CARRIER_PWR_ON is asserted only briefly before being deasserted again, which triggers the power sequencing microcontroller to deassert VIN_PWR_ON and stops the boot process. On Micron-based modules CARRIER_PWR_ON stays asserted and the boot process proceeds as expected.
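In case it helps anyone compare captures, here is a minimal Python sketch of how the sequence described above could be checked on an exported logic analyzer or scope trace. The edge format (timestamp in ms, signal name, logic level) is a hypothetical one chosen for illustration, the timestamps are synthetic rather than measured, and MODULE_POWER_ON and CARRIER_PWR_ON are assumed to be active-high.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical capture format: (timestamp in ms, signal name, level after the edge)
Edge = Tuple[float, str, int]


@dataclass
class SequenceReport:
    module_on_ms: Optional[float] = None    # MODULE_POWER_ON first goes high
    carrier_on_ms: Optional[float] = None   # CARRIER_PWR_ON first goes high
    carrier_off_ms: Optional[float] = None  # CARRIER_PWR_ON drops again, if it does

    @property
    def carrier_dropped(self) -> bool:
        return self.carrier_off_ms is not None

    @property
    def carrier_pulse_ms(self) -> Optional[float]:
        # Width of the brief CARRIER_PWR_ON pulse; None if it never dropped
        if self.carrier_on_ms is None or self.carrier_off_ms is None:
            return None
        return self.carrier_off_ms - self.carrier_on_ms


def analyze(edges: List[Edge]) -> SequenceReport:
    """Walk the edges in time order and record the events of interest."""
    report = SequenceReport()
    for t, name, level in sorted(edges):
        if name == "MODULE_POWER_ON" and level == 1:
            if report.module_on_ms is None:
                report.module_on_ms = t
        elif name == "CARRIER_PWR_ON":
            if level == 1 and report.carrier_on_ms is None:
                report.carrier_on_ms = t
            elif (level == 0 and report.carrier_on_ms is not None
                  and report.carrier_off_ms is None):
                report.carrier_off_ms = t
    return report


# Synthetic edges for illustration only (not measured data)
failing = [
    (0.0, "VDDIN_PWR_BAD_N", 1),   # deasserted (active-low signal goes high)
    (5.0, "MODULE_POWER_ON", 1),
    (95.0, "CARRIER_PWR_ON", 1),
    (97.0, "CARRIER_PWR_ON", 0),   # brief pulse, then dropped
]
healthy = failing[:3]              # CARRIER_PWR_ON stays asserted

bad = analyze(failing)
good = analyze(healthy)
print("failing:", bad.carrier_dropped, bad.carrier_pulse_ms)
print("healthy:", good.carrier_dropped, good.carrier_pulse_ms)
```

Running the same check on captures from a Micron-based and a Hynix-based module would give a comparable pulse width and delay figure for each, which may be easier to share than screenshots.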
Questions we have:
Has anyone experienced similar issues in which Xavier modules with Hynix DRAM show boot problems that are fixed by swapping to a Xavier module with Micron DRAM?
Is there any timing-related or trace impedance/capacitance-related aspect of the Xavier modules with Hynix memory that would make them more sensitive to tolerances in synchronization or matching?
Is there a hard relationship between the serial numbers of Xavier AGX modules and whether they use Micron or Hynix memory? For example, from some data we received at the beginning of this year during the cutover, it seems that modules using Micron memory have serial numbers starting with 1423- and Hynix-based modules have serial numbers starting with 1421-. So far, all units experiencing the described issue have a serial number that starts with 1421.
Given that:
- the power rails feeding SYS_VIN_HV and SYS_VIN_MV are providing enough power and are at the correct levels
- VDDIN_PWR_BAD_N is correctly deasserted
- MODULE_POWER_ON is correctly asserted
what could cause CARRIER_POWER_ON to be deasserted by the Xavier SoM?
Thanks in advance for your support!