Xavier NX randomly reboots

Hi, I purchased a Xavier NX and it wouldn’t boot. I replaced it and got a new one, but I am having similar issues (i.e., it randomly reboots), and it took more than 2 hours to boot the first time. I would be surprised if I got 2 faulty Xaviers, but I am not sure why this is happening and would love some help here.

Thank you!

Could you monitor the serial console log and see what is printed right before the reboot?

https://elinux.org/Jetson/General_debug

Hi, we have similar issues with some Xavier NX devices. It often happens while pulling large image layers (> 660 MB) from Docker Hub, mostly during the extraction step. Unfortunately, nothing is printed on the serial console; it just reboots. The SD card has enough space and there is also enough free memory.

We also encountered the problem on multiple devices, so I don’t think they are all faulty.

@jeremyfraenkel: Were you able to solve the issue with your devices?
@WayneWWW: Any idea what could be the reason for this and how we can further debug the issue?

Hi,

If it does not print anything, then this is probably a power supply issue.

Are you reproducing this issue on an NV devkit?

Hi WayneWWW,

Yes, it is happening on the Jetson Xavier NX devkit. Maybe it has something to do with the SD card. We experience the issue mostly with the Samsung EVO Plus 32GB (MB-MC32GA/EU) SD card. We have now tried the WD Purple 32GB (WDD032G1P0C), which seems to behave better.
But if it were the SD card, I would expect some error output on the serial console. We also checked for voltage drops from the power supply, but it looks fine.

Do you have any idea how to further debug this?

If this is the NX devkit, are you able to reproduce this issue on every NX you have?

Could you share how to reproduce your error so I can try it on my device?

We can reproduce this issue with multiple devkits equipped with the Samsung EVO Plus 32GB. With devkits equipped with the WD Purple, we haven’t seen this problem yet.

As mentioned before, it happens while pulling large Docker images. It doesn’t happen every time, so you may have to repeat it. For example, you could run the following commands multiple times:

docker pull 3dvl/hemistereo-base-dev:jetson-xavier-nx-r32.4-latest
docker system prune -a

The device reboots mostly during the extraction step of some layers.
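
Since the reboot is intermittent, repeating the two commands in a loop may save some typing. This is only a minimal sketch: the `DOCKER` variable and the `repro_loop` function are names made up for illustration, and `-f` is added to `docker system prune` so that it does not stop to ask for confirmation on every pass.

```shell
#!/bin/sh
# Sketch only: DOCKER and repro_loop are illustrative names, not from the thread.
# DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"
IMAGE="3dvl/hemistereo-base-dev:jetson-xavier-nx-r32.4-latest"

# repro_loop COUNT: pull the image and prune it again, COUNT times.
repro_loop() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i of $count"
        "$DOCKER" pull "$IMAGE" || return 1
        # -a removes all unused images; -f skips the confirmation prompt
        "$DOCKER" system prune -a -f || return 1
        i=$((i + 1))
    done
}
```

Calling `repro_loop 10` after sourcing the file would run ten pull/prune cycles. If the device reboots, the last "iteration" line printed shows how far it got.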

Just speculation: I think the 970 EVO Plus is a very fast SSD, and the amount of power it requires will depend on what operations it is performing, especially write bandwidth. Do you happen to know if the SSD is running extensive operations at the moment of reboot? Perhaps it just pulls a bit too much power during those operations.

Hi linuxdev,
it happens with the Samsung EVO Plus SD card, not an SSD.

I was not aware that there was an SD card with that name; I was thinking of the M.2 slot. However, the speculation would be the same: under high load (e.g., long reads or writes), it is possible that power consumption goes up due to the 970 EVO. Things would normally work well, but if the load pushes power consumption over some limit, the device could reboot. I wouldn’t know how to test that, though, other than by asking whether you think there might be heavy read/write activity on the 970 EVO at the time of failure. Heat could also be a factor: just by touch of a finger, is the 970 EVO warmer than normal at the time of failure? It’s very close to other components.
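
One rough way to see whether there is heavy write activity right before a reboot might be to sample the kernel's per-device counters in `/proc/diskstats`. The following is only a sketch: the function names are made up for illustration, and the device name has to be checked on the actual system (for example with `lsblk`); `mmcblk0` for an SD card or `nvme0n1` for an NVMe SSD are typical on Linux but are assumptions here.

```shell
#!/bin/sh
# Sketch only: function names and device names are illustrative assumptions.

# sectors_written DEV [STATS_FILE]: print the cumulative sectors-written
# counter (10th field of a /proc/diskstats line) for block device DEV.
sectors_written() {
    awk -v dev="$1" '$3 == dev { print $10 }' "${2:-/proc/diskstats}"
}

# log_writes DEV SAMPLES [STATS_FILE]: print one "epoch-seconds counter"
# line per second, SAMPLES times.
log_writes() {
    n=0
    while [ "$n" -lt "$2" ]; do
        printf '%s %s\n' "$(date +%s)" "$(sectors_written "$1" "${3:-/proc/diskstats}")"
        sleep 1
        n=$((n + 1))
    done
}
```

Running something like `log_writes mmcblk0 600` in an SSH session from another machine keeps the last lines visible after the device reboots. A counter that climbs quickly in the final samples would at least be consistent with the heavy-write theory, although it would not prove a power problem.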

@linuxdev I have a custom baseboard that sometimes reboots on its own when I plug in a device such as a USB or HDMI device. As I have a Samsung EVO Plus SSD + fan + USB-C + HDMI, perhaps the power consumption is going over a limit. Might it help if I bought a 5 A power supply and increased the max current of the NX to 5000 mA rather than the default limit of something like 3600 mA?

Is there a less “power hungry” M.2 SSD you can recommend?

Is it correct that the Samsung EVO is powered through some connection to the carrier board? If so, then this is probably still an issue even if you increase the power available to the Jetson itself (though increasing the power available to the Jetson is a good idea). Different power delivery methods (e.g., via USB or via the PCIe socket) have limits, and if those limits are being reached, then perhaps this would cause a reboot. I think in theory this would not be a problem per se, but it would be a very good test if you could actually power the SSD externally. If not, then the higher-current power supply is likely the next best test. Note that if regulation is the problem, then there could be a reboot even if the power supply has sufficient average power; a supply with a higher maximum output current would likely take less of a hit to regulation stability than one with a lower maximum current.

I do not have any SSD recommendations, but other people on this forum can probably make suggestions. The catch here is that you need the M.2 form factor.