My Jetson Xavier NX eMMC module crashes when I touch or move the physical board. It also sometimes crashes spontaneously.
Setup:
JetPack 4.6 with desktop removed
Standard Dev Kit (reproduced the issue on several dev kits)
static aluminum fin heat sink - no fan connected
1 TB SSD mounted at /mnt/ssd/ using /etc/fstab
rc.local configured to increase the USB buffer size using:
sudo sh -c 'echo 1000 > /sys/module/usbcore/parameters/usbfs_memory_mb'
Logs were generated via the serial console and the dmesg --follow command. I’m new to capturing these logs, so some of them may not be complete.
We have several modules and dev kits, and this is the only module that is experiencing this issue. The issue has been documented on multiple dev kits. The module does have a different thermal gap pad material between the chip and the heatsink, but the material has been tested and is not conductive.
Any advice on determining whether this is a hardware issue would be appreciated. I am about to reflash the board to see if the issue persists.
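On the point about incomplete logs: one possible approach is to write each kernel log line to a file and flush it to disk immediately, so the last lines before a crash are more likely to survive. The sketch below is only an illustration. The helper name capture_log and the output path under /mnt/ssd/ are assumptions, not anything from NVIDIA tooling, and flushing after every line is slow.

```shell
#!/bin/sh
# Hypothetical helper: read log lines from stdin, append each one to the
# file given as $1, and flush to disk after every line.
capture_log() {
    out="$1"
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$out"
        sync    # flush filesystem buffers so the line is on disk before a crash
    done
}

# Example usage on the Jetson (the log path is an assumption):
#   sudo dmesg --follow | capture_log /mnt/ssd/dmesg-capture.log
```

Logging over the serial console from a separate host remains the more reliable option, since it does not depend on the Jetson's own storage staying up during the crash.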
I’m also wondering if you are using one of the power supplies provided by NVIDIA, or something else? I could see these causing a problem even if the hardware is ok:
Static electricity.
Some sort of ground loop causing a change in power delivery.
The board seems overly sensitive, as if nothing more than the capacitance of being near it is enough to cause a problem (which could be an actual hardware failure if it is that sensitive, but I suggest first checking whether power delivery is correct).
I have used multiple power supplies provided by NVIDIA, and verified that the same power supplies and dev kit carrier boards work with other modules.
I have now reflashed the module twice with two different backup images (which work on our other modules) and still see the issue.
It is possible that there was some hardware damage when the module was being shipped. Do you have any advice on how to further narrow down the problem to determine what might have caused it?
Does this issue appear on just the one module? If changing supplies and trying other units results in just that one unit failing, then it is probably RMA time. If any of the other units also have this problem, then it could literally be the wiring of the power socket having incorrect ground setup (there are inexpensive home power socket testing devices to say if it is wired correctly). Anything that is occurring with just the one unit means it is very likely hardware.
I don’t think shipping would cause this. It could affect an electrolytic capacitor if in shipping the unit were to freeze at extremely cold temperatures. Electrolytic capacitors themselves have limited life. But I don’t think there are any electrolytic capacitors on a Jetson.
Yes, so far the issue is limited to one module out of five that we have worked with. The only difference with this module was that it was shipped across the country, we used a different thermal gap pad on the chip to sink to the fins, and we flashed using a new backup image. But I reverted to the old image and still see the issue.
I’m guessing NVIDIA will recommend RMA, but it isn’t something I can be certain of. The only thing I can think of is that if the thermal pad were too thick, and tightening the cooling hardware down added enough torque, it might disturb the contacts. If you’ve glued this on, or not applied a lot of pressure on the mounting points, then it is doubtful that this is related. Maybe a bent pin or marginal connector contact exists, though that is really stretching for an answer. It could even be a cold solder joint on a ground.