Hoping for troubleshooting tips for Xavier NX PoE devices exhibiting intermittent eth0 behavior and sudden reboots

We’re running Xavier NX SoMs on JetPack 4.4.1, flashed via massflash. The devices are in aluminum enclosures, powered by PoE through a custom carrier board, and run OAK-D cameras over USB3.

No issues were found in testing before assembly. I confirmed that the switches we use (Unifi) shut the port down when the load exceeds the PoE+ 25.5 W power spec, which can be triggered by putting a heavy load across NVMe + GPU + USB. I’m pretty confident we’re not hitting that condition here. The assembled devices are showing two kinds of issues:

  1. eth0 goes down, then comes back up at some random point afterward. The kernel logs: Mar 15 22:15:08 localhost kernel: [ 2325.075853] eqos 2490000.ether_qos eth0: Link is Down

  2. Some devices shut down and restart at random, with no hint of a problem in syslog or kern.log; the log simply picks up again from a fresh boot.
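
To put numbers on both symptoms, here is a minimal sketch that scans kern.log for eth0 link transitions and counts boots. It assumes every kernel line carries the bracketed uptime timestamp shown in the message above, and that the link-up message starts with "Link is Up" (an assumption, since only the Down message is quoted here). A boot is inferred whenever that timestamp jumps backwards by more than a second, which is a heuristic rather than a guarantee. The function name `scan_kern_log` is made up for illustration.

```python
import re

# Matches lines such as:
#   Mar 15 22:15:08 localhost kernel: [ 2325.075853] eqos 2490000.ether_qos eth0: Link is Down
KERNEL_LINE_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)$")
# Assumption: the link-up message has the same shape, starting with "Link is Up".
LINK_RE = re.compile(r"\beth0: Link is (Up|Down)")


def scan_kern_log(lines):
    """Return (outages, boots).

    outages: list of (down_s, up_s) pairs in seconds since boot; up_s is None
             if the link never came back before the next boot or end of log.
    boots:   number of boots seen, inferred from the uptime timestamp
             dropping by more than one second (heuristic).
    """
    outages = []
    boots = 0
    last_ts = None
    down_since = None
    for line in lines:
        match = KERNEL_LINE_RE.search(line.rstrip("\n"))
        if match is None:
            continue
        ts = float(match.group(1))
        if last_ts is None or ts + 1.0 < last_ts:
            boots += 1
            if down_since is not None:
                # The device rebooted while the link was still down.
                outages.append((down_since, None))
                down_since = None
        last_ts = ts
        link = LINK_RE.search(match.group(2))
        if link is None:
            continue
        if link.group(1) == "Down":
            if down_since is None:
                down_since = ts
        elif down_since is not None:
            outages.append((down_since, ts))
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))
    return outages, boots
```

Feeding it `open("/var/log/kern.log")` would give the outage list and boot count for one device. Comparing those across devices, switches, and cables may show whether the link drops and the reboots cluster together.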

Are there services, kernel features, or tools we can use to monitor what might be going on? There appear to be multiple issues, potentially with multiple causes. We will keep testing different switch hardware, Ethernet cables, etc., the standard approach to isolating a problematic component within a system. If that doesn’t lead anywhere, I’m also going to set up a DSO to make signal integrity measurements on various interfaces at runtime.
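
One lightweight monitoring option is to poll the interface state the kernel exposes under /sys/class/net and log every change. The sketch below is a rough illustration, not a finished tool: the function names, log path, and polling interval are all made up, and it assumes the kernel provides the `operstate`, `carrier`, and `carrier_changes` files for eth0. Each record is fsync'ed so it has a chance of surviving an abrupt power loss.

```python
import os
import time
from pathlib import Path


def read_link_state(iface="eth0", sysfs_root="/sys/class/net"):
    """Read operstate, carrier and carrier_changes for one interface."""
    base = Path(sysfs_root) / iface
    state = {}
    for name in ("operstate", "carrier", "carrier_changes"):
        try:
            state[name] = (base / name).read_text().strip()
        except OSError:
            # Missing file, or 'carrier' read while the interface is admin-down.
            state[name] = None
    return state


def append_durable(path, text):
    """Append one line and fsync it so it is on disk before a sudden reset."""
    with open(path, "a") as f:
        f.write(text + "\n")
        f.flush()
        os.fsync(f.fileno())


def monitor(log_path, iface="eth0", interval_s=1.0, iterations=None,
            sysfs_root="/sys/class/net"):
    """Write a timestamped record each time the link state changes.

    iterations=None polls forever; a number limits the loop (handy for tests).
    """
    previous = None
    count = 0
    while iterations is None or count < iterations:
        current = read_link_state(iface, sysfs_root)
        if current != previous:
            append_durable(log_path, f"{time.time():.3f} {current}")
            previous = current
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)
```

Calling `monitor("/var/log/eth0-link.log")` from a small systemd service would leave a record of link flaps that does not depend on syslog. The last timestamp written before a gap also narrows down when an unexplained reboot happened.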

Just hoping to learn whether there are any tips and tricks we could employ in addition to swapping components around.
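
On the PoE+ budget mentioned earlier, one way to double-check the "not hitting 25.5 W" assumption on assembled units is to log the module input power that tegrastats reports and compare it with the limit. This is only a sketch. It assumes tegrastats prints a `VDD_IN <current>/<average>` field in milliwatts, and `other_loads_w` is a hypothetical placeholder for whatever the carrier board, NVMe, and USB devices draw outside the module rail, which tegrastats does not see. PoE conversion losses are ignored.

```python
import re

POE_PLUS_LIMIT_W = 25.5  # PoE+ limit quoted in this post

# Assumed tegrastats field format: "VDD_IN <current_mW>/<average_mW>"
VDD_IN_RE = re.compile(r"\bVDD_IN (\d+)/(\d+)")


def vdd_in_watts(line):
    """Return the instantaneous VDD_IN reading in watts, or None if absent."""
    match = VDD_IN_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)) / 1000.0


def headroom_watts(module_w, other_loads_w=0.0, limit_w=POE_PLUS_LIMIT_W):
    """Remaining budget after the module and any other carrier-board loads.

    other_loads_w is a caller-supplied estimate (hypothetical); a negative
    result means the combined load exceeds the limit.
    """
    return limit_w - (module_w + other_loads_w)
```

Piping tegrastats output through `vdd_in_watts` line by line and logging the minimum headroom under the heaviest expected workload would show how close the assembled units actually get to the limit.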

As I continue troubleshooting, I will try increasing the kernel log level: https://linuxconfig.org/introduction-to-the-linux-kernel-log-levels
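
As a rough sketch of what raising the log level involves: the current settings live in /proc/sys/kernel/printk as four integers, and writing a single value changes only the first one, the console log level. The helper names below are made up, and writing requires root. Note that the console log level only filters what reaches the console (for example the serial port); it does not change what rsyslog writes to kern.log.

```python
PRINTK_PATH = "/proc/sys/kernel/printk"
FIELDS = (
    "console_loglevel",
    "default_message_loglevel",
    "minimum_console_loglevel",
    "default_console_loglevel",
)


def parse_printk(text):
    """Map the four whitespace-separated integers to their field names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


def read_printk(path=PRINTK_PATH):
    """Read and parse the current printk settings."""
    with open(path) as f:
        return parse_printk(f.read())


def set_console_loglevel(level, path=PRINTK_PATH):
    """Set console_loglevel only (needs root); same effect as `dmesg -n <level>`.

    Messages with a priority number lower than `level` reach the console,
    so 8 lets everything through, including debug messages.
    """
    if not 1 <= level <= 8:
        raise ValueError("level must be between 1 and 8")
    with open(path, "w") as f:
        f.write(str(level))
```

For example, `set_console_loglevel(8)` followed by `read_printk()` should show `console_loglevel` at 8 until the next reboot.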


Check the log from the UART serial console.
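
A minimal sketch of capturing that UART log on a separate host, so that anything printed just before a sudden reset is kept with a host-side timestamp. It assumes the third-party pyserial package, a USB serial adapter that shows up as /dev/ttyUSB0 (hypothetical; adjust to your setup), and the usual 115200 baud Jetson debug console. The function names and output file are made up for illustration.

```python
import time


def stamp(line, now=None):
    """Prefix one console line with the host's wall-clock time (ms resolution)."""
    now = time.time() if now is None else now
    prefix = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    millis = int((now % 1) * 1000)
    return f"{prefix}.{millis:03d} {line.rstrip()}"


def capture(port="/dev/ttyUSB0", baud=115200, out_path="uart.log"):
    """Append timestamped console output to out_path until interrupted."""
    import serial  # third-party: pip install pyserial

    with serial.Serial(port, baud, timeout=1) as ser, open(out_path, "a") as out:
        while True:
            raw = ser.readline()  # returns b"" when the 1 s timeout expires
            if not raw:
                continue
            out.write(stamp(raw.decode("utf-8", errors="replace")) + "\n")
            out.flush()
```

Running `capture()` on the logging host and leaving it attached until the next unexplained reboot should show the last messages the kernel or bootloader printed before the reset.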

