Structural reliability issues with Nvidia Jetson TX2 units

Hi there,

I’d like to ask around about an issue we’re experiencing with the Nvidia Jetson TX2 units. We use these units commercially and have had about 80 TX2s running non-stop in the field for almost a year.

The issue our customers experience is the sudden ‘disappearance’ of TX2 units from the network. Further investigation of software and OS logs shows that the unit is completely dead/frozen during the disappearance period. A power cycle brings the unit back to life as if it had simply been shut down.

The symptoms recorded so far seem to point toward the power supply, as if a slight fluctuation in the voltage causes the TX2 unit to lock up completely. Even though we use very reliable power supplies, the issues keep coming up. Perhaps anomalies in the power grid upstream of the power supply cause its output to dip or surge a bit, but we are not sure about this. Below I’ve listed the power supplies we use in the field.

Some of the reasons our gut feeling makes us think the power supply / power handling of the TX2/carrier boards is the source of the issue:

  • The units abruptly and completely disappear from the network.
  • We've had some experiences in the early stages of development of our product when we used cheap adapters (rated 12V/15W, i.e. on the verge of peak power usage). Sometimes our TX2 units would suddenly turn off for no apparent reason (and sometimes wouldn't even turn on for a while, which was very strange). When we changed to proper high-wattage power supplies these problems went away.
  • We've had a situation where a customer used the unit in combination with a solar-powered power buffer. This power buffer would sometimes 'disturb' other local devices in the same power grid. We saw the issue with this unit often, and the issue went away when the unit was placed in another power grid position.

Now, the first thing that’s not clear is whether the source of the issue is the TX2 unit itself, or the carrier board. We use two kinds of carrier boards in the field:

Some of the power supplies we use:

Some other notable remarks:

  • It happens to any unit, not just a few single ones that always fail.
  • The units live in the same network and fail intermittently, i.e. there seems no relation to events in the network.
  • The issue has occurred in about 15% to 20% of all units, which is quite devastating for our reliability figures.

Is there by any chance anyone who has seen similar issues with the TX2 units? Or, can someone point us in the right direction as to where to search for the culprit? Any help is appreciated!

Thanks in advance,

Kris

Hi kris.van.rens,

Is there any log you can attach here for further investigation?
Can you reproduce this issue on the original TX2 devkit?
Which BSP do you use on those TX2s?

Hi, thanks for your reply!

Do you mean a log as a “log file”? Because the only thing worth mentioning here is that the system is just running normally, and then when the disappearance issue occurs the logging stops (as if someone were to pull the plug on the unit). There is no relevant logging of the issue event itself – or at least not that we’ve found so far.

We have not been able to reproduce the issue with the development kit thus far. The issue seems very hard to reproduce anyway. We have an array of local duration test units, all of which run on the aforementioned power supplies, and we’ve never seen the issue in these units. They just have been running reliably for many months on end.

The only thing we’ve seen locally is the case with the under-qualified simple 15W adapters (as described in the issue report). They seem to induce behavior similar to the described issue, and sometimes the unit is not able to wake up for several minutes. But given that the adapter is an under-qualified power supply, this hardly seems like a valid reproduction case.

The legitimate issue cases happen in the field, where units are placed in remote areas, in closed cases, in far away countries :-(

We’re a software company with a strong electrical engineering background. We can reason about the issue, but unfortunately we don’t have the tools to reproduce complex power wave forms or the like.

We used the BSP versions as provided by the carrier board supplier.

Some units are running an installation based on JetPack 3.1, some use JetPack 3.2.1, most use a version based on JetPack 4.1.

However, there seems to be no correlation with any of these versions, as units of all versions have seen the issue.

In cases where logging stops, you may actually see output on the serial console that the logs do not show. If you can, you might try to run a serial console on a unit where you suspect a failure might occur. The serial console shows a certain amount of log detail already, but you could also run “dmesg --follow” on it, for example (just don’t forget to start logging in the terminal program, since its buffer is usually not large).
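A minimal sketch of the host-side logging suggested above, assuming the serial console output reaches the script as a stream of text lines (for example piped from a terminal program, or read from the serial device opened as a file). The function name `log_with_timestamps` and the log file name are hypothetical. Each line is stamped with the host's clock and flushed immediately, so the last line written before a freeze survives on the host even though the TX2 itself logs nothing.

```python
import time
from typing import Callable, Iterable, TextIO


def log_with_timestamps(lines: Iterable[str], out: TextIO,
                        clock: Callable[[], float] = time.time) -> int:
    """Copy each console line to `out`, prefixed with a UTC host timestamp.

    Flushes after every line so nothing is lost in a buffer if the
    logging host or the monitored unit goes down unexpectedly.
    Returns the number of lines written.
    """
    count = 0
    for line in lines:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(clock()))
        out.write(f"{stamp} {line.rstrip()}\n")
        out.flush()
        count += 1
    return count


# Hypothetical usage, with the terminal program's output piped to stdin:
#   import sys
#   with open("tx2_serial.log", "a", encoding="utf-8") as log_file:
#       log_with_timestamps(sys.stdin, log_file)
```

The host timestamp of the final line gives the moment the console went silent, which can then be compared against events on the power side.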

Thanks for getting back to me so quickly, and my sincere apologies for not responding sooner.

We’ve been digging into the issue more and it is very likely related to power supply voltage instabilities. We’re currently evaluating the use of DC/DC converters for the problematic situations. It is very interesting to see how the TX2 reacts to input voltage drops and peaks.

I will get back to report about the eventual solution.

FYI, I would expect the 2.1A supply could fail to provide enough current at peaks. You might still consider placing a large capacitor right at the connector (and connectors themselves sometimes gain resistance over time from thermal creep if temperature slowly changes on a daily or seasonal basis). Any PCIe or USB devices can also change supply demands.
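As a rough way to size such a capacitor, the ideal relation C = I · Δt / ΔV gives the capacitance needed to carry a constant load current I through a supply dip of duration Δt while the voltage sags by at most ΔV. The sketch below is only a first-order estimate: it ignores capacitor ESR, connector resistance and the fact that a regulated load draws more current as its input voltage falls. The function name and the example numbers (15 W at 12 V, a 1 ms dip, 1 V of allowed droop) are illustrative assumptions, not measured values.

```python
def holdup_capacitance_farads(load_current_a: float,
                              dip_duration_s: float,
                              allowed_droop_v: float) -> float:
    """Ideal bulk capacitance needed to ride through a supply dip.

    Uses C = I * dt / dV, i.e. a constant-current load discharging
    an ideal capacitor. Real parts need extra margin on top of this.
    """
    if load_current_a < 0 or dip_duration_s < 0:
        raise ValueError("current and duration must be non-negative")
    if allowed_droop_v <= 0:
        raise ValueError("allowed droop must be positive")
    return load_current_a * dip_duration_s / allowed_droop_v


# Assumed example: 15 W peak load on a 12 V rail, 1 ms dip, 1 V droop.
peak_current_a = 15.0 / 12.0  # 1.25 A
capacitance_f = holdup_capacitance_farads(peak_current_a, 1e-3, 1.0)
print(f"{capacitance_f * 1e6:.0f} uF")
```

Under these assumed numbers the estimate comes out on the order of a millifarad, which suggests that a capacitor at the connector can help with short dips but not with long ones.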

If you have a unit (or several units) where you believe there will be a failure, then if you could, I would literally keep a serial terminal running on one (or more) TX2s with “dmesg --follow” running continuously. The log of what goes on right at the moment of failure would be very useful…or if the serial console logs nothing, then that too would be useful. I’m just guessing, but if nothing gets logged, then I’d think a power supply issue becomes more likely as the cause.

That’s actually a good point. We considered the capacitor already, and will definitely test this.

Office experiments with the power supply showed that nothing is logged. We’ve already installed continuous, extensive logging of resources, but this has not yet led to any more insights. It’s just like the systems simply ‘disappear’.

Again, thanks for the help!
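One way to get more out of continuous resource logging of this kind is to locate the 'disappearance' windows automatically, so they can be lined up across units and sites. The sketch below assumes each unit's log yields one timestamp (in seconds) per sample at a roughly fixed interval. The function name `find_gaps` and the threshold are hypothetical and not part of any existing tooling.

```python
from typing import Iterable, List, Tuple


def find_gaps(timestamps: Iterable[float],
              max_gap_s: float) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where logging went silent.

    A gap is any pair of consecutive samples more than `max_gap_s`
    apart. Input does not need to be sorted.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_gap_s:
            gaps.append((previous, current))
    return gaps


# Assumed example: samples every 10 s, with one long silence.
samples = [0, 10, 20, 500, 510]
print(find_gaps(samples, max_gap_s=30))
```

If the gaps found on different units at the same site start at nearly the same time, that would point toward a shared cause such as the local power grid rather than an individual unit.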

Is the logging via serial console? This is the part which keeps running under the most severe circumstances. If your system fails to log even from the serial console, then this too is a clue. One possibility, if this is serial console logging, is to run “htop” or another monitoring tool on the serial console…and then when the failure occurs, the last recorded monitoring output will still be present on the serial terminal. “dmesg --follow” would be one candidate. Unfortunately there is only one serial console, so you can only monitor one thing at a time this way.