Jetson TX2 devices rebooting unexpectedly

Hi there,

Thank you in advance for any help.

Brief description of the situation:
We have a number of Jetson TX2 devices working in the field, and some of them are showing instability, rebooting roughly every hour.

Some facts we’ve learnt so far:

  • Reboots are happening in a somewhat predictable cycle (every 1:15 or 1:45 hours). We have cron jobs running at 15-minute intervals.

  • Reboots seem to be related to I/O activity on an external drive we use for storage; if we disable that feature the reboots cease.

  • Some devices work fine even with that external drive I/O, while others are heavily affected. All devices are in the same facility, within a 200 m range.

  • They are operating in sub-zero (°C) temperatures most of the time.

  • The reboots seem to be sudden, as there is no shutdown signature in the syslog. We have performed planned reboots and the shutdown sequence is present; however, when a node reboots unexpectedly there is no sign of an orderly shutdown.

  • Also, normal manually triggered reboots take longer than the unexpected reboots (probably because the shutdown process is skipped).

  • There are two power supplies, one 12 V and one 24 V (the latter for other devices). Both are rated at 120 W, which seems sufficient.

  • The problem seems to be controllable from software, depending on whether the file operations to the external hard drive are enabled or not.

$ head -1 /etc/nv_tegra_release

R32 (release), REVISION: 3.1, GCID: 18186506, BOARD: t186ref, EABI: aarch64, DATE: Tue Dec 10 07:03:07 UTC 2019

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic

Errors in syslogs that caught our attention on affected nodes:
[Screenshot of syslog output showing regulator error messages]

Any help would be greatly appreciated.
Regards.

Are the hard drives powered by the TX2, or by an external source? Is the power to the TX2 well regulated and isolated from other non-TX2 power consumers? I can easily see power fluctuations (much smaller than what most people would consider significant) as a cause, and if the hard drive draws power at the moment the software runs, then this would be a good possibility. Not all drives consume the same power, power draw depends partly on the data being transferred, and some TX2s may be more tolerant of a certain level of power change than others.

As an experiment, if the drives are not externally powered, or if there is anything else connected which might draw significant power (you can exclude mouse/keyboard), then you might find out what happens if these become externally powered. Or simply move other power-consuming devices off of that battery for testing.
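If it would help to quantify the power side, one option (just a sketch, assuming a stock L4T install where the tegrastats utility is present) is to log the module's power rails while the copy job runs and see whether the reboots line up with spikes or sags:

$ sudo tegrastats --interval 1000 --logfile /tmp/power.log

On a TX2 the output includes rail readings such as VDD_IN, so a dip or spike at the moment the USB copy starts would point at power delivery. This is only for gathering evidence and does not replace the externally-powered test.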

FYI, that regulator message is power related as well, and it applies to the SD card. However, there are other regulator errors prior to that one, so this sort of points at power delivery. Power delivery might be an onboard issue due to circuits, or it might be an external issue if something draws power at the wrong level or timing. So it all sort of fits together.


Hi linuxdev

I really appreciate your comments.

From your insightful reply, I can see that power delivery is something to focus on in this case. I'm starting to see the picture here.

Please allow me to answer some of your questions:

Q) Are the hard drives powered by the TX2, or by an external source?
A) The hard drive is being powered by the TX2 device, through USB3.

Q) Is the power to the TX2 well regulated and isolated from other non-TX2 power consumers?
A) It seems to be the case. The only device sharing the power supply is a router, which is a minimal power consumer (3.0 W). Everything else takes power from the other 24 V power supply.

One thing maybe worth mentioning, though, is that we are mounting and unmounting the hard disk continuously, every time we use it (say every 15 minutes); we are unsure whether that could cause any issues.
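For reference, the cycle looks roughly like the sketch below; the device name, mount point and source directory are placeholders rather than our exact script:

$ sudo mount /dev/sda1 /mnt/external      # external USB3 HDD
$ rsync -a /data/export/ /mnt/external/export/
$ sync
$ sudo umount /mnt/external

This is driven by cron roughly every 15 minutes.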

We can be almost certain that HDD activity plays a role here; however, I have some questions I'd like to share:

  • Why would two identical systems, with the same TX2 device, the same load and the same hard drive model, behave so differently, with one of them failing and constantly rebooting and the other not? They are only metres apart from each other.

  • Also, why would the failing system suddenly start failing after months of running normally? Could an OS or software update play a role here? We had auto-update disabled, but we have had issues with other software being updated with bad results, so maybe some nodes are halfway through an update.

And finally:
Should we try to fix those regulator messages too?

Your help is much appreciated.
Best regards,

Hi,

If this board “reboots” but does not “power down”, then it is probably triggered by some kernel panic which is not recorded in syslog. Are you able to get the UART log from these boards? It would be the most precise way to check which driver is going wrong.
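For example, assuming the carrier board exposes the debug UART and you have a 3.3 V TTL USB serial adapter on a host PC (the device node below is only an example), something like this can record the console until the panic happens:

$ sudo picocom -b 115200 --logfile tx2-uart.log /dev/ttyUSB0

Leave it running so the backtrace from the unexpected reboot is captured in the log file.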

What kind of work is triggered by the cron jobs?

Is it possible to operate this in a higher-temperature environment? This is only for debugging; we want to see if temperature is a factor.

I don’t think you need to resolve the regulator issue. Those messages also appear on the devkit.

According to the log you shared, it looks like a custom carrier board. Is it possible to use the same USB drive on a devkit to reproduce this issue?


Hi WayneWWW

Thank you for your comments, really appreciated.

Agreed, we seem to be having hard resets. I'll arrange for field technicians to grab those UART logs. I believe the nodes need to be rebooting themselves by the time the UART logs are grabbed, otherwise the panic records might be lost (the nodes aren't behaving like this consistently). I'll share anything we have as soon as it becomes available.

Regarding the jobs triggered by cron, they are pretty intensive file copies from the Jetson device onto the external hard drive. This activity is clearly correlated with the unexpected reboots. Nothing special though, just I/O related to exporting files to the external HDD.
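To double-check that correlation, the kind of comparison we are doing is sketched below; the job name is a placeholder for our actual script:

$ last -x reboot | head
$ grep CRON /var/log/syslog | grep export_job | tail

Whenever the export job is enabled, the reboot timestamps follow the cron runs closely.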

I’m not sure if temperature plays a role, to be honest; very close to the failing nodes there are perfectly fine nodes, all sharing more or less the same environmental conditions. As a matter of fact, this is also happening (to a lesser extent, though) in other regions where the temperature is well above zero (15 °C+).

Regarding the regulator issues: got it, thanks for the clarification.

Yes, we’re trying to reproduce this situation in the lab; hopefully that will bring some more conclusive information.

Just as a side comment, as I said earlier, we’re connecting the auxiliary external drive, where we’re copying the files, using one of the USB 3.1 sockets available on the device. As far as I can see this is a pretty standard setup that shouldn’t be creating any overload or pressure on the device, let alone making it restart in a catastrophic way.

Thanks again for your thoughtful comments. Your help is much appreciated.
Regards,

I didn’t see a reply to @WayneWWW’s question: is this a reboot, or a shutdown? I suppose with a custom board it might be a shutdown, and then a custom setup could cause it to boot again. Is the custom carrier board designed such that, if a brief power spike were to cause a shutdown, it would boot again? The question of whether it truly reboots or switches off is rather important, but if the carrier board has some feature to make sure it boots again even when it powers off, then that distinction has to be made. Narrowing down whether it was a kernel panic or a hardware shutdown is very important.

For the USB hard disk, is it possible to use it with a powered HUB, even if only for testing? Is this an old style mechanical drive, or is it solid state? Is the disk put in any kind of power savings mode between uses? Is USB actually disconnected and reconnected upon use?

Just some general comments on old style mechanical hard drive behavior: At startup they draw a surge of power to get the spindle up to speed. Long read or write operations will change power draw depending on how much head movement is required. A long operation where the data is more or less contiguous requires less power (due to less head seeking) than does one where the data is fragmented and needs more head seek operations. Sometimes fragmentation (and thus power used for head seek) is different based on filesystem type. What filesystem type is the drive using? Ext4? NTFS? VFAT?
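If it helps to answer those, a quick way to check the filesystem type and the drive's power-management state (the device name /dev/sda is only an example; substitute the actual disk) would be:

$ lsblk -f
$ sudo hdparm -B /dev/sda     # APM level; low values allow aggressive spin-down
$ sudo hdparm -C /dev/sda     # current power state (active/idle/standby)

Note that hdparm may need to be installed, and not every USB-to-SATA bridge passes these queries through to the drive.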

This is kind of stretching the imagination, but depending on how you answer about power savings modes, I am thinking that if some odd circumstance does not give the drive time to become ready, then this might cause a kernel panic or software error during mount or read/write operations.

What is the exact sequence and timing for any kind of drive wake, mount/umount, and other operations? Do logs show a need to fsck-repair the filesystem? After one shutdown during a write there would be data damage…if the damage is small enough that the journal recovers it, then no fsck is needed, but part of the data would be missing. If enough data was being written at the time of the first shutdown, and the journal is not large enough, then an fsck would be required (and even more data would be missing). Any time there is a repair or journal operation, the time to access the drive is increased. How well does the software test that the drive is truly ready? The very first failure might lead to a chain of failures if this results in trying to use the drive too soon.
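One way to look for that evidence in the logs (assuming an ext4 filesystem on a disk named sda; adjust to match your setup):

$ sudo dmesg | grep -iE 'ext4|recover|fsck'
$ grep -iE 'ext4|recover|fsck' /var/log/syslog

Journal recovery or EXT4-fs error messages appearing right after a reboot would confirm the drive was being written to when the system went down.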

Just to reiterate, a lot of power issues could be ruled out if the problem is the same with a powered USB3 HUB isolating the USB hard drive from drawing power directly off of the Jetson.

NOTE: It might also be interesting to check whether the USB3 port is in autosuspend. One could test with autosuspend disabled.
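A minimal way to test that (the sysfs path is an example; check which entry under /sys/bus/usb/devices corresponds to the drive):

$ cat /sys/bus/usb/devices/2-1/power/control      # "auto" means autosuspend is allowed
$ echo on | sudo tee /sys/bus/usb/devices/2-1/power/control

Setting it to "on" keeps the device fully powered; alternatively, booting with usbcore.autosuspend=-1 on the kernel command line disables USB autosuspend globally.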

Hi linuxdev,

Again thank you for your help.

At the minute I’m focused on trying to capture the panic that is happening, in order to have a concrete pointer to where the panic originates in the first place. My idea is that if I could see the panic origin I’d be in a better position and could stop guessing.

I’m trying the kexec/kdump approach, which seems to be working fine in an Ubuntu VM I’ve created for this. However, I’m having a tough time trying to port this to the Jetson world, as it uses U-Boot as its bootloader (ARM).
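For reference, what works in the x86 Ubuntu VM is roughly the standard Ubuntu setup below; I have not managed the equivalent on the TX2 yet:

$ sudo apt install linux-crashdump
$ kdump-config show                        # should report the crashkernel reservation and "ready to kdump"
$ echo c | sudo tee /proc/sysrq-trigger    # deliberate crash to confirm a vmcore gets written

The part I am stuck on for the Jetson is reserving the crashkernel memory and loading the capture kernel under U-Boot.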

So far I couldn’t find a proper guide to implement this. Do you have any pointers for kexec/kdump on a Jetson TX2 based device?

Thanks in advance!
Regards,

I do not have any useful knowledge for observing this with kexec/kdump. I have not personally set this up.

I still suggest trying with the USB disk going through a powered HUB. If there is a power issue, then any kind of dump from the kernel isn’t going to point you to anything other than what software was running at the moment power became unstable. Perhaps it is a software issue, but if so, then you’d still want to isolate the disk power draw to something external during the test.

Hi All.

I’d like to comment that I finally found the reason for these mysterious reboots.

We had some USB transfer speed issues, so the net bandwidth was really low and file operations to the external HDD were taking far too long. That made the kernel think that something really bad was going on, so it panicked and then rebooted.

The fix finally was:
sudo sysctl -w kernel.hung_task_timeout_secs=300

just extending that tolerance was enough to control the reboots.
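For anyone else hitting this: the default hung-task timeout is 120 seconds, and whether a hung task actually causes a panic depends on kernel.hung_task_panic being set on your kernel, so it is worth checking both. To make the new value persist across reboots (a sketch; the file name is arbitrary):

$ sysctl kernel.hung_task_timeout_secs kernel.hung_task_panic
$ echo "kernel.hung_task_timeout_secs = 300" | sudo tee /etc/sysctl.d/99-hung-task.conf
$ sudo sysctl --system

Of course this only raises the tolerance; the slow USB transfers themselves are still worth investigating.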

I hope this might help someone else going through this issue.

Thanks for the help given.
Regards,