TX2 failing to boot after several months of usage (ERROR: Invalid GPT)

Hello all,

We have had several TX2 modules (on Aetina ace n622) fail after several months of usage. The timeline is not consistent: a module can fail anywhere from 6 to 12 months in.

I have saved a typical boot log file, see below.

log (1).txt (26.5 KB)

Is this a memory issue? Does the EMMC go bad after a (short) while? Can this be avoided altogether by using an external storage device to save stuff to (instead of having it saved on the EMMC)?
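One way to put a number on the eMMC wear question, as a rough sketch only: eMMC 5.0 and later parts report a lifetime estimate and a pre-EOL status, and newer Linux kernels expose them in sysfs as `life_time` and `pre_eol_info`. The sketch assumes the device is `mmcblk0` and that the kernel on the module exposes these attributes at all, which is not confirmed for the L4T kernel on the TX2. The function names are made up for illustration.

```python
from pathlib import Path
from typing import Tuple

# Assumption: the eMMC is mmcblk0 and the kernel exposes these attributes.
SYSFS_DEV = Path("/sys/block/mmcblk0/device")

# JEDEC PRE_EOL_INFO values (reserved-block consumption).
PRE_EOL = {1: "normal", 2: "warning", 3: "urgent"}


def decode_life_time(code: int) -> str:
    """Translate a JEDEC DEVICE_LIFE_TIME_EST value into a rough wear range."""
    if code == 0:
        return "not defined"
    if 1 <= code <= 10:
        return f"{(code - 1) * 10}-{code * 10}% of estimated life used"
    if code == 11:
        return "exceeded estimated life"
    return "reserved value"


def parse_life_time(text: str) -> Tuple[int, int]:
    """Parse the two hex fields (type A, type B), e.g. '0x01 0x02'."""
    type_a, type_b = text.split()
    return int(type_a, 16), int(type_b, 16)


def report(dev: Path = SYSFS_DEV) -> None:
    """Print the wear estimate, or say so if the kernel does not expose it."""
    life = dev / "life_time"
    eol = dev / "pre_eol_info"
    if not life.exists() or not eol.exists():
        print("life_time / pre_eol_info not exposed by this kernel")
        return
    type_a, type_b = parse_life_time(life.read_text())
    eol_code = int(eol.read_text().strip(), 16)
    print("type A:", decode_life_time(type_a))
    print("type B:", decode_life_time(type_b))
    print("pre-EOL:", PRE_EOL.get(eol_code, "not defined / reserved"))


if __name__ == "__main__":
    report()
```

Running this on a healthy module and on one that has started misbehaving would at least show whether write wear is a plausible suspect. It says nothing about failures unrelated to wear.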

Thanks for the input!

Are you able to reflash the board?

Most times yes; only once were we unable to reflash.

Hi,

I mean, are you able to reflash your board now?

Hi Wayne

I am not. I am getting “Failed to flash/read t186ref”.

Note, this particular board does not output anything on HDMI (not even the Nvidia logo) and there is nothing on the serial port at boot as well.

Thanks for the help

Hi,

Your comments conflict with each other… if there is no log on the serial port at boot, then how did you dump log(1).txt in your first comment…

I understand, my first post was about the several boards I have seen go bad until now. And I posted a log file from back then.

You asked about a particular board and I assumed you were talking about the latest board I had issues with.

Correction to my last post: trying to flash a backup image gave me an error. Flashing with SDK Manager is currently at 50% and still going.

How many Jetsons do you have on your side?

Basically, if you are sure your method of dumping the log is fine, your method of flashing the device is correct, and only one module has this issue, then it sounds like a hardware problem.

We have 12 TX2s, of which 3, maybe 4, have had an issue requiring flashing (when possible).

I am also thinking this is a hardware issue, but neither Aetina nor Nvidia has been able to diagnose it.

If you need us to diagnose, then you need to at least be able to dump the UART log. If the UART log cannot be dumped in either the boot or the “flash” case, then it could be a hardware issue.

Also, if you have NV devkit, try to use NV devkit to check. We don’t know the board design of Aetina.

I was talking about the past issues we have had that were left undiagnosed. Anyway, maybe we need to focus on this one for now.

I could not flash it with the SDK manager, it hung at 99% and the flash never succeeded. So I guess it is a hardware issue.

What could be some causes for this? The board is getting power from a 12V dc/dc converter, there is a 4A inline fuse, maybe some other considerations need to be taken? Could it be a failing EMMC? …

We do have a dev kit and there are no issues flashing or getting a UART log from it

Hi,

I just hope you can understand: we are not some kind of Jetson god who can tell what is going on with no information. So far, you haven’t shared anything helpful. You have only told us that you cannot flash and you cannot boot.

If the same module works fine when it is moved to the devkit, then please contact the board vendor to check their custom board.

NV moderators here can only deal with devkit issues if your custom board cannot provide any kind of log output.

I just want to point out that often this is a USB issue, which in turn means it is sometimes a carrier board issue or cable issue (or even a host PC port issue). USB signals in general can change with “minor” changes, e.g., cable length, even though the hardware is otherwise “good”. If you can’t try it on a known good carrier board, then perhaps you could try different cables (about 2 out of 3 “charger” cables fail when used with data at their full speed for large amounts of data) and different host ports (signal quality can change with port change).

Power supply regulation quality can also matter (Jetsons are quite sensitive to this). I doubt it is the issue in this case, but if you have another power supply to try, then you might also swap this (power regulation and filtering can change what noise is on USB, and this in turn can make USB fail in corner cases…larger amounts of data implies a greater chance for any kind of noise to cause failure). If you had another carrier board to try I wouldn’t bother suggesting this, but it is one of the “cheapest” things to test since you don’t have another carrier board.
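A cheap first check when swapping cables and host ports, sketched here under some assumptions: flashing needs the module to enumerate on the host in recovery mode, and it then shows up in `lsusb` under NVIDIA's USB vendor ID `0955`. The helper name is made up, the product ID is deliberately not checked since it varies by module, and this only confirms enumeration, not signal quality under a long transfer.

```python
import re
import subprocess
from typing import List

NVIDIA_VENDOR_ID = "0955"  # USB vendor ID assigned to NVIDIA

# Matches the "ID vvvv:pppp" field that lsusb prints for each device.
_LSUSB_ID = re.compile(r"ID\s+([0-9a-fA-F]{4}):([0-9a-fA-F]{4})")


def nvidia_devices(lsusb_output: str) -> List[str]:
    """Return the lsusb lines whose vendor ID is NVIDIA's."""
    hits = []
    for line in lsusb_output.splitlines():
        match = _LSUSB_ID.search(line)
        if match and match.group(1).lower() == NVIDIA_VENDOR_ID:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    # Run on the host PC with the Jetson connected and in recovery mode.
    result = subprocess.run(
        ["lsusb"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    found = nvidia_devices(result.stdout)
    if found:
        print("NVIDIA device(s) visible on USB:")
        for line in found:
            print("  " + line)
    else:
        print("No NVIDIA device enumerated; try another cable or host port")
```

If the device never shows up, or comes and goes between runs with one cable but not another, that points at the USB path rather than the module.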

thanks @linuxdev and @WayneWWW for the support!

@linuxdev I will try a different USB cable and a different power supply. However, I have been using those for a while and I never had a problem flashing unless there was a problem with the board.

@WayneWWW I think from the start there has been a mix-up between the problem I am currently facing with one board and the previous problems we have faced.
We have had a case where the board was sent to Aetina, who then sent the TX2 module to Nvidia, and Nvidia could not diagnose it. I understand sometimes it just isn’t possible to find the reason why things fail. That is not the issue. What I want to avoid is more of the boards going bad. Granted, you do not have much information to go on; I do not either. So because of the lack of information I am forced to ask very vague questions. I do not expect to get an answer that will solve my issue right away. However, maybe trying one or two things might just do that, even if it is a shot in the dark!
Maybe you can refer me to some installation guidelines or best practices that avoid common hardware problems. Understandably you cannot be familiar with every carrier board out there; please be assured that I am also trying to get to the bottom of things with the guys at Aetina.

That being said, this time I actually have “valuable” information to share. Another one of our boards failed. These last few weeks we would have to power cycle it several times before it would boot. As of today it just shows nothing on HDMI and almost never has a boot log. But I did get 2 of the last boot logs. Please see attached. The 3rd log (badbad.txt) is one I get 1 out of 15 times I power on the board, I never get anything on HDMI though.

bad.txt (22.2 KB)
badbad.txt (699 Bytes)

In this one I could actually log in and then a few seconds later it just rebooted and from then it just went into a never ending boot loop.

goodthenbad.txt (52.4 KB)

thanks again!

Hi @WayneWWW , @linuxdev , any update on the log files I sent?

Hi,

The error from “mmc0” indicates that something may be wrong with the eMMC.
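To compare logs from several boards for the same symptom, a small filter along these lines might help. It is only a sketch: the keyword list is a guess at what a troubled eMMC prints, not taken from the attached logs, and the function name is made up.

```python
import re
import sys
from typing import Iterable, List, Tuple

# "mmc0" as a whole word, so "mmcblk0" partition messages are not matched.
_MMC0 = re.compile(r"\bmmc0\b")
# Assumed keywords; adjust to whatever the real logs actually contain.
_TROUBLE = re.compile(r"error|timeout|fail|crc", re.IGNORECASE)


def mmc0_trouble(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for mmc0 lines that mention a problem."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if _MMC0.search(line) and _TROUBLE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python3 scan_mmc0.py bad.txt goodthenbad.txt
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in mmc0_trouble(handle):
                print(f"{path}:{number}: {text}")
```

Seeing the same mmc0 message at the same point in the boot on every failed module would be a stronger hint of a common cause than any single log.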

There are some points to clarify

  1. I don’t think an “emmc error” would make your board unable to dump the UART log anymore.

  2. If only one module has this problem, please RMA it.

  1. Any idea what could cause the UART log to stop working? Anything I can try? Most likely a hardware issue?

  2. This is the second module with that problem. I did send a third one for RMA months ago, but I am not sure it had the same problem as these other 2.
