SATA failure during temperature rise

igal.kroyter · June 10, 2019, 6:52am

Hi,
we’re using the TX2i module. During a temperature rise from -10C to 10C (at ~6C) we see a SATA failure that recovers after 1 second and does not recur. It is consistently reproducible, though: it happens every time the module goes through the same temperature regimen.
The log output is below:

[  967.608486] ata1: exception Emask 0x10 SAct 0x0 SErr 0x5980000 action 0xe frozen
[  967.615908] ata1: irq_stat 0x00000040, connection status changed
[  967.621927] ata1: SError: { 10B8B Dispar LinkSeq TrStaTrns DevExch }
[  967.628289] ata1: hard resetting link
[  968.531677] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  968.538092] ata1.00: failed to get Identify Device Data, Emask 0x1
[  968.544579] ata1.00: failed to get Identify Device Data, Emask 0x1
[  968.550800] ata1.00: configured for UDMA/133
[  968.567669] ata1: EH complete
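
For readers who want to check the `SErr` value against the flag names on the `SError:` line, here is a minimal sketch of a decoder. The bit positions follow the SATA SError register layout as I understand it, and the short names are assumed to mirror the ones the kernel prints; `SERR_BITS` and `decode_serr` are hypothetical names for this illustration, not part of any tool.

```python
# Assumed mapping of SError bit positions to the short names seen in the
# kernel log. Bits 0-15 are the error field, bits 16-31 the diagnostics field.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # SErr value taken from the log above.
    print(" ".join(decode_serr(0x5980000)))
```

Under this mapping, every bit set in `0x5980000` lies in the diagnostics half of the register (bit 16 and above), which is where link and PHY events are reported.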

Notes

We have run this test with multiple modules and multiple SSDs and always got this failure.
We have connected a differential probe to the SATA lanes and the "eye" looks OK.

Is there a way to make it go away? Our unit is in the middle of accessing the external SSD device.

Thanks.

Trumany · June 10, 2019, 8:59am

Hi, are you testing on the devkit or on a custom carrier board? If the latter, what’s the result with the devkit and the same software?

igal.kroyter · June 10, 2019, 9:04am

Hi,

I tested it with a custom board. I was under the impression that the devkit is only for temperatures above 0C, isn’t it?

Trumany · June 11, 2019, 2:36am

Yes, the carrier board is for above 0C, but the issue is at ~6C, right?

Does this failure still occur at room temperature? Is the SSD at the same temperature or not? Did you test and analyze the full timing of the SATA signals during power-on?

WayneWWW · June 11, 2019, 3:19am

Could you also share the full dmesg instead of this partial error?

igal.kroyter · June 11, 2019, 5:41am

Hi,

We are actually running a burn-in test in which the Jetson TX2i is part of the system. This problem, a momentary SATA crash, occurs only when the temperature gradient is high (10C/minute); when the gradient is lower the issue does not occur at any temperature. When the problem occurs, it is always at the same temperature.
I have attached the full dmesg; at the end of the log you can see the printed temperatures and the crash-recovery dump. These printouts are produced by our application, which also mounts a SATA SSD.

I know that controllers that handle high-frequency links (SATA, PCIe) have to adjust the lane configuration as the temperature changes. Is that the case here, and is there something I can do about it?

Any other ideas?

Regards.

Bibek · June 11, 2019, 12:58pm

TX2i has an operating range of -40 to 85C.
Can you please run your test on devkit?

Trumany · June 13, 2019, 1:29am

Several questions:

10C/min is not very fast for the HW itself.

For the Jetson module, how are you controlling the temperature of the chip in the experiment?

Does the failure depend on the ramp speed or on the boot temperature? Also, to be clear, does the issue occur on both a rise and a fall of temperature, or only on a rise?

igal.kroyter · June 13, 2019, 5:06am

bbasu, hi,
we are waiting for chamber time for the testing with the devkit.

Trumany, hi,
What do you mean by controlling? I did not set any threshold value, I use the default settings.
We have seen the failure only while the temperature is rising (we did not test while the temperature drops).
The module is turned on (and completes boot) while soaked at -40C, and then the temperature is raised.
I think that, like “all” poorly understood failures, it is a combination of a few factors: the temperature change rate and the fact that there are different bus control configurations (as I noted in #6). Again, these are my assumptions.

BTW, we have run the test again in a smaller (uncontrolled) chamber and the failure occurred at 23C.

Regards.

Trumany · June 28, 2019, 3:54am

Have you tested with the DevKit?

So the failure occurred at 23C in the smaller chamber and at 6C in the previous chamber?

igal.kroyter · July 1, 2019, 6:08am

Hi,

We’ve tested the DevKit too, and it failed with the same message (see #1). During the test we set a target temperature on the chamber and read the chamber’s temperature sensor: 60C. It looks like we have the same behavior on the DevKit and on the custom board. We tested it with the smaller chamber.

Any advice will be appreciated.

Trumany · July 1, 2019, 6:28am

Is it -60C? I’m a little confused by this. Could you please list the test conditions on the custom board and on the DevKit board clearly? And please print out the values of the thermal zones repeatedly during the test so as to capture the real chip temperature when the failure happens. You can refer to this topic to read thermal zones: https://devtalk.nvidia.com/default/topic/1032887
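
One way to print the thermal zones repeatedly is a small polling script. This is only a sketch: it assumes the zones sit under `/sys/devices/virtual/thermal` (the path used elsewhere in this thread) and that each `temp` file holds millidegrees Celsius, as is usual for Linux thermal sysfs. `read_zones` and `log_zones` are hypothetical helper names.

```python
import glob
import os
import time

THERMAL_ROOT = "/sys/devices/virtual/thermal"  # assumed location of the zones


def read_zones(root=THERMAL_ROOT):
    """Return {zone_name: temperature in C} for every readable thermal zone."""
    readings = {}
    for zone_dir in sorted(glob.glob(os.path.join(root, "thermal_zone*"))):
        try:
            with open(os.path.join(zone_dir, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        readings[os.path.basename(zone_dir)] = millideg / 1000.0
    return readings


def log_zones(interval_s=1.0, count=None, root=THERMAL_ROOT):
    """Print one timestamped line per sample; count=None loops until Ctrl-C."""
    n = 0
    while count is None or n < count:
        zones = read_zones(root)
        stamp = time.strftime("%H:%M:%S")
        line = " ".join(f"{name}={temp:.1f}" for name, temp in zones.items())
        print(stamp, line, flush=True)
        n += 1
        if count is None or n < count:
            time.sleep(interval_s)


if __name__ == "__main__":
    # Three samples as a quick check; pass count=None for a full chamber run.
    log_zones(interval_s=1.0, count=3)
```

Redirecting the output to a file on storage other than the SATA SSD keeps the temperature log intact across the link reset, and the timestamps can then be lined up with the `ata1` messages in dmesg.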

igal.kroyter · July 1, 2019, 6:49am

Trumany, hi,
It is +60C, NOT -60C (I have updated message #11).
Moreover, the smaller chamber does not give stable results. For example, in the following test the failure occurred when the chamber was at 44C:

Time    Chamber temperature (C)
10:54   -40
10:55   -5
10:56   +12
10:57   +30
10:58   +44   ** SATA error
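
To compare this run with the 10C/minute gradient mentioned earlier in the thread, the per-minute slope can be worked out from the table above. This is a minimal sketch: the sample values are copied from the table, `minutes` and `ramp_rates` are hypothetical helper names, and the figures are chamber sensor readings, not chip temperatures.

```python
# Chamber log from the table above: (time as HH:MM, chamber temperature in C).
samples = [("10:54", -40), ("10:55", -5), ("10:56", 12), ("10:57", 30), ("10:58", 44)]


def minutes(hhmm):
    """Convert an HH:MM string to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)


def ramp_rates(points):
    """Return the temperature slope in C/minute between consecutive samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        rates.append((c1 - c0) / (minutes(t1) - minutes(t0)))
    return rates


print(ramp_rates(samples))  # -> [35.0, 17.0, 18.0, 14.0]
```

Every interval in this run is steeper than 10C/minute at the chamber sensor.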

Below are the readings from the module:

nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone7/temp
0
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone6/temp
100000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone5/temp
-43500
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone4/temp
-45000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone3/temp
-34000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone2/temp
-36000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone1/temp
-34000
nvidia@tegra-ubuntu:~$
nvidia@tegra-ubuntu:~$ [  337.702080] ata1: exception Emask 0x10 SAct 0x0 SErr 0x59d0000 action 0xe frozen
[  337.709565] ata1: irq_stat 0x00000040, connection status changed
[  337.715738] ata1: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq TrStaTrns DevExch }
[  337.723843] ata1: hard resetting link
[  338.623311] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  338.629976] ata1.00: failed to get Identify Device Data, Emask 0x1
[  338.636676] ata1.00: failed to get Identify Device Data, Emask 0x1
[  338.642965] ata1.00: configured for UDMA/133
[  338.663246] ata1: EH complete
cat /sys/devices/virtual/thermal/thermal_zone1/temp
3000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone2/temp
1500
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone3/temp
4000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone4/temp
-1000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone5/temp
500
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone6/temp
100000
nvidia@tegra-ubuntu:~$ cat /sys/devices/virtual/thermal/thermal_zone7/temp
6200

Trumany · July 5, 2019, 2:49am

The thermal zones are within range. It is strange that the failure happened. Anyway, it doesn’t affect normal boot-up, right?

igal.kroyter · July 5, 2019, 8:00am

Trumany, hi,

It does not affect normal boot-up, but it affects accessing the SSD during temperature fluctuations, which is our system’s reality.

Any ideas how to resolve it?

Trumany · July 25, 2019, 8:07am

Currently I can only suggest trying to avoid such temperature fluctuations.

nme_0001 · October 8, 2019, 4:19pm

Hi igal.kroyter,
I am sorry, I cannot resolve your problem, but the tests you have done interest me.
Can you tell me up to what temperature you have tested the TX2i?
On my side I have a problem at 52 °C: I lose the Ethernet connection and the power consumption drops.
Note that we use only passive heat dissipation.

Thank you