Hi,
we’re using the TX2i module. While the temperature rises from -10C to 10C (at ~6C), we see a SATA failure that recovers after about 1 second and does not recur. It is consistently reproducible, however: it happens every time the module goes through the same temperature profile.
The log output follows:
[ 967.608486] ata1: exception Emask 0x10 SAct 0x0 SErr 0x5980000 action 0xe frozen
[ 967.615908] ata1: irq_stat 0x00000040, connection status changed
[ 967.621927] ata1: SError: { 10B8B Dispar LinkSeq TrStaTrns DevExch }
[ 967.628289] ata1: hard resetting link
[ 968.531677] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 968.538092] ata1.00: failed to get Identify Device Data, Emask 0x1
[ 968.544579] ata1.00: failed to get Identify Device Data, Emask 0x1
[ 968.550800] ata1.00: configured for UDMA/133
[ 968.567669] ata1: EH complete
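For anyone reading the log above, the SError value can be decoded by hand. Below is a minimal Python sketch, assuming the SATA SError bit layout and the short names that the Linux libata error handler prints; `decode_serror` and `SERROR_BITS` are names made up for this example, not part of any library. For `0x5980000` it gives the same five flags the kernel printed, all of them in the diagnostic (upper) half of the register.

```python
# Bit positions follow the SATA SError register layout; the short names are
# the labels the Linux libata error handler prints (assumed here).
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Value taken from the "SErr 0x5980000" field in the log above.
    print(decode_serror(0x5980000))
    # prints ['10B8B', 'Dispar', 'LinkSeq', 'TrStaTrns', 'DevExch']
```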
Notes
We have run this test with multiple modules and multiple SSDs and always got this failure.
We have connected a differential probe to the SATA lanes and the "eye" looks OK.
Is there a way to make it go away? Our unit is in the middle of accessing the external SSD device when this happens.
Yes, carrier board is above 0C, but the issue is at ~6C, right?
Is this failure still there at room temperature? Is the SSD at the same temperature or not? Did you test and analyze the full timing of the SATA signals during power-on?
We are actually running a burn-in test in which the Jetson TX2i is part of the system. This problem, a momentary SATA crash, occurs only when the temperature gradient is high, 10 C/minute; at lower gradients the issue does not occur (at any temperature). When the problem occurs, it is always at the same temperature.
I have attached the full dmesg; at the end of the log you can see the printed temperatures and the crash-recovery dump. These printouts come from our application, which also mounts a SATA SSD.
I know that controllers that handle high-frequency links (SATA, PCIe) have to adjust the lane configuration with temperature. Is that the case here, and is there something I can do about it?
bbasu, hi,
we are waiting for chamber time for the testing with the devkit.
Trumany, hi,
What do you mean by controlling? I did not set any threshold value, I use the default settings.
We have witnessed the failure only while the temperature rises (we did not test while the temperature drops).
The module is turned on (and completes boot) while soaked at -40C, and then the temperature is raised.
I think that, as with “all” failures that are not understood, it is a combination of a few factors: the temperature change rate and the fact that there are different bus control configurations (as I noted in #6). Again, these are my assumptions.
BTW, we have run the test again in a smaller (uncontrolled) chamber and the failure occurred at 23C.
We’ve tested the DevKit too, and it failed with the same message (see #1). During the test we set a target temperature on the chamber and read the chamber’s temperature sensor: 60C. It looks like we have the same behavior on the DevKit and the custom board. We tested it with the smaller chamber.
Is it -60C? I’m a little confused by this. Could you please clearly list the test conditions on the custom board and the DevKit board? And please print out the thermal zone values repeatedly during the test so as to capture the real chip temperature when the failure happens. You can refer to this topic to read thermal zones: [url]https://devtalk.nvidia.com/default/topic/1032887[/url]
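One possible way to log the thermal zones during a chamber run is sketched below in Python. It assumes the standard Linux sysfs layout (`/sys/class/thermal/thermal_zone*/type` and `temp`, with `temp` in millidegrees Celsius); the function names `read_thermal_zones` and `log_samples` are made up for this example, and the zone names you get depend on the board.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone type: temperature in degrees C} for every readable zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # a zone may be missing or unreadable; skip it
        readings[name] = millideg / 1000.0  # sysfs reports millidegrees C
    return readings


def log_samples(count, interval_s=1.0, base="/sys/class/thermal"):
    """Print `count` timestamped readings, one line every `interval_s` seconds."""
    for _ in range(count):
        stamp = time.strftime("%H:%M:%S")
        zones = read_thermal_zones(base)
        line = " ".join(f"{name}={temp:.1f}" for name, temp in zones.items())
        print(stamp, line, flush=True)
        time.sleep(interval_s)
```

Calling `log_samples(600)` alongside the test prints one line per second for ten minutes. The wall-clock timestamps can then be lined up against the time of the SATA exception to find the chip temperature at the moment of failure.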
Trumany, hi,
It is +60C. NOT -60C (I have updated the message #11).
Moreover, the smaller chamber does not give stable results. For example, in the following test the failure occurred when the chamber was at 44C:
time  chamber temperature [C]
10:54 -40
10:55 -5
10:56 +12
10:57 +30
10:58 +44 ** SATA error
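For reference, the per-minute gradient in this run can be worked out from the readings above. A small Python sketch, using only the values in the table (`gradients` and `SAMPLES` are names made up for this example):

```python
from datetime import datetime

# (time, chamber temperature in C) exactly as listed in the table above
SAMPLES = [
    ("10:54", -40),
    ("10:55", -5),
    ("10:56", 12),
    ("10:57", 30),
    ("10:58", 44),
]


def gradients(samples):
    """Return the temperature change in C per minute between consecutive samples."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = datetime.strptime(t1, "%H:%M") - datetime.strptime(t0, "%H:%M")
        minutes = delta.total_seconds() / 60
        rates.append((c1 - c0) / minutes)
    return rates


print(gradients(SAMPLES))  # prints [35.0, 17.0, 18.0, 14.0]
```

Every step in this run is steeper than the 10 C/minute gradient mentioned earlier in the thread.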
Hi igal.kroyter,
I am sorry, I can't resolve your problem, but the tests you have done interest me.
Can you tell me up to what temperature you have tested the TX2i?
On my side I have a problem at 52 °C: I lose the Ethernet connection and the power consumption drops.
Note that we use only passive cooling.