Tegra not recognizing Hard Drive

We are using the NVIDIA Jetson TK1 Tegra board. It is running Ubuntu 14.04.1 LTS with a 3.10.40-gdacac96 armv7l kernel. We are using CUDA 6.5 as provided with the drivers in the Tegra R21.4.0 package. The drive is a Samsung 850 EVO SSD connected via SATA.

A number of our systems that use this Tegra board intermittently fail to detect the attached SSD, while other systems have yet to show the problem. The failures have been unpredictable and erratic.

We have found other posts here from people who seem to have the same issue as us.

https://devtalk.nvidia.com/default/topic/830349/jetson-tk1-and-sata-drive-issue/

What we have tried from this post:

  • Installed the latest release, R21.5; the issue persists.
  • Swapped the drive for another brand; the issue persists.
  • Updated to a later Linux kernel; the issue persists.
  • Checked the SSD firmware; it is already at the latest version.

Out of 10 Tegra boards in our possession, 3 exhibit this behavior. That seems like a rather high failure rate.

A clip from dmesg:

[    9.310555] ata1.00: qc timeout (cmd 0xec)
[    9.316488] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   19.323506] ata1: softreset failed (1st FIS failed)
[   24.841505] ata1: link is slow to respond, please be patient (ready=0)
[   29.340503] ata1: softreset failed (device not ready)
[   29.809541] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   29.820687] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[   29.831734] ata1: limiting SATA link speed to 1.5 Gbps
[   44.819500] ata1: softreset failed (1st FIS failed)
[   45.288515] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   55.299529] ata1.00: qc timeout (cmd 0xec)
[   55.308555] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   55.779535] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
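
For anyone reading the log above, the hex fields can be decoded by hand. The following is a minimal sketch, not anything taken from the NVIDIA driver. The function and table names are made up for illustration, the SStatus bit layout is the standard SATA one (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8), and the error-mask names are the AC_ERR_* bits as listed in the kernel file include/linux/libata.h, which is worth double-checking against your own kernel tree.

```python
# Field meanings for the SATA SStatus register (values printed in hex by libata).
DET = {0: "no device, no PHY communication",
       1: "device present, PHY communication not established",
       3: "device present, PHY communication established",
       4: "PHY offline"}
SPD = {0: "no negotiated speed", 1: "Gen1 (1.5 Gbps)",
       2: "Gen2 (3.0 Gbps)", 3: "Gen3 (6.0 Gbps)"}
IPM = {0: "not present / no communication", 1: "active",
       2: "partial", 6: "slumber"}

# libata completion-error bits (AC_ERR_* in include/linux/libata.h).
ERR_BITS = {0x001: "AC_ERR_DEV", 0x002: "AC_ERR_HSM",
            0x004: "AC_ERR_TIMEOUT", 0x008: "AC_ERR_MEDIA",
            0x010: "AC_ERR_ATA_BUS", 0x020: "AC_ERR_HOST_BUS",
            0x040: "AC_ERR_SYSTEM", 0x080: "AC_ERR_INVALID",
            0x100: "AC_ERR_OTHER"}


def decode_sstatus(hex_value):
    """Split an SStatus value (hex string as shown in dmesg) into its fields."""
    raw = int(hex_value, 16)
    return {"det": raw & 0xF, "spd": (raw >> 4) & 0xF, "ipm": (raw >> 8) & 0xF}


def decode_err_mask(mask):
    """Return the names of all error bits set in an err_mask value."""
    return [name for bit, name in sorted(ERR_BITS.items()) if mask & bit]


# The two SStatus values and two err_mask values that appear in the log.
for value in ("123", "113"):
    fields = decode_sstatus(value)
    print(value, "->", DET.get(fields["det"]), "|",
          SPD.get(fields["spd"]), "|", IPM.get(fields["ipm"]))
for mask in (0x4, 0x100):
    print(hex(mask), "->", decode_err_mask(mask))
```

Read this way, the log shows the link itself coming up (device present, active) first at 3.0 Gbps and then at 1.5 Gbps after libata limits the speed, while the IDENTIFY DEVICE command (0xec) either times out (err_mask=0x4) or fails with an unclassified error (err_mask=0x100).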

Would it be possible to swap a drive and the cable used with that drive from a unit exhibiting the behavior with a drive and cable from a unit which does not exhibit that behavior, and see if the problem follows?

I don’t know if this actually applies to SATA, but you might try an experiment: edit the kernel file “arch/arm/mach-tegra/tegra12_clock.h”. Change line 29 from “#define USE_PLLE_SS 1” to:

#ifdef USE_PLLE_SS
#undef USE_PLLE_SS
#endif /* USE_PLLE_SS */

I have not done this the way you are describing, but I have connected a completely new SSD (both from a different manufacturer and from the same one) and a completely new cable to the same unit, and the problem still persists. It may be a challenge to get the same setup as you describe, but it’s something I’ll try.

I am unable to find tegra12_clock.h in the path “/usr/src/linux-headers-3.10.40-gdacac96/arch/arm/mach-tegra/” for R21.4 and kernel 3.10.40 as you suggest. I will work on getting everything to the latest version so that we can be on the same page.

Edit: The filename is clock.h, I think.

Edit 2: I’m told the thread mentioned above suggested that R21.4 solves the issue, but it did not. As far as we have tested, R21.5 does not solve it either.

You will need the full kernel source, but yes, the correct file is clock.h. Apparently my brain had a bit of data corruption. The actual user of the define is in tegra12_clocks.c, and I had both files open while I was looking at it.

A number of SSD issues were fixed, but there may be more issues beyond what was already fixed. My suspicion is that when a drive or PCIe device sometimes works and sometimes does not, and yet works on other machines, it might be a signal quality issue which spread spectrum pushes from working to marginal and only sometimes working. On a desktop PC this would be enabled or disabled in the BIOS, and is normally not enabled. Desktop PC overclockers would never enable it because it limits top-end performance. The reason for enabling spread spectrum on a computer would be to avoid generating as much noise (e.g., to avoid noise on audio equipment).

Just an update: we are working today to try the kernel modification you suggested.

We’ve gone through both hardware and software checks (essentially, the hardware guys are blaming the software, and the software guys are blaming the hardware). It looks like a power failure to the software guys, but hardware checks show nothing out of the ordinary.

A caveat I forgot to mention: these are modified boards. The audio has been removed, wires replace the ethernet adapter and switch/status pinout, etc. We have done this before; however, we have not done it with solid state drives attached.

As for the kernel modification and building, we are following this tutorial from two years ago with some changes, of course: https://devtalk.nvidia.com/default/topic/762653/-howto-build-own-kernel-for-jetson-tk1/

Once we feel like we have the process down, I’ll post anything that differs from that process (the kernel version and some menuconfig options, for example).

This is just a general comment, but changing any of the traces could have an effect on signal quality even if the changes are basically correct. In that case it would basically still be the same issue, since spread spectrum has a harder time when signal quality is marginal. It would be interesting to see a very high quality view of the signals from a high-end oscilloscope both with and without spread spectrum, and then with and without spread spectrum on a board with the modifications you mention. If this is the case, then both your hardware guys and software guys are correct at the same time (you must have quantum engineers! :P)

After making the change, we also made one further change:

  • tegra12_clocks.c: (4221) #if => #ifdef

So I’ve finally got the kernel changes in, and I’m still getting the SATA failure. No luck there.

Edit: We’ve also tested combinations of different drives, cables, power sources and Tegras: drive powered from a bad Tegra with its SATA cable to a good Tegra (works), drive powered from a good Tegra with its SATA cable to a bad Tegra (fails), and drive powered from a good Tegra with its SATA cable to a good Tegra (works). Replacing the cables seems to do no good.

I should add that it does seem to fail less often with the kernel code change, but it might just be in my mind. Yeah, it’s possible we’re all going out of our minds here.

I am working with OrrinJelo on this issue, and we still do not have a resolution.

Power supply investigation results: we used an oscilloscope to compare Jetson boards that have no SATA SSD recognition issues with ones that do, and found no detectable differences. Rise times, noise/ripple, voltage levels, sequencing, etc. on the supplies are indistinguishable.

We have followed this thread and recompiled the kernel as suggested, and saw no improvement.

30% of the Tegra boards exhibit this problem.

Any suggestions?

SSD failure is still intermittent on the same Tegra boards.

On the Tegras where we have installed the kernel fix (clock.h changes), we might be seeing a kernel panic once in a while. I’m thinking of rolling back to the old kernel tomorrow.

So it looks like we’re back to square one, knowing that it’s not a cable issue, power issue, or drive issue. Any other thoughts on where to go from here?

It’s interesting because the issue seems to track with specific Jetsons, rather than tracking with the specific drive (and not all Jetsons have the issue, although a high percentage do). On the Jetsons which fail, do regular SATA drives have any issue, or is it just SSD drives?

Hi OrrinJelo,

Can you try the patch below and tell us if it fixes the issue?

diff --git a/drivers/ata/ahci-tegra.c b/drivers/ata/ahci-tegra.c
index a80a8b1..4725798 100644
--- a/drivers/ata/ahci-tegra.c
+++ b/drivers/ata/ahci-tegra.c
@@ -1081,12 +1081,10 @@ static int tegra_ahci_controller_init(struct tegra_ahci_host_priv *tegra_hpriv,
 	val &= ~NVA2SATA_OOB_ON_POR_MASK;
 	misc_writel(val, SATA_AUX_MISC_CNTL_1_REG);
 
-	if (tegra_hpriv->sata_connector != MINI_SATA) {
-		/* Disable DEVSLP Feature */
-		val = misc_readl(SATA_AUX_MISC_CNTL_1_REG);
-		val &= ~SDS_SUPPORT;
-		misc_writel(val, SATA_AUX_MISC_CNTL_1_REG);
-	}
+	/* Disable DEVSLP Feature */
+	val = misc_readl(SATA_AUX_MISC_CNTL_1_REG);
+	val &= ~SDS_SUPPORT;
+	misc_writel(val, SATA_AUX_MISC_CNTL_1_REG);
 
 	val = sata_readl(SATA_CONFIGURATION_0_OFFSET);
 	val |= EN_FPCI;

Edit: Please try the above change on R21.5

Thanks & Regards
Preetham

So I think we have finally concluded that neither the power to the drive, the SSD, the SATA cable, nor our modifications are causing this issue. We tried a mechanical hard drive and it failed on the same units in question.

We ordered several more TK1s and found one (unmodified) that acted similarly. We did the same tests with the SSD and mechanical SATA drives, swapping out cables, power cables, etc.

I am currently working on getting the above patch installed. We’ll see if that does it.

I thought I had updated this thread with the results of the patch; it turns out I hadn’t.

The SSD issues persist with that kernel patch. No go there as well.

We do have some fresh out-of-the-box Tegras, at least one of which seems to exhibit the same SSD/SATA issue of not recognizing the drive now and then. Nothing has been observed on the other units.

So the 3 boards that were failing earlier are no longer failing with the new patch, and it is only failing on the new units that you received? Or is it that the 3 earlier boards are not failing even without the above patch?

When the failing unit boots successfully, can you please attach the SATA register dumps? You can obtain them as below:

cat /sys/kernel/debug/tegra_ahci

Also, please attach the complete UART log for the failure case with and without the above patch.

Is it possible to collect and share a LeCroy trace? That would help us analyze exactly why it is failing.

thanks.
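
The register dump and the ATA-related kernel messages requested above could be gathered in one go with something like the following. This is a rough sketch with hypothetical helper names. It assumes debugfs is mounted, that the script runs as root, and that the dump lives at the path given above; it also assumes the kernel log uses the bracketed timestamp format seen earlier in this thread.

```python
import re
import subprocess

AHCI_DEBUGFS = "/sys/kernel/debug/tegra_ahci"  # path quoted in the thread

# Matches lines such as "[    9.310555] ata1.00: qc timeout (cmd 0xec)".
ATA_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+ata\d+(\.\d+)?:")


def filter_ata_lines(log_text):
    """Keep only the libata messages from a kernel log."""
    return [line for line in log_text.splitlines() if ATA_LINE.match(line)]


def read_register_dump(path=AHCI_DEBUGFS):
    """Return the debugfs dump, or None if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return None


def read_kernel_log():
    """Return the current kernel ring buffer via the dmesg command."""
    return subprocess.check_output(["dmesg"], universal_newlines=True)


def build_report(log_text, register_dump):
    """Combine the ATA log lines and the register dump into one text report."""
    parts = ["== ata messages =="] + filter_ata_lines(log_text)
    parts.append("== tegra_ahci registers ==")
    parts.append(register_dump if register_dump is not None else "(not available)")
    return "\n".join(parts) + "\n"
```

On a board, `print(build_report(read_kernel_log(), read_register_dump()))` would produce a report to attach. Note that dmesg only covers the kernel ring buffer, so it does not replace the full UART log.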

The 3 boards that were failing earlier are still failing with the patch. Of the new units, we have detected one that has the same issue.

I’ll try to get those things done and posted tomorrow.

Hi OrrinJelo,
Was this problem solved later? I have also encountered it recently.

As far as I know, we were not able to solve the problem for the units that failed. We kept tabs on the units that succeeded and failed, and used only those that succeeded.

Edit: FYI, we have since moved on to using the TK1 and TX2i modules, incorporating them onto in-house-designed boards. These work fine. The above issue was only seen with TK1 dev boards, it seems.
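
For keeping tabs on which units detect the drive and which do not, a per-boot check along these lines could be run from a startup script. It is a sketch under stated assumptions, not the procedure actually used in this thread: the helper names, the board ID and the log path are hypothetical, and it assumes the SATA drive is the only device that shows up as sd* (the eMMC normally appears as mmcblk*, but USB storage would also appear as sd*).

```python
import os
import time


def sata_disks(sys_block="/sys/block"):
    """List sd* block devices, assumed here to be SATA drives."""
    try:
        return sorted(name for name in os.listdir(sys_block)
                      if name.startswith("sd"))
    except OSError:
        return []


def log_boot_result(log_path, board_id, disks, now=None):
    """Append one line per boot recording whether a drive was detected."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    status = "detected" if disks else "MISSING"
    line = "{} board={} sata={} {}\n".format(stamp, board_id, status,
                                             ",".join(disks))
    with open(log_path, "a") as handle:
        handle.write(line)
    return line


def failure_rate(log_path):
    """Fraction of logged boots on which no drive was detected."""
    with open(log_path) as handle:
        lines = [line for line in handle if "sata=" in line]
    if not lines:
        return 0.0
    missing = sum(1 for line in lines if "sata=MISSING" in line)
    return missing / float(len(lines))
```

Calling `log_boot_result("/var/log/sata_boot.log", "board-03", sata_disks())` once per boot builds up a history per unit, and `failure_rate` then gives a number to compare across boards.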

Hi OrrinJelo,
Thanks for your reply!
Have you used the TK1 chipset on boards of your own design?
We have also designed the TK1 chipset into our product, but unfortunately about 5% of the TK1-based units in mass production intermittently fail to detect the HDD.

Unfortunately I can’t verify your issue. We have since moved on to using eMMC instead of HDD in our products. I am not directly involved in the manufacturing process or board bring-up, so I’m not sure what the failure rate is on that end.