NVMe SSD probe fails randomly on Jetson Nano 2GB module (B01)

Dear NVIDIA,

I am trying to mount an SSD on my custom carrier board for the Jetson Nano. I am using a 2GB module (B01) for testing.
The two NVMe SSDs I used are both from Western Digital (models SN750 and SN530), and both support PCIe Gen3 x4. Both work fine on my PC.

My custom carrier board exposes PCIe #0 of the Jetson module for the SSD connection (through an M.2 connector). After power-up, when the Jetson module probes the NVMe device, the probe fails randomly with the following error messages:

[    1.451225] tegra-pcie 1003000.pcie: PCIE: Response decoding error, signature: 11021005
[    1.459364] nvme nvme0: Minimum device page size 134217728 too large for host (4096)

The NVMe device does not show up after the boot process completes.

However, it does succeed sometimes. When the probe succeeds, I can see the NVMe device and it operates correctly as expected. The ratio of successes to failures is about 50:50, and I have not found any other pattern yet.

I tested both SSDs and they behaved the same way. I am not sure what causes this. Could it be a hardware timing issue on the PCIe bus, a compatibility problem between the Jetson module and the SSD controller, or perhaps an unstable power supply?

I hope you can point me in a direction to work this out.

Below is some diagnostic information that may help:

roland@nano1:~$ dmesg | grep nvme
[    1.447772] nvme nvme0: pci function 0000:01:00.0
[    1.447801] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[    1.456391] nvme nvme0: Minimum device page size 134217728 too large for host (4096)
[    1.464187] nvme nvme0: Removing after probe failure status: -19

and,

roland@nano1:~$ lspci -vvvv
.....
01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5007 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Sandisk Corp Device 5007
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 82
        Region 0: Memory at 13000000 (64-bit, non-prefetchable) [size=16K]
        Region 4: Memory at 13004000 (64-bit, non-prefetchable) [size=256]
        Capabilities: <access denied>

and

roland@nano1:~$ dmesg | grep pcie
[    1.010948] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.3, lane = pcie-0, function = pcie-x1
[    1.011037] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.4, lane = pcie-1, function = pcie-x4
[    1.011129] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.5, lane = pcie-2, function = pcie-x4
[    1.011213] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.6, lane = pcie-3, function = pcie-x4
[    1.011304] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.7, lane = pcie-4, function = pcie-x4
[    1.011410] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.8, lane = pcie-5, function = xusb
[    1.011498] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.9, lane = pcie-6, function = xusb
[    1.021907] tegra-pcie 1003000.pcie: 4x1, 1x1 configuration
[    1.021952] tegra-pcie 1003000.pcie: PHY get deferred: -517
[    1.021958] tegra-pcie 1003000.pcie: failed to get PHYs: -517
[    1.393290] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.3, lane = pcie-0, function = pcie-x1
[    1.393352] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.4, lane = pcie-1, function = pcie-x4
[    1.393422] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.5, lane = pcie-2, function = pcie-x4
[    1.393481] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.6, lane = pcie-3, function = pcie-x4
[    1.393535] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.7, lane = pcie-4, function = pcie-x4
[    1.393591] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.8, lane = pcie-5, function = xusb
[    1.393645] tegra-xusb-padctl 7009f000.xusb_padctl: dev = phy-pcie.9, lane = pcie-6, function = xusb
[    1.395075] tegra-pcie 1003000.pcie: 4x1, 1x1 configuration
[    1.397187] tegra-pcie 1003000.pcie: PCIE: Enable power rails
[    1.398527] tegra-pcie 1003000.pcie: probing port 0, using 4 lanes
[    1.401467] tegra-pcie 1003000.pcie: probing port 1, using 1 lanes
[    1.431096] tegra-pcie 1003000.pcie: PCI host bridge to bus 0000:00
[    1.458369] pcieport 0000:00:01.0: Signaling PME through PCIe PME interrupt
[    1.458378] pcie_pme 0000:00:01.0:pcie001: service driver pcie_pme loaded
[    1.458471] aer 0000:00:01.0:pcie002: service driver aer loaded
[    1.458631] pcieport 0000:00:02.0: Signaling PME through PCIe PME interrupt
[    1.458639] pcie_pme 0000:00:02.0:pcie001: service driver pcie_pme loaded
[    1.458708] aer 0000:00:02.0:pcie002: service driver aer loaded

Try this patch and see if it helps.


diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 60958d5..385ae3f1 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -5,7 +5,7 @@
  * Author: Mike Rapoport <mike@compulab.co.il>
  *
  * Based on NVIDIA PCIe driver
- * Copyright (c) 2008-2018, NVIDIA Corporation. All rights reserved.
+ * Copyright (c) 2008-2022, NVIDIA Corporation. All rights reserved.
  *
  * Bits taken from arch/arm/mach-dove/pcie.c
  *
@@ -635,6 +635,7 @@
 		      (PCI_FUNC(devfn) << 8) | (where & 0xff);
 		addr = (val & (SZ_4K - 1)) + addr;
 		val = val & ~(SZ_4K - 1);
+		afi_writel(pcie, SZ_4K >> 12, AFI_AXI_BAR0_SZ);
 		afi_writel(pcie, pcie->cs->start - val, AFI_AXI_BAR0_START);
 		afi_writel(pcie, (val + SZ_4K) >> 12, AFI_AXI_BAR0_SZ);
 	}

BTW, please check the forum board name before you file a topic. You posted a Jetson Nano question on the AGX Orin forum. Let me move it to the correct forum.

Thank you @WayneWWW for moving the post and for the guidance.

I will try it out.
