M.2 NVMe not detected on custom Xavier NX hardware (PCIe link is down)

We have noticed an issue very similar to Lost NVMe SSD on L4T 31.1.0 upgrade, where the M.2 slot generally does not work with NVMe. We’ve tried at least five different vendors/capacities, and the only drive we’ve found that works at all is the Samsung 960 EVO 250GB, which is detected reliably on every boot. The Samsung 970 EVO 500GB, Sabrent, Seagate, and Crucial parts are never listed in lsblk, and dmesg shows the PCIe link failing to come up (logs below).
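
For anyone reproducing this, the detection checks are just the stock Linux tools (the 144d:a804 device ID mentioned below is taken from the successful dmesg later in this post):

lsblk                          # a detected drive shows up as nvme0n1
dmesg | grep -iE 'pcie|nvme'   # link-training and NVMe probe messages
lspci -nn                      # enumerated PCIe functions, e.g. 144d:a804 for the working Samsung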

We suspect a carrier-board hardware issue rather than a purely software one, since we can swap the same SOM (and eMMC software) between our carrier board and an Xavier NX developer kit and see different results. On the Xavier NX devkit, every NVMe drive we’ve tried is always detected by lsblk and in the PCIe dmesg output. On our carrier board, the only NVMe ever detected in lsblk or the PCIe dmesg output is the Samsung 960 EVO 250GB. We’ve also ruled out our software changes by reproducing the same behavior with unmodified L4T R32.4.3.

Here are the PCIe-related dmesg prints in a failure case:

[    0.889722] iommu: Adding device 14160000.pcie to group 0
[    0.890546] iommu: Adding device 141a0000.pcie to group 1
[    3.169533] ehci-pci: EHCI PCI platform driver
[    3.171974] ohci-pci: OHCI PCI platform driver
[   13.426699] tegra-pcie-dw 14160000.pcie: Setting init speed to max speed
[   13.434522] OF: PCI: host bridge /pcie@14160000 ranges:
[   13.970447] tegra-pcie-dw 14160000.pcie: link is down
[   13.970777] tegra-pcie-dw 14160000.pcie: PCI host bridge to bus 0004:00
[   13.970898] pci_bus 0004:00: root bus resource [bus 00-ff]
[   13.970998] pci_bus 0004:00: root bus resource [io  0x0000-0xfffff] (bus address [0x36100000-0x361fffff])
[   13.971160] pci_bus 0004:00: root bus resource [mem 0x1740000000-0x17ffffffff] (bus address [0x40000000-0xffffffff])
[   13.971335] pci_bus 0004:00: root bus resource [mem 0x1400000000-0x173fffffff pref]
[   13.971495] pci 0004:00:00.0: [10de:1ad1] type 01 class 0x060400
[   13.971641] pci 0004:00:00.0: PME# supported from D0 D3hot D3cold
[   13.972346] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[   13.972467] pci 0004:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[   13.972881] pcieport 0004:00:00.0: Signaling PME through PCIe PME interrupt
[   13.973005] pcie_pme 0004:00:00.0:pcie001: service driver pcie_pme loaded
[   13.973122] aer 0004:00:00.0:pcie002: service driver aer loaded
[   13.973326] pcie_pme 0004:00:00.0:pcie001: unloading service driver pcie_pme
[   13.973443] aer 0004:00:00.0:pcie002: unloading service driver aer
[   13.973526] pci_bus 0004:01: busn_res: [bus 01-ff] is released
[   13.973838] pci_bus 0004:00: busn_res: [bus 00-ff] is released
[   13.975393] tegra-pcie-dw 14160000.pcie: PCIe link is not up...!
[   13.976229] tegra-pcie-dw 141a0000.pcie: Setting init speed to max speed
[   13.977318] OF: PCI: host bridge /pcie@141a0000 ranges:
[   14.506545] tegra-pcie-dw 141a0000.pcie: link is down
[   14.506878] tegra-pcie-dw 141a0000.pcie: PCI host bridge to bus 0005:00
[   14.507003] pci_bus 0005:00: root bus resource [bus 00-ff]
[   14.507105] pci_bus 0005:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x3a100000-0x3a1fffff])
[   14.507293] pci_bus 0005:00: root bus resource [mem 0x1f40000000-0x1fffffffff] (bus address [0x40000000-0xffffffff])
[   14.507494] pci_bus 0005:00: root bus resource [mem 0x1c00000000-0x1f3fffffff pref]
[   14.507646] pci 0005:00:00.0: [10de:1ad0] type 01 class 0x060400
[   14.507795] pci 0005:00:00.0: PME# supported from D0 D3hot D3cold
[   14.508467] pci 0005:00:00.0: PCI bridge to [bus 01-ff]
[   14.508576] pci 0005:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[   14.508946] pcieport 0005:00:00.0: Signaling PME through PCIe PME interrupt
[   14.509070] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
[   14.509171] aer 0005:00:00.0:pcie002: service driver aer loaded
[   14.509342] pcie_pme 0005:00:00.0:pcie001: unloading service driver pcie_pme
[   14.509463] aer 0005:00:00.0:pcie002: unloading service driver aer
[   14.509552] pci_bus 0005:01: busn_res: [bus 01-ff] is released
[   14.509897] pci_bus 0005:00: busn_res: [bus 00-ff] is released
[   14.511440] tegra-pcie-dw 141a0000.pcie: PCIe link is not up...!
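
As an aside: for faster iteration it should be possible to retry link training without a full reboot, by rebinding the controller’s platform driver and rescanning (a sketch, assuming the driver exposes the usual sysfs bind/unbind attributes under the tegra-pcie-dw name seen above):

echo 141a0000.pcie | sudo tee /sys/bus/platform/drivers/tegra-pcie-dw/unbind   # release the failed controller
echo 141a0000.pcie | sudo tee /sys/bus/platform/drivers/tegra-pcie-dw/bind     # re-probe and retry link training
echo 1 | sudo tee /sys/bus/pci/rescan                                          # re-enumerate any functions that came up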

Here are the PCIe-related dmesg prints in the successful case (Samsung 960 EVO 250GB):

[    0.889797] iommu: Adding device 14160000.pcie to group 0
[    0.890685] iommu: Adding device 141a0000.pcie to group 1
[    3.173827] ehci-pci: EHCI PCI platform driver
[    3.175919] ohci-pci: OHCI PCI platform driver
[   13.510406] tegra-pcie-dw 14160000.pcie: Setting init speed to max speed
[   13.518290] OF: PCI: host bridge /pcie@14160000 ranges:
[   14.054018] tegra-pcie-dw 14160000.pcie: link is down
[   14.054350] tegra-pcie-dw 14160000.pcie: PCI host bridge to bus 0004:00
[   14.054471] pci_bus 0004:00: root bus resource [bus 00-ff]
[   14.054570] pci_bus 0004:00: root bus resource [io  0x0000-0xfffff] (bus address [0x36100000-0x361fffff])
[   14.054732] pci_bus 0004:00: root bus resource [mem 0x1740000000-0x17ffffffff] (bus address [0x40000000-0xffffffff])
[   14.054908] pci_bus 0004:00: root bus resource [mem 0x1400000000-0x173fffffff pref]
[   14.055073] pci 0004:00:00.0: [10de:1ad1] type 01 class 0x060400
[   14.055226] pci 0004:00:00.0: PME# supported from D0 D3hot D3cold
[   14.055922] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[   14.056031] pci 0004:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[   14.056448] pcieport 0004:00:00.0: Signaling PME through PCIe PME interrupt
[   14.056571] pcie_pme 0004:00:00.0:pcie001: service driver pcie_pme loaded
[   14.056672] aer 0004:00:00.0:pcie002: service driver aer loaded
[   14.056857] pcie_pme 0004:00:00.0:pcie001: unloading service driver pcie_pme
[   14.056969] aer 0004:00:00.0:pcie002: unloading service driver aer
[   14.057054] pci_bus 0004:01: busn_res: [bus 01-ff] is released
[   14.057285] pci_bus 0004:00: busn_res: [bus 00-ff] is released
[   14.058872] tegra-pcie-dw 14160000.pcie: PCIe link is not up...!
[   14.060048] tegra-pcie-dw 141a0000.pcie: Setting init speed to max speed
[   14.061198] OF: PCI: host bridge /pcie@141a0000 ranges:
[   14.185805] tegra-pcie-dw 141a0000.pcie: link is up
[   14.186249] tegra-pcie-dw 141a0000.pcie: PCI host bridge to bus 0005:00
[   14.186396] pci_bus 0005:00: root bus resource [bus 00-ff]
[   14.186493] pci_bus 0005:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x3a100000-0x3a1fffff])
[   14.186656] pci_bus 0005:00: root bus resource [mem 0x1f40000000-0x1fffffffff] (bus address [0x40000000-0xffffffff])
[   14.186852] pci_bus 0005:00: root bus resource [mem 0x1c00000000-0x1f3fffffff pref]
[   14.187003] pci 0005:00:00.0: [10de:1ad0] type 01 class 0x060400
[   14.187142] pci 0005:00:00.0: PME# supported from D0 D3hot D3cold
[   14.187830] pci 0005:01:00.0: [144d:a804] type 00 class 0x010802
[   14.187943] pci 0005:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[   14.198581] pci 0005:00:00.0: BAR 14: assigned [mem 0x1f40000000-0x1f400fffff]
[   14.198738] pci 0005:01:00.0: BAR 0: assigned [mem 0x1f40000000-0x1f40003fff 64bit]
[   14.198963] pci 0005:00:00.0: PCI bridge to [bus 01-ff]
[   14.199080] pci 0005:00:00.0:   bridge window [mem 0x1f40000000-0x1f400fffff]
[   14.199233] pci 0005:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[   14.199446] pci 0005:01:00.0: Max Payload Size set to  256/ 256 (was  128), Max Read Rq  512
[   14.199793] pcieport 0005:00:00.0: Signaling PME through PCIe PME interrupt
[   14.202120] pci 0005:01:00.0: Signaling PME through PCIe PME interrupt
[   14.208492] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
[   14.208600] aer 0005:00:00.0:pcie002: service driver aer loaded
[   14.208962] nvme nvme0: pci function 0005:01:00.0

We’ve compared the pinout of the M.2 slot on the Xavier NX dev board and on our hardware, and I don’t see anything that could explain this. The only notable difference is the CONFIG pins, which are grounded on the Xavier NX dev board, left floating on our hardware, and which the PCIe M.2 mechanical spec suggests should be pulled up. However, these pins should be outputs from the module, so I don’t understand how this could be related.

The post at no PCIe link with some devices - #11 by parafin also looks similar, but that issue should already be resolved in the JetPack 4.4 release we are using, and the patches described there shouldn’t be related to the devkit hardware.

My suspicion is that power-supply sequencing or some other power-related issue is causing this, since that’s pretty much the only thing left. I’ve noticed the Xavier NX devkit uses SYS_RESET to drive BUCK_3V3_EN and enable the 3.3V supply. Our carrier board is missing this logic, and the 3.3V rail rises independently of the SYS_RESET output, which does not match the description in section 5.1 of the Xavier NX design guide. We tried to test this by replacing our 3.3V supply with a benchtop supply and waiting a short delay before bringing up the 3.3V rail, but we still saw the same behavior.

Any suggestions or troubleshooting ideas are appreciated.

After a board rework, I think we’ve eliminated the power-supply differences as the cause. We are now following section 5.1 of the Xavier NX design guide but still see exactly the same NVMe behavior.

In case it helps anyone else: we’ve since noticed that PCIe lanes 1 and 3 are swapped in our layout. I’m not sure why or how this still works with the Samsung 960 EVO. We haven’t yet proven that fixing the swap solves the problem with other NVMe drives, but I strongly suspect it will.
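
One way to probe why the 960 EVO still links up despite the swap would be to check what width and speed it actually negotiates on our board versus the devkit (standard lspci; the 0005:01:00.0 address is taken from the dmesg above):

sudo lspci -vv -s 0005:01:00.0 | grep -iE 'LnkCap|LnkSta'   # training at only x1/x2 instead of x4 would point at a lane-level problem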

We are experiencing similar problems, although in our case they appear on a custom carrier board with an ASM3142 chip using two PCIe lanes of the C5 controller. The dmesg output is essentially identical to yours.

In our case, an ASM2142-based PCIe device is detected fine when installed in the Key M socket of the Xavier NX evaluation board.

Could you clarify: do you mean there could be an error in the NVIDIA documentation regarding lanes 1 and 3, or do you only suspect a human error in your own layout?


I believe it was human error in our layout; at least, if there is a problem in the NVIDIA documentation, I haven’t found it yet. If you match your PCIe lane assignment to the developer kit carrier board schematic (P3509_A01_OrCad_Schematic, dated 27-Nov-2019), you should have the correct assignment for lanes 1 and 3.