Orin NX can't detect NVMe storage (SD Express PCIe)

Hi! I’m using an Orin NX 16GB module with a custom carrier board that has already been in production use with the Xavier NX.

I’m using NVIDIA BSP 36.3, and the board with the Orin boots into initrd, but there it cannot find the NVMe device. This is essential, as the Orin NX doesn’t have internal eMMC.

Error: Could not stat device /dev/nvme0n1 - No such file or directory.

Using Workflow 3:
Building the image:

sudo ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh --no-flash --external-device nvme0n1p1 -c ./tools/kernel_flash/flash_l4t_t234_nvme.xml --showlogs jetson-orin-nano-devkit nvme0n1p1

full output of build:
jetson-orin-build-3.log (261.0 KB)

Flashing the device:

sudo ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh --flash-only --external-device nvme0n1p1 -c ./tools/kernel_flash/flash_l4t_t234_nvme_orb.xml --showlogs jetson-orin-nano-devkit nvme0n1p1

full output of flashing:
jetson-orin-flash-3-no-nvme.log (8 KB)

full serial output:
jetson-orin-serial-output-3.log (60.6 KB)

The PCIe controllers seem to come up, and the NVMe kernel modules are loaded. However, on the serial console you can see

Connection timeout: device /dev/nvme0n1 is still not ready.

The SD Express PCIe card is connected to PCIE1_XXX pins on Jetson, so it should not change between Xavier NX and Orin NX.
I reviewed the differences in pinout between Xavier NX and Orin NX in Jetson_OrinNano_OrinNX_XavierNX_Interface_Comparison_Migration_DA-11081-001_v1.1.pdf, and

  • SDMMC interface pins from Xavier are PCIE3 pins on Orin - so not related
  • CSI4 interface pins from Xavier are PCIE2 pins on Orin - so not related
  • There are no differences listed for the PCIE1 pins.

One change we had to make:
The devkit has an EEPROM containing info used by NVIDIA to determine the devkit version, but our carrier board does not have it. To make the board boot, we had to change the EEPROM read size from 0x100 to 0x0 in bootloader/generic/BCT/tegra234-mb2-bct-misc-p3767-0000.dts:

cvb_eeprom_read_size = <0x0>;
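
For reference, this is roughly how the change can be applied before building (a sketch; run from the top of the Linux_for_Tegra directory, and keep a backup of the original file):

cp bootloader/generic/BCT/tegra234-mb2-bct-misc-p3767-0000.dts{,.orig}
# set the CVB EEPROM read size to 0 so MB2 skips reading the absent carrier-board EEPROM
sed -i 's/cvb_eeprom_read_size = <0x100>/cvb_eeprom_read_size = <0x0>/' \
    bootloader/generic/BCT/tegra234-mb2-bct-misc-p3767-0000.dts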

Here is the PCIe/SD interface, connected to the PCIE1_XXX pins, as visible in the schematic:

I have already confirmed that the correct voltage for the SD card is present on the 3.3V pin.

Should we make changes to ODMDATA, the device tree, or the pinmux to be able to detect the NVMe?

I’d be grateful for any pointers and help.

  1. The flash command is not correct.
    Quick Start — NVIDIA Jetson Linux Developer Guide 1 documentation

  2. We haven’t validated any SD Express PCIe card before. From the log, I only see that it cannot be detected on the PCIe C1 controller. No ODMDATA change is needed; it is enabled on the Orin Nano devkit by default.

[    2.176383] tegra194-pcie 14100000.pcie: host bridge /bus@0/pcie@14100000 ranges:
[    2.176409] tegra194-pcie 14100000.pcie:      MEM 0x2080000000..0x20a7ffffff -> 0x2080000000
[    2.176419] tegra194-pcie 14100000.pcie:      MEM 0x20a8000000..0x20afffffff -> 0x0040000000
[    2.176423] tegra194-pcie 14100000.pcie:       IO 0x0030100000..0x00301fffff -> 0x0030100000
[    2.176893] tegra194-pcie 14100000.pcie: iATU unroll: enabled
[    2.176896] tegra194-pcie 14100000.pcie: Detected iATU regions: 8 outbound, 2 inbound
[    2.296968] hub 1-3:1.0: USB hub found
[    2.297584] hub 1-3:1.0: 4 ports detected
[    2.591362] usb 1-3.4: new full-speed USB device number 3 using tegra-xusb
[    3.285405] tegra194-pcie 14100000.pcie: Phy link never came up
[    4.287555] tegra194-pcie 14100000.pcie: Phy link never came up
[    4.287619] tegra194-pcie 14100000.pcie: PCI host bridge to bus 0001:00
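
For reference, the default ODMDATA for this flash target can be checked directly in the BSP (a sketch; run from the top of Linux_for_Tegra, and note that the name of the sourced .common file may differ between releases):

# show the ODMDATA used by the jetson-orin-nano-devkit target
grep -H "ODMDATA" jetson-orin-nano-devkit.conf *.conf.common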

OK, so you mean that with the default ODMDATA "gbe-uphy-config-8,hsstp-lane-map-3,hsio-uphy-config-0" we should see the NVMe detected by the PCIe C1 controller 14100000, on the PCIE1_XXX pins:

  1. What does 14100000.pcie: Phy link never came up exactly mean?
  2. Should I enable debug logs of PCIe driver, or any other info that will help?

I have tried with 2 different SD Express PCIe cards:

Both cards are detected by the Xavier NX in this carrier board, as NVMe storage connected over PCIe. They are also detected by a computer using a USB-C UHS-II card reader. What could the difference be between Xavier NX → Orin NX?
Note that I’m using the Orin Nano devkit configuration from the BSP when flashing the Orin.

  1. Flash command - you mean I should use this one from Quick Start?

Jetson Orin Nano Developer Kit (NVMe):

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit internal

It means link detection fails. The link training between the Jetson and your PCIe device is failing.

Should I enable debug logs of PCIe driver, or any other info that will help?

There is a debug tip document for it. It is a general one for both rel-35 and rel-36 PCIe debugging. Many reasons could lead to this.

https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/HR/JetsonModuleAdaptationAndBringUp/JetsonAgxOrinSeries.html?highlight=pcie#debug-pcie-link-up-failure

Flash command - you mean I should use this one from Quick Start?

Yes, just use this command to flash.

Also, you could try rel-35, because some users have mentioned that their PCIe detection is not as stable after moving from rel-35 to rel-36. I wonder if your case is related too.


OK, thank you. For completeness, I now used the command with --network usb0 (using release 36.3 - the NVMe is still not detected: 14100000.pcie: Phy link never came up):

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit internal

Flashing output:
orin-36.3-network-usb-flashing-output.log (262.4 KB)

Serial output:
orin-36.3-network-usb-serial.log (80.6 KB)

I downloaded and tried an older release. With BSP 35.2.1, I used the command for flashing Jetson Orin NX + Xavier NX Devkit (NVMe) - I also needed to change the CVB EEPROM read size to 0 beforehand:

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/t186ref/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 p3509-a02+p3767-0000 internal

The flashing output:
orin-35.2.1-flashing-output.log (231.6 KB)

Serial output:
orin-35.2.1-serial.log (89 KB)

Unfortunately, a similar error.
So now I’m going to check all debugging steps from Jetson AGX Orin Platform Adaptation and Bring-Up — NVIDIA Jetson Linux Developer Guide 1 documentation

Just a comment that may not matter.

I am not sure why you picked 35.2.1 when 35.5 is available.

If you want to compare rel-35 and rel-36, just pick the latest release from each of them, not some old version.

That’s because the previous (Xavier NX) version of the product uses 35.2.1, so it would be easier to integrate if it turned out to work; it’s also the first release with support for the Orin NX.

I don’t have high hopes, but I will test 35.5 just to have 100% coverage.

A question regarding debugging: the PCIe troubleshooting page mentions using lspci a few times. I don’t have it in the initrd. I can access some of the info manually from the /sys/bus/pci files, but it’s inconvenient, and there is some binary information that lspci would decode. Any chance to boot into Linux (without NVMe) where I would have lspci and devmem?

You can boot from a USB drive first when debugging such an issue.
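
For reference, the Quick Start also lists a USB variant of the initrd flash command; something along these lines should put the system on a USB drive instead of NVMe (a sketch derived from the NVMe command above - please double-check the exact arguments against the Quick Start page for your release):

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device sda1 \
  -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit internal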

Hi Wayne,

I have cross-compiled lspci for the initrd, and devmem was already available as busybox devmem.
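
Roughly, that came down to building pciutils for aarch64 along these lines (a sketch; the exact toolchain and options may differ from what I actually used):

# cross-compile a mostly-static lspci and copy the resulting binary into the initrd's /usr/bin
git clone https://github.com/pciutils/pciutils.git
cd pciutils
make CROSS_COMPILE=aarch64-linux-gnu- HOST=aarch64-linux \
     ZLIB=no DNS=no LIBKMOD=no HWDB=no SHARED=no LDFLAGS=-static

I have done the debugging steps: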

  1. Verify DLActive status in Root port LnkSta of lspci -vvv output. This is to check whether the link comes up by the time kernel boots to shell (for example, to confirm whether the link is taking more time to come up).

    • Full lspci -vvv output:
      orin-lspci-vvv.log (17.2 KB)
    • Bus domain 0001 does not seem to be included in the lspci output, despite nvidia,disable-power-down; being set in the node (see the sysfs check sketch after this list). Is that expected? On the running device you can see ==> /proc/device-tree/bus@0/pcie@14100000/nvidia,disable-power-down <==
      Included in lspci are only: domain 0004 → PCI node 14160000 (root port for the WiFi card), and domain 0008 → PCI node 140a0000 (root port for the Realtek Gigabit Ethernet)
    • For reference: the PCIe node 14100000 dumped from the running system with find /proc/device-tree/ -type f -exec head {} +
      orin-device-tree-with-nvidia-disable-power-down.txt (2.4 KB)
  2. Dump PADCTL_PEX_CTL_PEX_L*_CLKREQ_N_0 and PADCTL_PEX_CTL_PEX_L*_RST_N_0 pinmux values and check if settings are correct.

    • PADCTL_PEX_CTL in TRM is referred to as PADCTL_A7, which is at 0x02437000.

      • PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0 = 0x02437020
      • PADCTL_PEX_CTL_PEX_L0_RST_N_0 = 0x02437028
      • PADCTL_PEX_CTL_PEX_L1_CLKREQ_N_0 = 0x02437010
      • PADCTL_PEX_CTL_PEX_L1_RST_N_0 = 0x02437018
      bash-5.1# busybox devmem 0x02437020
      0x00000415
      bash-5.1# busybox devmem 0x02437028
      0x00000415
      bash-5.1# busybox devmem 0x02437010
      0x00000470
      bash-5.1# busybox devmem 0x02437018
      0x00000420
      
  3. Dump the PCIE_RP_APPL_DEBUG_0 register; refer to the TRM for the register address of each controller. Accessing the address of a controller that is not enabled will cause a CBB power-down error. When you share this information on the NVIDIA developer forum, it will help us determine the LTSSM state.

    • PCIE_RP_APPL_DEBUG_0 - Offset: 0xd0 (present in every PCIe instance, C0, C1 etc.)
    • PCIE_C1_CTL = 0x14100000
    • Reading 0x141000d0 gives 0xFFFFFFFF and a CBB error. For another PCIe instance that is up, like 0x141600d0, it gives a valid value, 0x00000088. This means the PCIe C1 is down.
  4. Reduce the link speed to Gen-1 and link width to x1 using device tree properties.

    • Already reduced in the device tree before the above testing: nvidia,max-speed = <0x01>; and only one lane in use: num-lanes = <1>;
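
For reference, the sysfs check mentioned in point 1 - a quick way to see whether the C1 controller probed at all, independent of lspci (a sketch; these are standard sysfs and dmesg queries, nothing Jetson-specific assumed):

ls /sys/bus/platform/devices/ | grep 14100000     # did the 14100000.pcie platform device appear?
ls /sys/bus/pci/devices/                          # is any 0001:00:00.0 root port registered?
dmesg | grep 14100000.pcie                        # probe and link messages for C1
lspci -vvv | grep 'LnkSta:'                       # DLActive state of the ports that did enumerate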

Given the above, how can I make sure that the PCIe C1 will not be powered down (for proper debugging) even though the link cannot be set up?
I have added the property nvidia,disable-power-down, which is a boolean property, so it should be enough to add the line nvidia,disable-power-down; in the PCIe node, correct?

If you are using rel-36, then nvidia,disable-power-down no longer works.

Please apply this patch instead.

OK, thank you.
I applied the patch and flashed the device; I can see that the bus stays up despite not having the link.

[    6.449208] tegra194-pcie 14100000.pcie: Disabling PCIe power down

Debug info:

  1. lspci -vvv output: there’s DLActive-. Full output:
    orin-lspci-vvv-no-power-down.log (26.8 KB)

  2. Dump PADCTL_PEX_CTL_PEX_L*_CLKREQ_N_0 and PADCTL_PEX_CTL_PEX_L*_RST_N_0 pinmux values and check if settings are correct. Same as above

  3. Dump PCIE_RP_APPL_DEBUG_0 register. PCIe stays up and we can read the real value:

busybox devmem 0x141000d0
0x00001818

which means SMLH_LTSSM_STATE has the value 0x03, which is S_POLL_COMPLIANCE according to the encoding in the Orin Technical Reference Manual (see the decode sketch after this list). What does it mean?

  4. Reduce the link speed to Gen-1 and link width to x1 using device tree properties - done already.
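
For reference, the decode of the value above (a sketch, assuming SMLH_LTSSM_STATE occupies bits [8:3] of APPL_DEBUG_0, the same field layout the pcie-tegra194 driver uses):

# extract SMLH_LTSSM_STATE from the APPL_DEBUG_0 value read above
printf '0x%02x\n' $(( (0x00001818 >> 3) & 0x3F ))   # -> 0x03 = S_POLL_COMPLIANCE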

What are the next debugging steps?

This pinmux result seems weird: 0x470 means the tristate bit is not set to passthrough.

I also dumped the result from my NV devkit, and the value is different from yours. The NV devkit showed 0x460.

sudo busybox devmem 0x02437010
0x00000460
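
If you want a quick experiment before fixing the pinmux properly in your flashed device tree, you could try writing the devkit value into that pad register at runtime and re-probing the controller (just a sketch and only a debugging aid, not a fix; if the driver does not allow unbind/bind, this will not work):

busybox devmem 0x02437010 32 0x00000460                              # write the devkit value to the PEX_L1_CLKREQ_N pad
echo 14100000.pcie > /sys/bus/platform/drivers/tegra194-pcie/unbind  # re-probe the C1 controller
echo 14100000.pcie > /sys/bus/platform/drivers/tegra194-pcie/bind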
