Orin NX can't detect NVMe storage (SD Express PCIe)

Hi! I’m using an Orin NX 16GB module with a custom carrier board that is already in production use with Xavier NX.

I’m using Nvidia BSP 36.3. The board with Orin boots into the initrd, but the initrd cannot find the NVMe device. NVMe storage is essential because the Orin module has no internal eMMC.

Error: Could not stat device /dev/nvme0n1 - No such file or directory.

Using Workflow 3:
Building the image:

sudo ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh --no-flash --external-device nvme0n1p1 -c ./tools/kernel_flash/flash_l4t_t234_nvme.xml --showlogs jetson-orin-nano-devkit nvme0n1p1

full output of build:
jetson-orin-build-3.log (261.0 KB)

Flashing the device:

sudo ADDITIONAL_DTB_OVERLAY_OPT="BootOrderNvme.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh --flash-only --external-device nvme0n1p1 -c ./tools/kernel_flash/flash_l4t_t234_nvme_orb.xml --showlogs jetson-orin-nano-devkit nvme0n1p1

full output of flashing:
jetson-orin-flash-3-no-nvme.log (8 KB)

full serial output:
jetson-orin-serial-output-3.log (60.6 KB)

The PCIe controllers seem to come up, and the NVMe kernel modules are loaded. However on serial you can see

Connection timeout: device /dev/nvme0n1 is still not ready.

The SD Express PCIe card is connected to PCIE1_XXX pins on Jetson, so it should not change between Xavier NX and Orin NX.
I reviewed the differences in pinout between Xavier NX and Orin NX in Jetson_OrinNano_OrinNX_XavierNX_Interface_Comparison_Migration_DA-11081-001_v1.1.pdf, and

  • SDMMC interface pins from Xavier are PCIE3 pins on Orin - so not related
  • CSI4 interface pins from Xavier are PCIE2 pins on Orin - so not related

There are no differences listed for PCIE1 pins.

One change we had to make:
The devkit has an EEPROM with board information that Nvidia uses to determine the devkit version, but our carrier board does not have one. To make the board boot, we had to change the EEPROM size from 0x100 to 0x0 in bootloader/generic/BCT/tegra234-mb2-bct-misc-p3767-0000.dts:

cvb_eeprom_read_size = <0x0>

Here is the PCIe/SD interface, connected to the PCIE1_XXX pins, as shown in the schematic:

Already confirmed that the correct voltage for SD card is present on the 3.3V pin.

Should we make changes to ODMDATA, device tree or pinmux to be able to detect the NVMe?

I’ll be grateful for all pointers and help.

  1. The flash command is not correct.
    Quick Start — NVIDIA Jetson Linux Developer Guide 1 documentation

  2. We have not validated any SD Express PCIe card before. From the log, I can only see that the card is not detected on the PCIe C1 controller. No ODMDATA change is needed. C1 is enabled on the Orin Nano devkit by default.

[    2.176383] tegra194-pcie 14100000.pcie: host bridge /bus@0/pcie@14100000 ranges:
[    2.176409] tegra194-pcie 14100000.pcie:      MEM 0x2080000000..0x20a7ffffff -> 0x2080000000
[    2.176419] tegra194-pcie 14100000.pcie:      MEM 0x20a8000000..0x20afffffff -> 0x0040000000
[    2.176423] tegra194-pcie 14100000.pcie:       IO 0x0030100000..0x00301fffff -> 0x0030100000
[    2.176893] tegra194-pcie 14100000.pcie: iATU unroll: enabled
[    2.176896] tegra194-pcie 14100000.pcie: Detected iATU regions: 8 outbound, 2 inbound
[    2.296968] hub 1-3:1.0: USB hub found
[    2.297584] hub 1-3:1.0: 4 ports detected
[    2.591362] usb 1-3.4: new full-speed USB device number 3 using tegra-xusb
[    3.285405] tegra194-pcie 14100000.pcie: Phy link never came up
[    4.287555] tegra194-pcie 14100000.pcie: Phy link never came up
[    4.287619] tegra194-pcie 14100000.pcie: PCI host bridge to bus 0001:00
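
When several serial logs are being compared, the link result for each controller can be pulled out with a short filter. The following is a minimal sketch, not an NVIDIA tool: the function name pcie_link_summary is made up for illustration, and it matches only the two message forms that appear in this thread ("Link up" and "Phy link never came up").

```shell
#!/bin/sh
# Summarise the PCIe link result per controller from a kernel or serial log.
# Reads the log on stdin. If a controller logs more than one matching
# message, the last one wins.
pcie_link_summary() {
    awk '
        /tegra194-pcie [0-9a-f]+\.pcie: Link up/ {
            match($0, /[0-9a-f]+\.pcie/)
            st[substr($0, RSTART, RLENGTH)] = "up"
        }
        /tegra194-pcie [0-9a-f]+\.pcie: Phy link never came up/ {
            match($0, /[0-9a-f]+\.pcie/)
            st[substr($0, RSTART, RLENGTH)] = "down"
        }
        END { for (c in st) print c, st[c] }
    ' | sort
}

# Usage (file name is a placeholder):
#   pcie_link_summary < serial.log
```

Controllers that print neither message are simply not listed, so a missing line means "no information", not "link down".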

OK, so you mean that with the default ODM data "gbe-uphy-config-8,hsstp-lane-map-3,hsio-uphy-config-0" we should see the NVMe detected by PCIe C1 controller 14100000, on the PCIE1_XXX pins?

  1. What does 14100000.pcie: Phy link never came up exactly mean?
  2. Should I enable debug logs of PCIe driver, or any other info that will help?

I have tried with 2 different SD Express PCIe cards:

Both cards are detected by the Nvidia Xavier NX in this carrier board as NVMe storage connected over PCIe. They are also detected by a computer using a USB-C UHS-II card reader. What could the difference be between Xavier NX and Orin NX?
Note that I’m using Orin Nano devkit configuration from BSP when flashing Orin.

  1. Flash command - you mean I should use this one from Quick Start?

Jetson Orin Nano Developer Kit (NVMe):

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit internal

It means link detection fails. The Link training between Jetson and your PCIe device is failing.

Should I enable debug logs of PCIe driver, or any other info that will help?

There is a debug tip document for it. It is a general one covering PCIe debugging on both rel-35 and rel-36. Many different causes can lead to this failure.

Flash command - you mean I should use this one from Quick Start?

Yes, just use this command to flash.

You could also try rel-35, because some users have mentioned that PCIe detection on rel-36 is not as stable as it was on rel-35. I wonder if your case is related.

OK, thank you. For completeness, I have now used the command with --network usb0 on release 36.3. The NVMe is still not detected (14100000.pcie: Phy link never came up):

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit internal

Flashing output:
orin-36.3-network-usb-flashing-output.log (262.4 KB)

Serial output:
orin-36.3-network-usb-serial.log (80.6 KB)

I also downloaded and tried an older release. With BSP 35.2.1, I used the command for flashing Jetson Orin NX + Xavier NX Devkit (NVMe), after again changing the CVB EEPROM size to 0:

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_external.xml -p "-c bootloader/t186ref/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 p3509-a02+p3767-0000 internal

The flashing output:
orin-35.2.1-flashing-output.log (231.6 KB)

Serial output:
orin-35.2.1-serial.log (89 KB)

Unfortunately similar error.
So now I’m going to check all debugging steps from Jetson AGX Orin Platform Adaptation and Bring-Up — NVIDIA Jetson Linux Developer Guide 1 documentation

Just a comment that may not matter.

I am not sure why you picked 35.2.1 when 35.5 is available.

If you want to compare rel-35 and rel-36, pick the latest release of each rather than an old version.

That’s because the previous (Xavier NX) version of the product uses 35.2.1, so it would be easier to integrate if it turned out to work. It is also the first release with support for Orin NX.

I don’t have high hopes but will test 35.5 just to have 100% coverage.

A question regarding debugging: the PCIe troubleshooting page mentions using lspci a few times. I don’t have it in initrd. I can access some of the info manually from /sys/bus/pci files, but it’s inconvenient and there is some binary information that is decoded by lspci. Any chance to boot to Linux (without NVMe) where I will have lspci and devmem?

You can boot from a USB drive when debugging this kind of issue.

Hi Wayne,

I have cross-compiled lspci for initrd, and the devmem was already available as busybox devmem. I have done the debugging steps:

  1. Verify DLActive status in Root port LnkSta of lspci -vvv output. This is to check whether the link comes up by the time kernel boots to shell (for example, to confirm whether the link is taking more time to come up).

    • Full lspci -vvv output:
      orin-lspci-vvv.log (17.2 KB)
    • Bus domain 0001 does not appear in the lspci output, even though nvidia,disable-power-down; is set in the node. Is this expected? On the running device the property is present: ==> /proc/device-tree/bus@0/pcie@14100000/nvidia,disable-power-down <==
      Included in lspci are only: domain 0004 → PCI node 14160000 (root port for WiFi card), and domain 0008 → PCI node 140a0000 (root port for Realtek Gigabit Ethernet)
    • for reference: pci node 14100000 dumped from running system with find /proc/device-tree/ -type f -exec head {} +
      orin-device-tree-with-nvidia-disable-power-down.txt (2.4 KB)
  2. Dump PADCTL_PEX_CTL_PEX_L*_CLKREQ_N_0 and PADCTL_PEX_CTL_PEX_L*_RST_N_0 pinmux values and check if settings are correct.

    • PADCTL_PEX_CTL in TRM is referred to as PADCTL_A7, which is at 0x02437000.

      • PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0 = 0x02437020
      • PADCTL_PEX_CTL_PEX_L0_RST_N_0 = 0x02437028
      • PADCTL_PEX_CTL_PEX_L1_CLKREQ_N_0 = 0x02437010
      • PADCTL_PEX_CTL_PEX_L1_RST_N_0 = 0x02437018
      bash-5.1# busybox devmem 0x02437020
      0x00000415
      bash-5.1# busybox devmem 0x02437028
      0x00000415
      bash-5.1# busybox devmem 0x02437010
      0x00000470
      bash-5.1# busybox devmem 0x02437018
      0x00000420
      
  3. Dump PCIE_RP_APPL_DEBUG_0 register, refer to TRM for register address of each controller. Accessing the controller’s address, which is not enabled, will cause a CBB power down error. When you share this information in NVIDIA developer forum, it will help us determine the LTSSM state.

    • PCIE_RP_APPL_DEBUG_0 - Offset: 0xd0 (present in every PCIe instance, C0, C1 etc.)
    • PCIE_C1_CTL = 0x14100000
    • Reading 0x141000d0 returns 0xFFFFFFFF and a CBB error. For another PCIe instance that is up, such as 0x141600d0, the read returns a valid value, 0x00000088. So PCIe C1 is powered down.
  4. Reduce the link speed to Gen-1 and link width to x1 using device tree properties.

    • Reduced already in device tree before above testing: nvidia,max-speed = <0x01>; only one lane in use num-lanes = <1>;
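
The addresses used in steps 2 and 3 are plain base-plus-offset sums. The sketch below only prints the devmem commands and reads nothing itself, so it cannot trigger the CBB error on a powered-down controller. reg_addr is a hypothetical helper name; the bases and offsets are the ones quoted in the list above.

```shell
#!/bin/sh
# reg_addr BASE OFFSET -> absolute register address (hypothetical helper)
reg_addr() {
    printf '0x%08x\n' $(( $1 + $2 ))
}

PADCTL_A7=0x02437000   # PADCTL_PEX_CTL base, as quoted above
PCIE_C1=0x14100000     # PCIe C1 controller base, as quoted above

# Print the commands instead of running them.
echo "busybox devmem $(reg_addr "$PADCTL_A7" 0x10)   # PEX_L1_CLKREQ_N"
echo "busybox devmem $(reg_addr "$PADCTL_A7" 0x18)   # PEX_L1_RST_N"
echo "busybox devmem $(reg_addr "$PCIE_C1" 0xd0)   # APPL_DEBUG"
```

Swapping PCIE_C1 for another controller base (for example 0x14160000) gives the APPL_DEBUG address of that instance.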

Given the above, how can I make sure that PCIe C1 is not powered down (for proper debugging) even though the link cannot be established?
I have added property nvidia,disable-power-down which is a boolean property so it should be enough to add the line nvidia,disable-power-down; in the PCI node, correct?

If you are using rel-36, then nvidia,disable-power-down no longer works.

Please apply this patch instead.

Ok thank you.
I applied the patch and flashed the device. I can see that the bus stays up even without a link.

[    6.449208] tegra194-pcie 14100000.pcie: Disabling PCIe power down

Debug info:

  1. lspci -vvv output: there’s DLActive-. Full output:
    orin-lspci-vvv-no-power-down.log (26.8 KB)

  2. Dump PADCTL_PEX_CTL_PEX_L*_CLKREQ_N_0 and PADCTL_PEX_CTL_PEX_L*_RST_N_0 pinmux values and check if settings are correct. Same as above

  3. Dump PCIE_RP_APPL_DEBUG_0 register. PCIe stays up and we can read the real value:

busybox devmem 0x141000d0
0x00001818

So SMLH_LTSSM_STATE has the value 0x03, which is S_POLL_COMPLIANCE according to the encoding in the Orin Technical Reference Manual. What does that state mean?

  4. Reduce the link speed to Gen-1 and link width to x1 using device tree properties - done already.

What are the next debugging steps?
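
The LTSSM value in step 3 can be extracted from the raw APPL_DEBUG readout with a shift and a mask. This is a minimal sketch under one stated assumption: the state field occupies bits [8:3], which is the position used by the upstream Linux pcie-tegra194 driver and is consistent with the 0x03 reading above. The TRM remains the authority for the field layout and the state names; ltssm_state is a made-up helper name.

```shell
#!/bin/sh
# ltssm_state VALUE -> LTSSM state field of an APPL_DEBUG readout.
# ASSUMPTION: the field is bits [8:3]; verify against the TRM.
ltssm_state() {
    printf '0x%02x\n' $(( ($1 >> 3) & 0x3f ))
}

ltssm_state 0x00001818   # C1 readout from this thread, prints 0x03
ltssm_state 0x00000088   # 0x141600d0 readout from this thread, prints 0x11
```

0x03 is the S_POLL_COMPLIANCE value named in the post. The 0x11 from the working controller should, if the same encoding applies, be the normal link-up (L0) state, but check that against the TRM table rather than taking it from this sketch.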

This pinmux result looks odd: 0x470 means the tristate bit is not set to passthrough.

I also dumped the result from my NV devkit and the value is different from yours. NV devkit showed 0x460.

sudo busybox devmem 0x02437010
0x00000460

Ok, let’s clarify the registers in use as I’m not sure if dumping PADCTL_PEX_CTL_PEX_L1_CLKREQ_N_0 was needed:

A: Does the PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0 correspond to lane 0 (of controller C1) and thus the PCIe lane using PCIE1_RX0? And PADCTL_PEX_CTL_PEX_L1_CLKREQ_N_0 corresponds to lane 1 (unrelated, used by USB)?
B: Or does PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0 relate to controller C0, and PADCTL_PEX_CTL_PEX_L1_CLKREQ_N_0 relate to controller C1? I can’t clearly figure it out from the TRM, can you help?

Recap:

We’re using PCIe controller #1 (C1, addr 14160000) which only uses single lane: lane 0.
In default pinmux sheet it looks like this:

So example pin name is PCIE1_RX0_P.

In Xavier/Orin NX migration, we can see pins PCIE1_RX0/TX0 attached to PCIe C1 on Orin NX:

As additional validation, we can see that on Xavier NX, it was attached to PCIe controller C4 (addr 14160000), and I can indeed reproduce the SD Express card being detected by the PCIe C4 using the same carrier board:

[    3.099549] tegra194-pcie 14160000.pcie: Link up
[    3.101145] tegra194-pcie 14160000.pcie: PCI host bridge to bus 0004:00
[...]
[    3.141154] nvme 0004:01:00.0: Adding to iommu group 6
[    3.146633] nvme nvme0: pci function 0004:01:00.0
[    3.149676] nvme 0004:01:00.0: enabling device (0000 -> 0002)
[    3.203580]  nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15

In the Orin test, I don’t have any changes to device tree or pinmux other than the “disable power down” patch.

We’re using PCIe controller #1 (C1 , addr 14160000)

The address you posted is C4. Not C1. A typo?

As for the pinmux, I did a double check with internal team. CLKREQ is only for L1SS and does not affect link detection. You could just ignore that.

And for your question, PADCTL_PEX_CTL_PEX_L1_CLKREQ_N is for C1. Not PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0.


It is common to see S_POLL_COMPLIANCE when link detection fails. We’ve seen such result multiple times.
However, it does not give much hint.

Is such card or similar thing able to run on NV devkit instead of your board?

Yes, it was a typo. It’s 14100000, controller C1 on Orin, whereas on Xavier these pins are handled by 14160000, controller C4.

And for your question, PADCTL_PEX_CTL_PEX_L1_CLKREQ_N is for C1. Not PADCTL_PEX_CTL_PEX_L0_CLKREQ_N_0.

Thank you for the clarification. So for controller C1 we still care about PADCTL_PEX_CTL_PEX_L1_RST_N_0 (addr 0x02437018, value 0x00000420). I read it as: PE1, no pull-up, passthrough, E_IO_HV enabled, GPIO_SF_SEL: SFIO, rest disabled. Does that look correct?
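
The pad values dumped in this thread can be split into fields mechanically. The sketch below is illustrative only: decode_padctl is a made-up name, and the bit positions are assumptions taken from how the values are read in this thread (bit 4 tristate, bit 5 E_IO_HV, bit 10 GPIO_SF_SEL) plus the usual Tegra layout for the rest (bits [1:0] function select, bits [3:2] pull up/down, bit 6 E_INPUT). Confirm every field against the TRM before relying on it.

```shell
#!/bin/sh
# decode_padctl VALUE -> field breakdown of a PADCTL pad register.
# ASSUMPTION: bit positions as described in the lead-in, not taken from the TRM.
decode_padctl() {
    v=$(( $1 ))
    printf 'pm=%d pupd=%d tristate=%d e_io_hv=%d e_input=%d sfio=%d\n' \
        $(( $v & 0x3 )) \
        $(( ($v >> 2) & 0x3 )) \
        $(( ($v >> 4) & 1 )) \
        $(( ($v >> 5) & 1 )) \
        $(( ($v >> 6) & 1 )) \
        $(( ($v >> 10) & 1 ))
}

decode_padctl 0x420   # PEX_L1_RST_N on the custom board
decode_padctl 0x470   # PEX_L1_CLKREQ_N on the custom board
decode_padctl 0x460   # PEX_L1_CLKREQ_N on the NV devkit
```

Under these assumptions 0x470 and 0x460 differ only in the tristate bit, which is the difference pointed out earlier in the thread.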

About other possible properties needed, here’s the comparison between the device tree properties on

  • Xavier NX (working with this carrier board) Jetson_Linux_R35.2.1_aarch64/Linux_for_Tegra/sources/hardware/nvidia/platform/t19x/jakku/kernel-dts/tegra194-p3668-0001-p3509-0000.dts
  • and Orin NX (devkit device tree) Jetson_Linux_R36.3.0_aarch64/Linux_for_Tegra/source/hardware/nvidia/t23x/nv-public/nv-platform/tegra234-p3768-0000+p3767-xxxx-nv-common.dtsi :
    Xavier-Orin-pcie-diff - Diffchecker
    The difference I see is in nvidia,disable-aspm-states.

Trying to find other hints, in the guide, there is also this part that mentions using more MSI interrupts, do you suggest making changes there?

Is such card or similar thing able to run on NV devkit instead of your board?

I can run a test on the Nvidia devkit. However, its SD card slot is UHS-I and connected to the SDMMC peripheral, not PCIe, so I’ll need to wire and solder our SD Express adapter to the PCIe pins on the devkit.

Hi,

nvidia,disable-aspm-states .

You could try this. Disabling ASPM might help.

Trying to find other hints, in the guide, there is also this part that mentions using more MSI interrupts, do you suggest making changes there?

This does not help. Not related.

Actually dumping the LA trace by using PCIe analyzer might clarify. But I guess you don’t have such tools as most of users don’t have it.

I did an experiment with ASPM:

        LnkCap:	Port #0, Speed 16GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
			ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  • debugfs shows
# cat /sys/kernel/debug/pcie\@14100000/aspm_state_cnt 
Tx L0s entry count : 0
Rx L0s entry count : 0
Link L1 entry count : 0
Link L1.1 entry count : 0
Link L1.2 entry count : 0

In both cases I have tegra194-pcie 14100000.pcie: Phy link never came up .

Actually dumping the LA trace by using PCIe analyzer might clarify. But I guess you don’t have such tools as most of users don’t have it.

I have this logic analyzer which can sample up to 1GHz, is it fast enough for these lines?

One trick I forgot to mention: have you tried rebinding the driver after applying the disable-power-down patch? Though I guess it might not work in the DLActive- situation.

e.g. toggle such node (should check which one is C1)

 /sys/bus/pci/devices/0000:00:00.1/driver/unbind

I’ve seen it in some other thread before, and I did

echo 14100000.pcie > unbind
echo 14100000.pcie > bind 

However there’s no reaction in dmesg, as you said it doesn’t matter for DLActive-.
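
For reference, the unbind/bind sequence can be wrapped so the driver directory is explicit. This is a sketch, not a tested recipe: rebind_pcie is a made-up name, and the default path /sys/bus/platform/drivers/tegra194-pcie is an assumption based on the driver name printed in dmesg, so check that the directory exists on the target first.

```shell
#!/bin/sh
# rebind_pcie DEVICE [DRIVER_DIR]
# Unbind and rebind a platform device from its driver through sysfs.
# ASSUMPTION: the default DRIVER_DIR below is inferred from the
# "tegra194-pcie" name in dmesg; pass the real path if it differs.
rebind_pcie() {
    dev=$1
    drv=${2:-/sys/bus/platform/drivers/tegra194-pcie}
    echo "$dev" > "$drv/unbind" || return 1
    sleep 1   # give the driver a moment to release the device
    echo "$dev" > "$drv/bind" || return 1
}

# Usage (as root on the target):
#   rebind_pcie 14100000.pcie
#   dmesg | tail
```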

Let me go back and check this.