PCIe Device - No Link

I am running JetPack 4.1 on the Xavier and having trouble with PCIe.

The device I am using shows link activation for only a second or two on the initial boot of the Xavier, then it goes down. The boot log doesn’t seem to have anything unusual as far as I can tell. I have tried the Xavier with a more common PCIe device (Intel X520), which works as expected. The device in question works fine on x86 Ubuntu Linux.

What are the next steps to take to debug why this device does not stay up?

Here is the output from dmesg for the bus with an Intel NIC -

$ dmesg | grep 0005:00
[    7.981355] tegra-pcie-dw 141a0000.pcie: PCI host bridge to bus 0005:00
[    7.981361] pci_bus 0005:00: root bus resource [bus 00-ff]
[    7.981365] pci_bus 0005:00: root bus resource [io  0x300000-0x3fffff] (bus address [0x3a100000-0x3a1fffff])
[    7.981388] pci_bus 0005:00: root bus resource [mem 0x3a200000-0x3bffffff]
[    7.981391] pci_bus 0005:00: root bus resource [mem 0x1c00000000-0x1fffffffff pref]
[    7.981410] pci 0005:00:00.0: [10de:1ad0] type 01 class 0x060400
[    7.981560] pci 0005:00:00.0: PME# supported from D0 D3hot D3cold
[    7.981728] iommu: Adding device 0005:00:00.0 to group 63
[    8.035431] pci 0005:00:00.0: BAR 14: assigned [mem 0x3a200000-0x3a7fffff]
[    8.035434] pci 0005:00:00.0: BAR 13: assigned [io  0x300000-0x300fff]
[    8.035843] pci 0005:00:00.0: PCI bridge to [bus 01-ff]
[    8.035847] pci 0005:00:00.0:   bridge window [io  0x300000-0x300fff]
[    8.035852] pci 0005:00:00.0:   bridge window [mem 0x3a200000-0x3a7fffff]
[    8.035868] pci 0005:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[    8.036252] pcieport 0005:00:00.0: Signaling PME through PCIe PME interrupt
[    8.036260] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
[    8.036418] aer 0005:00:00.0:pcie002: service driver aer loaded

and here is the output for my device -

$ dmesg | grep 0005:00
[    8.539247] tegra-pcie-dw 141a0000.pcie: PCI host bridge to bus 0005:00
[    8.540642] pci_bus 0005:00: root bus resource [bus 00-ff]
[    8.542074] pci_bus 0005:00: root bus resource [io  0x300000-0x3fffff] (bus address [0x3a100000-0x3a1fffff])
[    8.543549] pci_bus 0005:00: root bus resource [mem 0x3a200000-0x3bffffff]
[    8.544928] pci_bus 0005:00: root bus resource [mem 0x1c00000000-0x1fffffffff pref]
[    8.546414] pci 0005:00:00.0: [10de:1ad0] type 01 class 0x060400
[    8.546600] pci 0005:00:00.0: PME# supported from D0 D3hot D3cold
[    8.546812] iommu: Adding device 0005:00:00.0 to group 63
[    8.548389] pci 0005:00:00.0: PCI bridge to [bus 01-ff]
[    8.549792] pci 0005:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[    8.551485] pcieport 0005:00:00.0: Signaling PME through PCIe PME interrupt
[    8.552896] pcie_pme 0005:00:00.0:pcie001: service driver pcie_pme loaded
[    8.553006] aer 0005:00:00.0:pcie002: service driver aer loaded
[    8.553178] pcie_pme 0005:00:00.0:pcie001: unloading service driver pcie_pme
[    8.553225] aer 0005:00:00.0:pcie002: unloading service driver aer
[    8.553386] iommu: Removing device 0005:00:00.0 from group 63
[    8.554841] pci_bus 0005:00: busn_res: [bus 00-ff] is released

dmesg.log (8.43 KB)
lspci.log (6.6 KB)

I can’t answer, but you will want to include a verbose lspci. If you run “sudo lspci -vvv 2>&1 | tee log_lspci.txt” you can attach that to your thread (hover your mouse over the quote icon in the upper right, and the paper clip icon will show up for attaching files). If you can do this both before and after the failure it would be best, but after would probably be fine if this is all you can log.
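As a rough sketch of how the saved log could be skimmed afterwards: the function below (the name `link_summary` is made up for illustration) pulls the `LnkCap:` and `LnkSta:` lines for each device out of a saved `lspci -vvv` log. It assumes the usual lspci layout, where device header lines start at column 0 and the capability lines are indented.

```shell
# Print the link capability/status lines per device from a saved
# `lspci -vvv` log. "link_summary" is a hypothetical helper name.
link_summary() {
    awk '
        /^[0-9a-fA-F]/ { dev = $1 }      # device header, e.g. 0005:00:00.0
        /LnkCap:|LnkSta:/ {
            sub(/^[ \t]+/, "")           # trim the leading indentation
            print dev ": " $0
        }
    ' "$1"
}
```

On the Xavier this could be run as `link_summary log_lspci.txt`. A `DLActive-` flag in the root port's `LnkSta:` line would suggest the data link layer is not up, but treat that as a hint rather than a diagnosis.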

Also, can you please share all the PCIe-related lines in the log (“dmesg | grep -i pci”)?

Both logs are attached to my previous post, but they don’t seem to shed any more light on what is going on.

The PCIe error mechanism does not show any errors. Was this lspci before or after failure? If after, then the cause isn’t PCIe, but something further down the chain of drivers.

On the other hand, the end of dmesg shows the AER mechanism is shutting down the bus:

[    9.421326] aer 0005:00:00.0:pcie002: unloading service driver aer
[    9.421386] pci_bus 0005:01: busn_res: [bus 01-ff] is released
[    9.423582] pci_bus 0005:00: busn_res: [bus 00-ff] is released
[    9.423873] tegra-pcie-dw 141a0000.pcie: PCIe link is not up...!

Someone else may know why the lspci AER shows no error, and then dmesg claims AER as a reason for shutdown. Or maybe I’m just interpreting “unloading service driver aer” incorrectly.

Or maybe I’m just interpreting “unloading service driver aer” incorrectly.
That interpretation is actually wrong. Since no PCIe device was found, the AER service driver that was loaded for the root port gets unloaded as the host controller driver shuts the controller down. So this print is expected.

@c_seymour,
How are you able to say that the link is up momentarily? From the log, it looks like the PCIe link never came up. BTW, what kind of a PCIe endpoint device is this? Is it based on an FPGA? Also, did you happen to check link-up on any other platform (like x86)?
Also, do we have CLKREQ signal routing from your PCIe endpoint to root port here?

How are you able to say that the link is up momentarily?

The LEDs on the PCIe adapter card are green for ~10 seconds until the bus is shut down.

What kind of a PCIe endpoint device is this

Yes, it is an FPGA, and it works fine on x86 Ubuntu Linux.

Do we have CLKREQ signal routing from your PCIe endpoint to root port here?

Yes.

Sorry there isn’t much information to go on, but I’m stumped.

OK. The LEDs only indicate that power was available to the endpoint for a brief time; they do not indicate that the PCIe link was up. In fact, since the PCIe link didn’t come up within the specified time, power to the slot is cut, which turns the LEDs off.
Since this is an FPGA-based endpoint, I suspect that the time the driver waits for the PCIe link to come up may be too short, so it is worth increasing the wait time.
Please try the patch below and see if that helps. It increases the wait time before the link-up check from 100 ms to 5 sec. If it doesn’t work with a 5 sec delay, experiment with higher values to see if it works.

diff --git a/drivers/pci/host/pcie-tegra-dw.c b/drivers/pci/host/pcie-tegra-dw.c
index 63ec46b3430b..4dcb089a2ed1 100644
--- a/drivers/pci/host/pcie-tegra-dw.c
+++ b/drivers/pci/host/pcie-tegra-dw.c
@@ -2351,7 +2351,7 @@ static void tegra_pcie_dw_host_init(struct pcie_port *pp)
        val |= APPL_PINMUX_PEX_RST;
        writel(val, pcie->appl_base + APPL_PINMUX);
 
-       msleep(100);
+       msleep(5000);
 
        val = readl(pp->dbi_base + CFG_LINK_STATUS_CONTROL);
        while (!(val & CFG_LINK_STATUS_DLL_ACTIVE)) {
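
After rebooting with a changed delay, one quick way to see whether the driver still gave up on the link is to look for the "PCIe link is not up" message quoted earlier in the thread. The helper below is only a sketch (`link_verdict` is a made-up name); it reads kernel log text on stdin and takes the controller name as its argument.

```shell
# Report whether the given controller logged the link-down message.
# Reads kernel log text on stdin; "link_verdict" is a hypothetical name.
link_verdict() {
    ctrl="$1"
    # -F: match the controller name literally (it contains dots)
    if grep -qF "${ctrl}: PCIe link is not up"; then
        echo "${ctrl}: link did not come up"
    else
        echo "${ctrl}: no link-down message found"
    fi
}
```

Usage on the Xavier would be `dmesg | link_verdict 141a0000.pcie`. Note that the absence of the message does not by itself prove the link trained; the endpoint should also show up in lspci.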

So, following the kernel customization documentation, I ran source_sync.sh, but I cannot find pcie-tegra-dw.c. Do I need to specify a specific tag when running source_sync.sh?

$ ./source_sync.sh
...
$ find -name pcie-tegra-dw.c
$ tree ./rootfs/usr/src/linux-headers-4.9.108-tegra/drivers/pci/
./rootfs/usr/src/linux-headers-4.9.108-tegra/drivers/pci/
├── host
│   ├── Kconfig
│   └── Makefile
├── hotplug
│   ├── Kconfig
│   └── Makefile
├── Kconfig
├── Makefile
└── pcie
    ├── aer
    │   ├── Kconfig
    │   ├── Kconfig.debug
    │   └── Makefile
    ├── Kconfig
    └── Makefile

4 directories, 11 files

I should also note that the FPGA is externally powered and initialized before the Jetson boots.

Is there a way to keep VDD_12V power rail (which powers PCIe x16 slot) enabled even if OS doesn’t detect PCIe link?

And what is the output type of the PEX_CLK5_P/N signals on SoC (e.g. HCSL, LP-HCSL, LVDS etc)?

So you mentioned JetPack 4.1. Does R31.1 show up from:

head -n 1 /etc/nv_tegra_release

I don’t know if the source_sync.sh command you showed was just abbreviated from what was actually typed, but if not and if using R31.1, then for the kernel code download with source_sync.sh the command would go like this:

./source_sync.sh -k tegra-l4t-r31.1
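
To read the release off that first line without eyeballing it, something like the following could work. It is a sketch that assumes the line has the form `# R31 (release), REVISION: 1.0, GCID: ...`; that format is an assumption here, and `l4t_release` is a made-up helper name. It deliberately does not construct the git tag, which still has to be matched by hand.

```shell
# Extract the major release and revision from the first line of
# /etc/nv_tegra_release (read on stdin). The line format is assumed;
# "l4t_release" is a hypothetical helper name.
l4t_release() {
    sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\).*/R\1 revision \2/p'
}
```

Usage: `head -n 1 /etc/nv_tegra_release | l4t_release`. If the line does not match the assumed format, nothing is printed.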

The following patch can be used to keep 12V power flowing to the slot even if the PCIe link is not up:

diff --git a/drivers/pci/host/pcie-tegra-dw.c b/drivers/pci/host/pcie-tegra-dw.c
index 63ec46b3430b..6e2ea926d4e9 100644
--- a/drivers/pci/host/pcie-tegra-dw.c
+++ b/drivers/pci/host/pcie-tegra-dw.c
@@ -3082,7 +3082,7 @@ static int tegra_pcie_dw_runtime_suspend(struct device *dev)
        reset_control_assert(pcie->core_apb_rst);
        clk_disable_unprepare(pcie->core_clk);
        regulator_disable(pcie->pex_ctl_reg);
-       config_plat_gpio(pcie, 0);
+       //config_plat_gpio(pcie, 0);

        if (pcie->cid != CTRL_5)
                uphy_bpmp_pcie_controller_state_set(pcie->cid, false);

It is LVDS

Great, we’re using a Si53102-A3 clock buffer on the board c_seymour is bringing up, so a DC-coupled LVDS input clock should be fine for it.

I monitored the PEX_CLK5_P signal and noticed that this PCIe clock is briefly enabled on power-on/reset, then disabled while the OS boots, then enabled for about 2 ms at some stage of the boot process, and then disabled again (presumably because it doesn’t detect a PCIe link). Does this 2 ms PCIe clock enable period correspond to anything in the code?

I’m not sure about the clock being present during power-on/reset, but during boot it should be available for around 100 ms, certainly not 2 ms. Are you sure that it is 2 ms? Also, did you measure its frequency to be 100 MHz?

I’ll try repeating the measurement to double-check the 2 ms period. Unfortunately we only have a single-ended active probe, not a differential one, but it should be good enough for indicative measurements.
Yes, the frequency was 100 MHz.

If this is really 2 ms, then something is seriously wrong. As I mentioned, it has to be around 100 ms.

I’ve monitored VDD_12V (CH1), PEX_L5_RST_N_R (CH2) and PCIE_REFCLK_P (CH3) on power-on (see attachment).

It doesn’t seem to comply with the PCI Express Card Electromechanical Specification, section 2.2: “On power up, the deassertion of PERST# is delayed 100 ms (TPVPERL) from the power rails achieving specified operating limits”
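
For reference, the check against the quoted 100 ms figure is simple arithmetic on two scope cursor readings. The sketch below takes the time the rail reaches its operating limits and the time PERST# deasserts, both in milliseconds; `tpvperl_check` is a made-up name, and any numbers passed to it are whatever is read off the capture, not values from this thread.

```shell
# Compare the rail-good to PERST#-deassert delay with the 100 ms TPVPERL
# minimum quoted above. Arguments: rail-good time, PERST# deassert time (ms).
# "tpvperl_check" is a hypothetical helper name.
tpvperl_check() {
    awk -v rail="$1" -v perst="$2" 'BEGIN {
        delay = perst - rail
        verdict = (delay >= 100) ? "meets" : "violates"
        printf "%g ms: %s TPVPERL (100 ms minimum)\n", delay, verdict
    }'
}
```

For example, with hypothetical cursor readings of 0 ms and 40 ms, `tpvperl_check 0 40` reports a violation.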

Scratch the 2 ms thing: it was a spurious output from the PCIe clock buffer when it was powered down (VDD_12V gets disabled while the clock input is still active).
1.png
