Jetson Xavier not booting with PCIe x1 Switch

Hi

We have a problem with a custom carrier board and the Jetson AGX Xavier module. When we connect an expansion board carrying two cascaded PCIe switches (PI7C9X2G404SV from Diodes) to our carrier board, the system gets stuck during boot and then restarts again and again. The log file is attached. We see this behavior with an unmodified JetPack 4.2.2. Other PCIe devices work on the same expansion interface, and the expansion board with the two PCIe switches works fine in an x86 system. The PCIe controller in use is C5, and we use it only as an x1 PCIe interface.
Thank you for your help.

Kind regards
log.txt (89.6 KB)

I suspect the switch is not pulling the CLKREQ signal low properly.
Can you please apply the following change and see if the issue disappears?

diff --git a/drivers/pci/dwc/pcie-tegra.c b/drivers/pci/dwc/pcie-tegra.c
index ee5fbfa2f7c9..3980c771455c 100644
--- a/drivers/pci/dwc/pcie-tegra.c
+++ b/drivers/pci/dwc/pcie-tegra.c
@@ -4535,6 +4535,11 @@ static int tegra_pcie_dw_runtime_resume(struct device *dev)
        val |= (APPL_CFG_MISC_ARCACHE_VAL << APPL_CFG_MISC_ARCACHE_SHIFT);
        writel(val, pcie->appl_base + APPL_CFG_MISC);

+       val = readl(pcie->appl_base + APPL_PINMUX);
+       val |= APPL_PINMUX_CLKREQ_OVERRIDE_EN;
+       val &= ~APPL_PINMUX_CLKREQ_OVERRIDE;
+       writel(val, pcie->appl_base + APPL_PINMUX);
+
        if (pcie->disable_clock_request) {
                val = readl(pcie->appl_base + APPL_PINMUX);
                val |= APPL_PINMUX_CLKREQ_OUT_OVRD_EN;
@@ -4718,6 +4723,11 @@ static int tegra_pcie_dw_resume_noirq(struct device *dev)
        val |= (APPL_CFG_MISC_ARCACHE_VAL << APPL_CFG_MISC_ARCACHE_SHIFT);
        writel(val, pcie->appl_base + APPL_CFG_MISC);

+       val = readl(pcie->appl_base + APPL_PINMUX);
+       val |= APPL_PINMUX_CLKREQ_OVERRIDE_EN;
+       val &= ~APPL_PINMUX_CLKREQ_OVERRIDE;
+       writel(val, pcie->appl_base + APPL_PINMUX);
+
        if (pcie->disable_clock_request) {
                val = readl(pcie->appl_base + APPL_PINMUX);
                val |= APPL_PINMUX_CLKREQ_OUT_OVRD_EN;

If your release uses the pcie-tegra-dw.c file instead, please make a similar change in that file (older releases used pcie-tegra-dw.c, hence this suggestion).
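
The read-modify-write in the patch can be modelled in user space. The sketch below is an illustration in Python, not driver code, and the bit positions are assumptions chosen for the example rather than values taken from this thread. It shows the intended effect (set the override-enable bit, clear the override value bit, leave everything else alone) and why clearing a single bit needs an AND with the inverted mask: an AND with the plain mask wipes every other bit in the register.

```python
# Illustrative bit positions only (assumed for this sketch).
CLKREQ_OVERRIDE_EN = 1 << 2
CLKREQ_OVERRIDE = 1 << 3
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def force_clkreq_low(val: int) -> int:
    """Set the override-enable bit and clear the override value bit."""
    val |= CLKREQ_OVERRIDE_EN
    val &= ~CLKREQ_OVERRIDE & MASK32  # inverted mask: only this bit is cleared
    return val


def and_with_plain_mask(val: int) -> int:
    """Same sequence without the inversion: all other bits are lost."""
    val |= CLKREQ_OVERRIDE_EN
    val &= CLKREQ_OVERRIDE  # keeps only the override bit, clears the rest
    return val


if __name__ == "__main__":
    before = 0b0011_1001  # arbitrary example register content
    print(f"before:        {before:#010b}")
    print(f"inverted mask: {force_clkreq_low(before):#010b}")
    print(f"plain mask:    {and_with_plain_mask(before):#010b}")
```

Comparing the two results makes it easy to check that a hand-applied copy of the patch still preserves the unrelated bits of the register.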

Hi vidyas

Thank you for the patch. Sadly with it, our system does not even boot without the expansion board. It gets stuck and starts rebooting after a while.
We came across another strange behavior with the PCIe switches:

  • When the first PCIe switch has two NICs connected and also leads to a second switch with another two NICs, everything works: the system boots without problems and all PCIe devices seem to work.

  • When nothing is connected to the first PCIe switch except the second, cascaded PCIe switch with its two NICs, we see the behavior described above.

Do you have any other idea?
Thank you.

Thank you for the patch. Sadly with it, our system does not even boot without the expansion board. It gets stuck and starts rebooting after a while

This is strange. Are you able to update the kernel otherwise? That is, setting the above change aside, if you make just some trivial change (say, adding an extra print message), are you able to update the kernel? The above change can’t cause system lockup issues for sure.

Hi vidyas

We are now one step further. The configuration of the PCIe switch seems to have been the problem.
Now we have a system that always boots. Sadly, another error now occurs: after some random runtime, we get the message “PCIe link lost, device now detached”.

[  123.430376] igb 0005:07:00.0 eth1: PCIe link lost, device now detached
[  123.434756] ------------[ cut here ]------------
[  123.438811] WARNING: CPU: 6 PID: 1701 at /dvs/git/dirty/git-master_linux/kernel/kernel-4.9/drivers/net/ethernet/intel/igb/igb_main.c:766 igb_rd32+0xb4/0xc0
[  123.443113] Modules linked in: zram overlay cdc_acm nvgpu bluedroid_pm ip_tables x_tables

[  123.443209] CPU: 6 PID: 1701 Comm: kworker/6:2 Tainted: G        W       4.9.140-tegra #1
[  123.443213] Hardware name: Jetson-AGX (DT)
[  123.443230] Workqueue: events igb_watchdog_task
[  123.443242] task: ffffffc3ec2c9c00 task.stack: ffffffc3d9ff8000
[  123.443249] PC is at igb_rd32+0xb4/0xc0
[  123.443256] LR is at igb_rd32+0x88/0xc0
[  123.443290] pc : [<ffffff80088c8584>] lr : [<ffffff80088c8558>] pstate: 00c00045
[  123.443296] sp : ffffffc3d9ffbc90
[  123.443302] x29: ffffffc3d9ffbc90 x28: 0000000000000000 
[  123.443317] x27: 0000000000000000 x26: 0000000000000000 
[  123.443357] x25: 0000000000000000 x24: 000000000000c030 
[  123.443372] x23: ffffffc3d87421c0 x22: 0000000000000001 
[  123.443385] x21: ffffffc3d8718000 x20: ffffffc3d8718eb0 
[  123.443399] x19: 00000000ffffffff x18: 0000000000000010 
[  123.443412] x17: 0000000000000118 x16: ffffff8008150338 
[  123.443426] x15: 0000000000000001 x14: 6568636174656420 
[  123.443438] x13: 776f6e2065636976 x12: 6564202c74736f6c 
[  123.443451] x11: 206b6e696c206549 x10: 0000000000000538 
[  123.443468] x9 : 20302e30303a3730 x8 : ffffff80083d4798 
[  123.443482] x7 : ffffff8009e94358 x6 : ffffffc3ffe03bf0 
[  123.443496] x5 : ffffffc3ffe03bf0 x4 : 0000000000000000 
[  123.443509] x3 : ffffffc3ffe097f8 x2 : ffffffc3ffe03bf0 
[  123.443551] x1 : 0000000000000001 x0 : ffffff800a074000 

[  123.443568] ---[ end trace 91b5c38742e1ff03 ]---
[  123.447644] Call trace:
[  123.447685] [<ffffff80088c8584>] igb_rd32+0xb4/0xc0
[  123.447695] [<ffffff80088cec44>] igb_update_stats+0x94/0x8a8
[  123.447704] [<ffffff80088cf554>] igb_watchdog_task+0xfc/0x750
[  123.447716] [<ffffff80080d4f3c>] process_one_work+0x1e4/0x4b0
[  123.447750] [<ffffff80080d5258>] worker_thread+0x50/0x4c8
[  123.447761] [<ffffff80080dbee4>] kthread+0xec/0xf0
[  123.447771] [<ffffff8008083850>] ret_from_fork+0x10/0x40
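
To see how often and on which device the link drops over a longer run, the kernel log can be scanned for this message. The following is a minimal sketch; the function name and the returned field names are made up for illustration, and the pattern is based only on the single log line shown above, so other drivers or kernel versions may format the message differently.

```python
import re

# Matches lines such as:
# [  123.430376] igb 0005:07:00.0 eth1: PCIe link lost, device now detached
LINK_LOST = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<iface>\S+): PCIe link lost"
)


def find_link_lost(log_text: str) -> list:
    """Return one dict per 'PCIe link lost' line found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = LINK_LOST.search(line)
        if m:
            domain, bus, devfn = m["bdf"].split(":")
            events.append({
                "time_s": float(m["time"]),   # seconds since boot
                "driver": m["driver"],
                "domain": domain,             # PCI domain (segment)
                "bus": bus,
                "devfn": devfn,               # device.function
                "iface": m["iface"],
            })
    return events


if __name__ == "__main__":
    sample = ("[  123.430376] igb 0005:07:00.0 eth1: "
              "PCIe link lost, device now detached")
    for event in find_link_lost(sample):
        print(event)
```

Feeding it the saved output of dmesg gives a list of events that can be compared across boots, for example with and without a given kernel argument.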

ifconfig then shows many errors and collisions on the Ethernet interface:

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.47  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::f355:184d:fcca:9139  prefixlen 64  scopeid 0x20<link>
        ether 00:a0:10:01:d2:52  txqueuelen 1000  (Ethernet)
        RX packets 1009  bytes 77200 (77.2 KB)
        RX errors 1786706394720  dropped 446676598681  overruns 2233382993400  frame 1786706394720
        TX packets 97  bytes 11761 (11.7 KB)
        TX errors 893353197360  dropped 0 overruns 0  carrier 893353197360  collisions 446676598680
        device memory 0x1f40000000-1f400fffff
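
A side note on these counters: they are almost exact multiples of 0xFFFFFFFF, the value a register read returns when the device behind a PCIe link is no longer reachable (x19 in the register dump above also holds 0xffffffff). The short check below does the arithmetic. The interpretation is an assumption and not something confirmed in this thread: the numbers look like accumulated all-ones register reads after the link dropped rather than real network errors.

```python
ALL_ONES = 0xFFFFFFFF  # 4294967295, a 32-bit read with every bit set

# Values copied from the ifconfig output above.
COUNTERS = {
    "rx_errors": 1786706394720,
    "rx_dropped": 446676598681,
    "rx_overruns": 2233382993400,
    "rx_frame": 1786706394720,
    "tx_errors": 893353197360,
    "tx_carrier": 893353197360,
    "collisions": 446676598680,
}


def split_all_ones(value: int) -> tuple:
    """Return (multiples of 0xFFFFFFFF, remainder) for a counter value."""
    return divmod(value, ALL_ONES)


if __name__ == "__main__":
    for name, value in COUNTERS.items():
        multiples, remainder = split_all_ones(value)
        print(f"{name:12s} = {multiples} * 0xFFFFFFFF + {remainder}")
```

Every counter has a remainder of 0 except the RX dropped count, which has a remainder of 1, so the huge values should probably not be read as a cabling or duplex problem on the Ethernet side.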

Any idea what could be the problem?
Thank you.

Not really sure. I think the switch is playing a role here.

What do we have to do with the signal PEX_L5_RST_N if it is unused? On the Jetson AGX Xavier module, it has a pull-up to 3.3V. We have simply left the signal unconnected on our carrier board and set it to unused in our pinmux. Is that correct?

Nope. It can’t be an unused signal. It must go to the downstream device’s PERST. This is a spec-defined signal that applies reset to the downstream device.

Sorry, we meant the signal PEX_L5_CLKREQ_N. What about this one?

You can leave it floating, provided the patch from comment #2 is present in the code.

Hi vidyas

We finally found a solution. The problem seems to be the active state power management (ASPM) of the PCIe switch. With the kernel argument “pcie_aspm=off” we have not seen any more boot or disconnection problems with our expansion board.
Thank you for your help.
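
To confirm on a running system that the argument actually took effect, the kernel command line and the ASPM policy can be inspected. This is a minimal sketch with illustrative helper names; it assumes a kernel built with PCIe ASPM support, which exposes the policy under /sys/module/pcie_aspm/parameters/policy with the active entry shown in square brackets.

```python
from pathlib import Path
from typing import Optional


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if pcie_aspm=off appears as its own kernel argument."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text: str) -> Optional[str]:
    """Return the bracketed (active) entry of the policy file, if any."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    cmdline_file = Path("/proc/cmdline")
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if cmdline_file.exists():
        disabled = aspm_disabled_on_cmdline(cmdline_file.read_text())
        print("pcie_aspm=off on command line:", disabled)
    if policy_file.exists():
        print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

The policy file only reports the kernel-wide policy, so the command-line check is the one that confirms ASPM handling was switched off at boot.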

Good to hear that your issue got resolved.
Thanks for updating us.