Occasional `14160000.pcie: Phy link never came up`

Hi, we’ve hit a problem where a 10G network card on 14160000.pcie fails to enumerate. We use a custom board; the failure rate is low on some carrier boards (about 1 in 50 boots) and higher on others (about 3 in 5). Here are the full logs, covering four cases: cold/hot boot combined with success/failure of the PCIe enumeration:

AQC113_cannot_found_logs.tar.gz (88.5 KB)


I tried to extend the PCIe link-wait time with the following patch:

diff --git a/drivers/pci/controller/dwc/pcie-designware.c b/drivers/pci/controller/dwc/pcie-designware.c
index 5563979310ba..fe8d790db305 100644
--- a/drivers/pci/controller/dwc/pcie-designware.c
+++ b/drivers/pci/controller/dwc/pcie-designware.c
@@ -557,6 +557,7 @@ int dw_pcie_wait_for_link(struct dw_pcie *pci)
                        return 0;
                }
                usleep_range(LINK_WAIT_USLEEP_MIN, LINK_WAIT_USLEEP_MAX);
+               dev_info(pci->dev, "ben: wait loop (%d/%d)\n", retries+1, LINK_WAIT_MAX_RETRIES);
        }

        dev_info(pci->dev, "Phy link never came up\n");
diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
index 76c57d4fa714..fd4c78dab204 100644
--- a/drivers/pci/controller/dwc/pcie-designware.h
+++ b/drivers/pci/controller/dwc/pcie-designware.h
@@ -25,7 +25,7 @@
 #define DW_PCIE_VER_562A               0x3536322A

 /* Parameters for the waiting for link up routine */
-#define LINK_WAIT_MAX_RETRIES          10
+#define LINK_WAIT_MAX_RETRIES          100
 #define LINK_WAIT_USLEEP_MIN           90000
 #define LINK_WAIT_USLEEP_MAX           100000

Sometimes enumeration succeeds on the first try, as before; sometimes it takes about 30 retries, sometimes about 60, and sometimes it still fails even after 100 retries. Do you have any advice or suspicions?

Has the previous problem been resolved or not?

No, and this problem is blocking the test. Tomorrow I’ll skip it and continue testing.

Could you test this with rel-35 first to make sure rel-35 is fine?

Rel-36 is still a developer preview, which is not suitable for bring-up.

The problem also occurred with rel-35.3.1, as we tested before.

By the way, we tested the following case yesterday:

I used a USB stick containing a rootfs and chrooted into that environment. Sometimes the PCIe device could be enumerated by reloading pcie_tegra194.

Hi,

What is the exact issue you are talking about here? Why is a USB stick + chroot needed? It sounds totally unrelated to PCIe.

Sorry for the confusion. The original intent was to reload the driver and see if that helps. Since the external disk is also on the PCIe bus, we had to use a USB stick to hold the rootfs.

Let’s clarify the whole situation again:

  1. What is the exact PCIe device that leads to this error? In the beginning you said a 10G network card on 14160000.pcie caused the problem.

  2. Now you say the external disk is also on the PCIe bus.

Which one is causing the error here? Could both devices cause a boot hang?

The first one; only the 10G network card causes the problem.

The reason I mentioned the external disk is that if I unload pcie_tegra194, the external disk goes away. The whole system is stored on it, so I cannot “reload the PCIe driver” without the help of a USB disk. The “external disk” I mean here is the NVMe disk on the PCIe bus.

Sorry again for the confusion. Please ignore the USB stick/disk; the point is that sometimes reloading the PCIe driver makes the 10G network card enumerate.

Is it possible to test the same 10G card on an NVIDIA devkit instead of your board?

Oh, sorry, I remember now: you are the one without a devkit.

If rel-35 has no hang problem, then why not just use bind and unbind to test?

Then you wouldn’t need the USB stick at all.
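
For reference, here is a minimal C sketch of that rebind, assuming the platform driver name is tegra194-pcie as the kernel log prefix suggests (echoing the device name into the same two sysfs files from a shell works just as well):

#include <stdio.h>

/* Write a string to a sysfs file; returns 0 on success. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Detach the controller from its driver, then reattach it so probe
	 * and link training run again. Needs root; the driver and device
	 * names are taken from the log prefix "tegra194-pcie 14160000.pcie". */
	if (write_str("/sys/bus/platform/drivers/tegra194-pcie/unbind",
		      "14160000.pcie"))
		return 1;
	return write_str("/sys/bus/platform/drivers/tegra194-pcie/bind",
			 "14160000.pcie") ? 1 : 0;
}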

OK, I’ll try that. Thanks for your patient reply.

A new phenomenon? The device was found again after the “Phy link never came up” print:

[    5.928579] tegra194-pcie 14160000.pcie: Adding to iommu group 7
[    5.940982] tegra194-pcie 14160000.pcie: Using GICv2m MSI allocator
[    7.631488] tegra194-pcie 14160000.pcie: Using GICv2m MSI allocator
[    7.639410] tegra194-pcie 14160000.pcie: host bridge /pcie@14160000 ranges:
[    7.646580] tegra194-pcie 14160000.pcie:       IO 0x0036100000..0x00361fffff -> 0x0036100000
[    7.655263] tegra194-pcie 14160000.pcie:      MEM 0x2428000000..0x242fffffff -> 0x0040000000
[    7.663937] tegra194-pcie 14160000.pcie:      MEM 0x2140000000..0x2427ffffff -> 0x2140000000
[    8.779725] tegra194-pcie 14160000.pcie: Phy link never came up
[    8.785868] tegra194-pcie 14160000.pcie: PCI host bridge to bus 0004:00
[    8.792677] pci_bus 0004:00: root bus resource [bus 00-ff]
[    8.798316] pci_bus 0004:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x36100000-0x361fffff])
[    8.808410] pci_bus 0004:00: root bus resource [mem 0x2428000000-0x242fffffff] (bus address [0x40000000-0x47ffffff])
[    8.819232] pci_bus 0004:00: root bus resource [mem 0x2140000000-0x2427ffffff pref]
[    8.827148] pci 0004:00:00.0: [10de:229c] type 01 class 0x060400
[    8.833455] pci 0004:00:00.0: PME# supported from D0 D3hot
[    8.845912] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[    8.851303] pci 0004:00:00.0: Max Payload Size set to  256/ 256 (was  256), Max Read Rq  512
[    8.860210] pcieport 0004:00:00.0: Adding to iommu group 7
[    8.866045] pcieport 0004:00:00.0: PME: Signaling with IRQ 57
[    8.872453] pcieport 0004:00:00.0: AER: enabled with IRQ 57
[    8.875493] pcieport 0004:00:00.0: AER: Corrected error received: 0004:00:00.0
[    8.885631] pcieport 0004:00:00.0: AER: can't find device of ID0000
[    8.892863] pci_bus 0004:01: busn_res: [bus 01-ff] is released
[    8.898975] pci 0004:00:00.0: Removing from iommu group 7
[    8.904539] pci_bus 0004:00: busn_res: [bus 00-ff] is released

Here is the full log:
the_ng.log (94.7 KB)


Besides, I compared the kernel output between the two boots. Before the “Phy link never came up” message, there is this difference (left is the OK boot, right is the NG boot):

It seems the OK boot contains the following log line while the NG boot does not:

dce: dce_admin_ipc_handle_signal:90   Spurious signal on channel: [1]. Ignored...

And here are the two logs:
dmesg_ok.log (66.7 KB)
dmesg_ng.log (70.0 KB)

I see. DCE means Display Controller Engine; it probably has little to do with this.

Hi WayneWWW. After rechecking the release note of the AQC113C, we may have found the cause of the problem. As the release note says, we need to switch to the new method:

original method:

  1. No Link
  2. Train to Gen 1
  3. Immediately speed change and train to Gen 3 (usually fails at this location)
  4. Immediately speed change and train to Gen 4

new method:

  1. No Link
  2. Set host Train to Gen 1
  3. Confirm Gen 1 linked up
  4. Proceed to link to Gen 3
  5. Proceed to speed change and link to Gen 4

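For illustration, this is roughly how we understand the staged sequence, sketched with the kernel’s generic PCIe capability helpers against an already-enumerated root port. It is only our reading of the release note, not the actual tegra194/DesignWare implementation:

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/pci.h>

/* Poll Link Status until the link reports at least the wanted speed and
 * training has finished; give up after roughly 200 ms. */
static bool link_settled(struct pci_dev *rp, u16 speed)
{
	u16 lnksta = 0;
	int i;

	for (i = 0; i < 100; i++) {
		pcie_capability_read_word(rp, PCI_EXP_LNKSTA, &lnksta);
		if ((lnksta & PCI_EXP_LNKSTA_CLS) >= speed &&
		    !(lnksta & PCI_EXP_LNKSTA_LT))
			return true;
		usleep_range(1000, 2000);
	}
	return false;
}

static int staged_bringup(struct pci_dev *rp)
{
	static const u16 speeds[] = {
		PCI_EXP_LNKCTL2_TLS_2_5GT,	/* step 2: Gen 1 */
		PCI_EXP_LNKCTL2_TLS_8_0GT,	/* step 4: Gen 3 */
		PCI_EXP_LNKCTL2_TLS_16_0GT,	/* step 5: Gen 4 */
	};
	int i;

	for (i = 0; i < ARRAY_SIZE(speeds); i++) {
		/* Set the Target Link Speed field in Link Control 2, then
		 * request retraining through Link Control. */
		pcie_capability_clear_and_set_word(rp, PCI_EXP_LNKCTL2,
						   PCI_EXP_LNKCTL2_TLS,
						   speeds[i]);
		pcie_capability_set_word(rp, PCI_EXP_LNKCTL,
					 PCI_EXP_LNKCTL_RL);

		/* Step 3: confirm each speed linked up before going faster. */
		if (!link_settled(rp, speeds[i]))
			return -ETIMEDOUT;
	}
	return 0;
}
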
We checked the device tree documentation and found a property called nvidia,init-link-speed which may do this, but we could not find where it is used. We also tried setting it to <0x1>, but it did not take effect.

Questions:

  1. Does the Tegra PCIe driver use the new method?
  2. If the answer to Q1 is no, where is nvidia,init-link-speed implemented, if it is implemented at all?
  3. Could you help us solve this? If doing it all in the kernel is complicated, could we bring the link up at Gen 1 during boot and then raise it to Gen 4 from user space through sysfs? (A sketch of what we mean follows below.)
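
To make question 3 concrete, here is a hypothetical, untested userspace sketch of what we mean: locate the root port’s PCI Express capability through its sysfs config file (the 0004:00:00.0 address comes from the logs above), write Gen 4 into the Target Link Speed field of Link Control 2, and set the Retrain Link bit. Needs root:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define CFG "/sys/bus/pci/devices/0004:00:00.0/config"

static uint16_t rd16(int fd, off_t off)
{
	uint16_t v = 0;

	pread(fd, &v, sizeof(v), off);
	return v;
}

static void wr16(int fd, off_t off, uint16_t v)
{
	pwrite(fd, &v, sizeof(v), off);
}

int main(void)
{
	uint8_t cap = 0;
	int fd = open(CFG, O_RDWR);

	if (fd < 0) {
		perror(CFG);
		return 1;
	}

	/* Walk the capability list (pointer at 0x34) to find the
	 * PCI Express capability, ID 0x10. */
	pread(fd, &cap, 1, 0x34);
	while (cap) {
		uint8_t id = 0;

		pread(fd, &id, 1, cap);
		if (id == 0x10)
			break;
		pread(fd, &cap, 1, cap + 1);
	}
	if (!cap)
		return 1;

	/* Link Control 2 at cap + 0x30: Target Link Speed = 4 (Gen 4). */
	wr16(fd, cap + 0x30, (rd16(fd, cap + 0x30) & ~0xf) | 0x4);

	/* Link Control at cap + 0x10: set the Retrain Link bit (0x20). */
	wr16(fd, cap + 0x10, rd16(fd, cap + 0x10) | 0x20);

	close(fd);
	return 0;
}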

Could you try to set nvidia,init-speed = 0x1?

With nvidia,init-speed = 0x1; I failed to build the dtb, so I changed it to nvidia,init-speed = <0x1>;. But it does not help with enumerating the PCIe device.

Will max-link-speed = <1> give you the enumeration?

Yes; across about 700 reboots, enumeration succeeded every time.