TX2 PCIe does not detect endpoint

Hello,

We have a carrier board with two PCIe slots. An Ethernet controller card does not work in one slot but works in the other, while a different type of PCIe card does work in the slot where the Ethernet card fails. We’re using a TX2 with L4T 28.1. Customers are asking us to get this solved. We saw the same issue on the TX1 and expected the TX2 to resolve it, but the issue remains. I’m not sure how to proceed.

Following is the PCI output from dmesg for the working slot:

[ 0.232584] GPIO line 459 (pcie-lane2-mux) hogged as output/low
[ 0.235857] iommu: Adding device 10003000.pcie-controller to group 50
[ 0.368574] PCI: CLS 0 bytes, default 128
[ 13.263382] tegra-pcie 10003000.pcie-controller: wrong configuration updated in DT, switching to default 2x1, 1x1, 1x1 configuration
[ 13.264261] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[ 13.264736] tegra-pcie 10003000.pcie-controller: probing port 0, using 2 lanes
[ 13.266923] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[ 13.716557] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 14.120562] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 14.523523] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 14.525545] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[ 14.525827] tegra-pcie 10003000.pcie-controller: PCI host bridge to bus 0000:00
[ 14.525830] pci_bus 0000:00: root bus resource [mem 0x50100000-0x57ffffff]
[ 14.525833] pci_bus 0000:00: root bus resource [mem 0x58000000-0x7fffffff pref]
[ 14.525835] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 14.525837] pci_bus 0000:00: root bus resource [io 0x1000-0xffff]
[ 14.525859] pci 0000:00:01.0: [10de:10e5] type 01 class 0x060400
[ 14.525947] pci 0000:00:01.0: PME# supported from D0 D1 D2 D3hot D3cold
[ 14.526163] pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 14.526302] pci 0000:01:00.0: [8086:10d3] type 00 class 0x020000
[ 14.526367] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0001ffff]
[ 14.526394] pci 0000:01:00.0: reg 0x18: [io 0x0000-0x001f]
[ 14.526408] pci 0000:01:00.0: reg 0x1c: [mem 0x00000000-0x00003fff]
[ 14.526536] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[ 14.532584] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[ 14.532652] pci 0000:00:01.0: BAR 8: assigned [mem 0x50100000-0x501fffff]
[ 14.532655] pci 0000:00:01.0: BAR 7: assigned [io 0x1000-0x1fff]
[ 14.532658] pci 0000:01:00.0: BAR 0: assigned [mem 0x50100000-0x5011ffff]
[ 14.532667] pci 0000:01:00.0: BAR 3: assigned [mem 0x50120000-0x50123fff]
[ 14.532676] pci 0000:01:00.0: BAR 2: assigned [io 0x1000-0x101f]
[ 14.532684] pci 0000:00:01.0: PCI bridge to [bus 01]
[ 14.532687] pci 0000:00:01.0: bridge window [io 0x1000-0x1fff]
[ 14.532693] pci 0000:00:01.0: bridge window [mem 0x50100000-0x501fffff]
[ 14.532755] pcieport 0000:00:01.0: enabling device (0000 -> 0003)
[ 14.532843] pcieport 0000:00:01.0: Signaling PME through PCIe PME interrupt
[ 14.532845] pci 0000:01:00.0: Signaling PME through PCIe PME interrupt
[ 14.532850] pcie_pme 0000:00:01.0:pcie01: service driver pcie_pme loaded
[ 14.532918] aer 0000:00:01.0:pcie02: service driver aer loaded
[ 14.646984] e1000e 0000:01:00.0 eth1: (PCI Express:2.5GT/s:Width x1) 00:0c:8b:53:03:ed

Following is the output of lspci -xx

00:01.0 PCI bridge: NVIDIA Corporation Device 10e5 (rev a1)
00: de 10 e5 10 07 00 10 00 a1 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 11 11 00 00
20: 10 50 10 50 f1 ff 01 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 84 01 00 00

01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
00: 86 80 d3 10 06 04 10 00 00 00 00 02 10 00 00 00
10: 00 00 10 50 00 00 00 00 01 10 00 00 00 00 12 50
20: 00 00 00 00 00 00 00 00 00 00 00 00 86 80 00 00
30: 00 00 00 00 c8 00 00 00 00 00 00 00 84 01 00 00


Following is the PCI output from dmesg for the non-working slot:

[ 0.233030] GPIO line 459 (pcie-lane2-mux) hogged as output/low
[ 0.236289] iommu: Adding device 10003000.pcie-controller to group 50
[ 0.367838] PCI: CLS 0 bytes, default 128
[ 13.323284] tegra-pcie 10003000.pcie-controller: wrong configuration updated in DT, switching to default 2x1, 1x1, 1x1 configuration
[ 13.324992] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[ 13.325416] tegra-pcie 10003000.pcie-controller: probing port 0, using 2 lanes
[ 13.332821] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[ 13.770097] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 14.172081] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 14.572143] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 14.574158] tegra-pcie 10003000.pcie-controller: link 0 down, ignoring
[ 14.980073] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 15.383430] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 15.787955] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 15.789975] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[ 15.789983] tegra-pcie 10003000.pcie-controller: PCIE: no end points detected

What could be the issue? How should we move forward?

Thanks,
Hakim

You might want to post the “sudo lspci -vvv” output with the card in a working slot. You can use the “-s” option to limit output to that slot, e.g., if regular lspci shows “00:00.0”, then “sudo lspci -s 00:00.0 -vvv”.

This will show the card’s capabilities, and will also show whether the speed was throttled back. I see “PCI Express:2.5GT/s:Width x1” in there, but this only says what the speed is…not what the card is capable of. Knowing whether it throttled back (the capability might be 5GT/s) might be a clue to signal quality. If the card does not show up at all in a slot, it might be disappearing there because signal quality cannot sustain even 2.5GT/s.

This is the output of lspci -vvv in the working slot:

01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: Intel Corporation 82574L Gigabit Network Connection
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 388
Region 0: Memory at 50100000 (32-bit, non-prefetchable)
Region 2: I/O ports at 1000 [disabled]
Region 3: Memory at 50120000 (32-bit, non-prefetchable)
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [a0] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-0c-8b-ff-ff-53-03-ed
Kernel driver in use: e1000e
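
For anyone comparing several cards or slots, the throttling check suggested earlier (link capability versus negotiated link status) can be scripted. The following is only a minimal sketch: it assumes the LnkCap and LnkSta lines look like the ones in the output above, and the function names `link_speeds` and `is_throttled` are hypothetical helpers, not part of any tool.

```python
import re

# Sample lines copied from the lspci -vvv output above.
SAMPLE = """\
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <128ns, L1 <64us
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""


def link_speeds(lspci_text):
    """Return (capability, status), each a (speed_GT_s, width) tuple or None."""
    pattern = r"{}:.*?Speed ([\d.]+)GT/s.*?Width x(\d+)"
    found = []
    for field in ("LnkCap", "LnkSta"):
        m = re.search(pattern.format(field), lspci_text)
        found.append((float(m.group(1)), int(m.group(2))) if m else None)
    return tuple(found)


def is_throttled(lspci_text):
    """True if the negotiated speed or width is below the advertised capability."""
    cap, sta = link_speeds(lspci_text)
    if cap is None or sta is None:
        return None  # fields missing, nothing to compare
    return sta[0] < cap[0] or sta[1] < cap[1]


if __name__ == "__main__":
    print(link_speeds(SAMPLE))
    print(is_throttled(SAMPLE))
```

For the output above, capability and status are both 2.5GT/s x1, so by this check the card is not throttled in the working slot.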

What kind of carrier board is it? Looking at the log, root port 0 is not able to enumerate the connected endpoint device.
How many lanes are routed to the slot that is controlled by root port 0?
In the working case, I see that the Intel NIC got enumerated on root port 0. Which card did you connect in the non-working case? Is it a x1 card?
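
To see at a glance which root ports gave up on link training in each log, something like the following Python sketch could summarise the tegra-pcie messages. It is only an illustration: the message formats are taken from the dmesg excerpts earlier in this thread, `summarize_links` is a hypothetical name, and it assumes a port whose log has no "down, ignoring" line came up.

```python
import re

# Excerpts of the tegra-pcie lines from the two dmesg logs in this thread.
WORKING = """\
tegra-pcie 10003000.pcie-controller: probing port 0, using 2 lanes
tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
tegra-pcie 10003000.pcie-controller: link 2 down, retrying
tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
"""

NON_WORKING = """\
tegra-pcie 10003000.pcie-controller: probing port 0, using 2 lanes
tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
tegra-pcie 10003000.pcie-controller: link 0 down, retrying
tegra-pcie 10003000.pcie-controller: link 0 down, ignoring
tegra-pcie 10003000.pcie-controller: link 2 down, retrying
tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
"""


def summarize_links(dmesg_text):
    """Map each probed port to its lane count and whether its link came up."""
    probed = {}      # port index -> lanes the driver says it is using
    ignored = set()  # ports the driver gave up on
    for line in dmesg_text.splitlines():
        m = re.search(r"probing port (\d+), using (\d+) lanes", line)
        if m:
            probed[int(m.group(1))] = int(m.group(2))
            continue
        m = re.search(r"link (\d+) down, ignoring", line)
        if m:
            ignored.add(int(m.group(1)))
    # Assumption: no "down, ignoring" message means the link trained.
    return {
        port: {"lanes": lanes, "link_up": port not in ignored}
        for port, lanes in probed.items()
    }


if __name__ == "__main__":
    print("working:    ", summarize_links(WORKING))
    print("non-working:", summarize_links(NON_WORKING))
```

Run on the two excerpts, this reports port 0 up and port 2 down in the working case, and both ports down in the non-working case.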

It is an Elroy

1 is routed to root port 0, 0 to port 1 and 1 to port 2

It’s the same card that is being tried in both slots. Yes, it is a x1 card.

Can you please elaborate more on “1 is routed to root port 0, 0 to port 1 and 1 to port 2” ?

We’re using Config #3.

Following is the DT:

pcie-controller@10003000 {
    status = "okay";
    pci@1,0 {
        nvidia,num-lanes = <1>;
        status = "okay";
    };
    pci@2,0 {
        nvidia,num-lanes = <0>;
        status = "disabled";
    };
    pci@3,0 {
        nvidia,num-lanes = <1>;
        status = "okay";
    };
};
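
Since the dmesg logs report "wrong configuration updated in DT" and then probe port 0 with 2 lanes, it may be worth comparing the lane counts in this DT against what the driver says it is using. The sketch below is a rough illustration only: `dt_lanes` and `lane_mismatches` are hypothetical helpers, the regex parsing is not a real device-tree parser, and the mapping of node pci@N,0 to root port N-1 is an assumption based on port 0 showing up as 00:01.0 in the working log.

```python
import re

# The DT fragment from this thread, with plain ASCII quotes.
DT = """
pcie-controller@10003000 {
    status = "okay";
    pci@1,0 {
        nvidia,num-lanes = <1>;
        status = "okay";
    };
    pci@2,0 {
        nvidia,num-lanes = <0>;
        status = "disabled";
    };
    pci@3,0 {
        nvidia,num-lanes = <1>;
        status = "okay";
    };
};
"""

# Lane counts the driver reported in dmesg ("probing port N, using M lanes").
PROBED = {0: 2, 2: 1}


def dt_lanes(dt_text):
    """Map root port index -> (num_lanes, status) for each pci@N,0 node."""
    ports = {}
    node = re.compile(r"pci@(\d+),0\s*\{(.*?)\}", re.S)
    for m in node.finditer(dt_text):
        body = m.group(2)
        lanes = re.search(r"nvidia,num-lanes\s*=\s*<(\d+)>", body)
        status = re.search(r'status\s*=\s*"(\w+)"', body)
        # Assumption: node pci@N,0 corresponds to root port N-1.
        ports[int(m.group(1)) - 1] = (
            int(lanes.group(1)) if lanes else None,
            status.group(1) if status else None,
        )
    return ports


def lane_mismatches(dt_ports, probed):
    """Return {port: (dt_lanes, probed_lanes)} where the two disagree."""
    return {
        port: (dt_ports[port][0], lanes)
        for port, lanes in probed.items()
        if port in dt_ports and dt_ports[port][0] != lanes
    }


if __name__ == "__main__":
    print(dt_lanes(DT))
    print(lane_mismatches(dt_lanes(DT), PROBED))
```

With the values from this thread, the only disagreement is on port 0: the DT asks for 1 lane while the driver reports using 2, which is consistent with the driver falling back to its default 2x1, 1x1, 1x1 configuration.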