Root-endpoint PCIe communication between Xaviers with a bridge in the way

Hi,
I’ve set up one Xavier as a root and a separate Xavier as an endpoint. I’m connecting the two of them via iPass PCIe cards.
This is the endpoint card: https://www.onestopsystems.com/product/pcie-x8-gen3-dual-port-cable-adapter?no_cache=1610007541
This is the root card: EPCIE4XRDCA01 (External PCI Express x4 Cable Adapter Card with Gen 1/2/3 Redriver)
My L4T version is 32.2.3.

The endpoint Xavier was set up the following way:

  1. Enabled bit 12 of ODMDATA in p2972-0000.conf.common (0x09190000 to 0x09191000)
  2. ./flash.sh jetson-xavier mmcblk0p1

cd /sys/kernel/config/pci_ep/
mkdir functions/pci_epf_nv_test/func1
echo 0x10de > functions/pci_epf_nv_test/func1/vendorid
echo 0x0001 > functions/pci_epf_nv_test/func1/deviceid
ln -s functions/pci_epf_nv_test/func1 controllers/141a0000.pcie_ep/
echo 1 > controllers/141a0000.pcie_ep/start

On the endpoint Xavier, I can use dmesg | grep pci_epf_nv_test to get the RAM address, and read/write to it from the endpoint using busybox devmem.
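For anyone following along, the lookup described above can be sketched roughly as below. The helper name `bar0_phys` is made up for illustration, and it assumes the driver logs the address in the `BAR0 RAM phys: 0x...` form that appears in the endpoint log later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent "BAR0 RAM phys" address found in
# kernel log text on stdin. Assumes the pci_epf_nv_test log line format
# "BAR0 RAM phys: 0x<hex>".
bar0_phys() {
    sed -n 's/.*BAR0 RAM phys: \(0x[0-9a-fA-F]*\).*/\1/p' | tail -n 1
}

# Usage on the endpoint Xavier (left commented out because it touches physical memory):
#   addr=$(dmesg | bar0_phys)
#   sudo busybox devmem "$addr" 32              # read one 32-bit word
#   sudo busybox devmem "$addr" 32 0xdeadbeef   # write one 32-bit word
```

Taking the last match means that re-creating the endpoint function picks up the newest address rather than a stale one.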

However, the root Xavier doesn’t appear to have access to the endpoint Xavier. Running lspci -v shows several PCIe bridges, but not the entry I expected ("RAM memory: NVIDIA Corporation Device 0001"). Reading the memory at 0x1f40400000 always returns 0x874910B5, rather than the values I set on the endpoint. Is there a way to gain visibility past these bridges?

Below is the host’s lspci -v output:

0001:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad2 (rev a1) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 34
        Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
        I/O behind bridge: 00000000-00000fff
        Memory behind bridge: 40000000-400fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Root Port (Slot-), MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] #19
        Capabilities: [158] #26
        Capabilities: [17c] #27
        Capabilities: [190] L1 PM Substates
        Capabilities: [1a0] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
        Capabilities: [2a0] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [2d8] #25
        Capabilities: [2e4] Precision Time Measurement
        Capabilities: [2f0] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
        Kernel driver in use: pcieport

0001:01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9171 (rev 13) (prog-if 01 [AHCI 1.0])
        Subsystem: Marvell Technology Group Ltd. Device 9171
        Flags: bus master, fast devsel, latency 0, IRQ 563
        I/O ports at 100010 [size=8]
        I/O ports at 100020 [size=4]
        I/O ports at 100018 [size=8]
        I/O ports at 100024 [size=4]
        I/O ports at 100000 [size=16]
        Memory at 1230010000 (32-bit, non-prefetchable) [size=512]
        Expansion ROM at 1230000000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [70] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: ahci

0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 38
        Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
        I/O behind bridge: 00000000-00001fff
        Memory behind bridge: 40000000-404fffff
        Prefetchable memory behind bridge: 0000001c00000000-0000001c003fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [70] Express Root Port (Slot-), MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] #19
        Capabilities: [168] #26
        Capabilities: [190] #27
        Capabilities: [1c0] L1 PM Substates
        Capabilities: [1d0] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
        Capabilities: [2d0] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
        Capabilities: [308] #25
        Capabilities: [314] Precision Time Measurement
        Capabilities: [320] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
        Kernel driver in use: pcieport

0005:01:00.0 PCI bridge: PLX Technology, Inc. Device 8749 (rev ca) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 819
        Memory at 1f40400000 (32-bit, non-prefetchable) [size=256K]
        Bus: primary=01, secondary=02, subordinate=04, sec-latency=0
        I/O behind bridge: 3a100000-3a101fff
        Memory behind bridge: 40000000-403fffff
        Prefetchable memory behind bridge: 0000001c00000000-0000001c003fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
        Capabilities: [68] Express Upstream Port, MSI 00
        Capabilities: [a4] Subsystem: One Stop Systems, Inc. Device 2302
        Capabilities: [100] Device Serial Number 00-a0-d6-ff-ff-04-84-66
        Capabilities: [fb4] Advanced Error Reporting
        Capabilities: [138] Power Budgeting <?>
        Capabilities: [10c] #19
        Capabilities: [148] Virtual Channel
        Capabilities: [e00] #12
        Capabilities: [b00] Latency Tolerance Reporting
        Capabilities: [b70] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?>
        Kernel driver in use: pcieport

0005:02:00.0 PCI bridge: PLX Technology, Inc. Device 8749 (rev ca) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 820
        Bus: primary=02, secondary=03, subordinate=03, sec-latency=0
        I/O behind bridge: 3a100000-3a100fff
        Memory behind bridge: 40000000-401fffff
        Prefetchable memory behind bridge: 0000001c00000000-0000001c001fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
        Capabilities: [68] Express Downstream Port (Slot+), MSI 00
        Capabilities: [a4] Subsystem: One Stop Systems, Inc. Device 2302
        Capabilities: [100] Device Serial Number 00-a0-d6-ff-ff-04-84-66
        Capabilities: [fb4] Advanced Error Reporting
        Capabilities: [138] Power Budgeting <?>
        Capabilities: [10c] #19
        Capabilities: [148] Virtual Channel
        Capabilities: [e00] #12
        Capabilities: [f24] Access Control Services
        Capabilities: [b70] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?>
        Kernel driver in use: pcieport

0005:02:08.0 PCI bridge: PLX Technology, Inc. Device 8749 (rev ca) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 821
        Bus: primary=02, secondary=04, subordinate=04, sec-latency=0
        I/O behind bridge: 3a101000-3a101fff
        Memory behind bridge: 40200000-403fffff
        Prefetchable memory behind bridge: 0000001c00200000-0000001c003fffff
        Capabilities: [40] Power Management version 3
        Capabilities: [48] MSI: Enable+ Count=1/8 Maskable+ 64bit+
        Capabilities: [68] Express Downstream Port (Slot+), MSI 00
        Capabilities: [a4] Subsystem: One Stop Systems, Inc. Device 2302
        Capabilities: [100] Device Serial Number 00-a0-d6-ff-ff-04-84-66
        Capabilities: [fb4] Advanced Error Reporting
        Capabilities: [138] Power Budgeting <?>
        Capabilities: [10c] #19
        Capabilities: [148] Virtual Channel
        Capabilities: [e00] #12
        Capabilities: [f24] Access Control Services
        Capabilities: [b70] Vendor Specific Information: ID=0001 Rev=0 Len=010 <?>
        Kernel driver in use: pcieport

I think the fundamental issue here is that your Xavier endpoint is not being detected at all (the attached lspci output confirms this). Memory accesses can only work once the endpoint has been enumerated.
A couple of questions:

  • Have you tried connecting the Xavier-EP directly to the Xavier-RP and confirmed that things are fine?
  • Could you please confirm that these iPass cables also carry the REFCLK and PERST# signals along with the Tx/Rx signals?
  • Since there is a PCIe switch also introduced in between, has the switch been configured to propagate REFCLK and PERST# signals that it has received from the host?
  • Could you please check if there is any increment in the interrupt count for ‘pex_rst’ under /proc/interrupts ? This interrupt represents PERST# signal’s transition from assert-to-deassert.
  1. We don’t have the required cable to connect the Xaviers directly. Do you think the R88SS on this page would be sufficient? (R88 PCIe x8 Jumpers Cable) It seems that another user on this forum had some success: "xavier pcie endpoint mode", post #9 by arunas.salkauskas.
  2. I could not find a primary source stating that iPass carries REFCLK and PERST#. However, this (admittedly different) cable carries PERST# and more (http://cdn.teledynelecroy.com/files/manuals/ipass-connector-to-summit-t4-cable-quick-start-guide.pdf), and this product page lists the pins for our type of cable, including CPERST#, CREFCLK+ and CREFCLK- (External PCIe x4 Cable, iPass 38pin compatible, CB-00485 / CB-00518).
  3. The PCIe switch is part of the endpoint’s PCIe card, and I have configured it into its “target mode”. I assume this means that it should propagate REFCLK and PERST#, but I don’t know for sure.
  4. pex_rst was initially 0, but after booting the host Xavier it incremented to 1. Does this imply that the PERST# signal is correctly propagating from the Xavier-RP to the Xavier-EP?
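The pex_rst check from the list above can be repeated with something like the following sketch. The function name `pex_rst_count` is hypothetical, and the parsing assumes the usual /proc/interrupts row layout (IRQ number, one count per CPU, then the controller and interrupt names).

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU counts of every /proc/interrupts row
# that mentions pex_rst. The optional file argument exists only so the parsing
# can be tried on a saved copy of /proc/interrupts.
pex_rst_count() {
    awk '/pex_rst/ {
        # Field 1 is the IRQ number; the per-CPU counts follow until the
        # first non-numeric field (the interrupt controller name).
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^[0-9]+$/) total += $i; else break
        }
    }
    END { print total + 0 }' "${1:-/proc/interrupts}"
}

# Usage on the endpoint Xavier: run once before and once after the root
# Xavier boots, then compare the two numbers.
#   pex_rst_count
```

An increase between the two readings corresponds to a PERST# assert-to-deassert transition arriving at the endpoint.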

Are there any other actions I can take to attempt to debug this issue? Perhaps some way to verify that REFCLK is being propagated as well, in case the presence of PERST# doesn’t necessarily imply that REFCLK is already present?

Regarding the R88SS: I think it should work. I’m banking on the line " * Application: Signal swap" to mean that Tx and Rx are swapped, which is what is required.

Regarding the switch’s “target mode”: I think we need to get confirmation on this part for sure.

Regarding pex_rst: Yes. That rules out issues with PERST#. Now we need to check whether REFCLK is reaching the Xavier-EP too. BTW, do you see any error prints on the Xavier-EP? I expect some errors because, upon observing the PERST# interrupt, the Xavier-EP goes ahead and tries to configure itself for EP functionality, but if REFCLK is not available (assuming that is the issue here), it should throw an error. So it would be good to check the Xavier-EP log for clues.

I’ve attached dmesg output for both the root and endpoint. The only error I see on the endpoint is the following, although I’m not sure if it’s of any significance:
[ 7.839838] tegra-pcie-dw 141a0000.pcie_ep: invalid max-speed (err=-22), set to Gen-1
Does an invalid max-speed indicate an issue with REFCLK?
dmesg_ep.txt (65.5 KB) dmesg_rp.txt (69.9 KB)

I do see the following prints from the EP system, which means that it must be receiving REFCLK from the RP system as well.

[  178.914841] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x4108fe000
[  178.914870] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[  178.914933] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0xffffff800a435000
[  244.643270] tegra-pcie-dw 141a0000.pcie_ep: EP init done

Well, now I have to suspect the Tx/Rx wiring, which I hadn’t doubted until now.
So, how exactly is the EP system connected behind the switch?
Could you please give me the output of ‘sudo lspci -vv’? I’m wondering whether the PCIe link between the Xavier-EP and one of the switch’s downstream ports came up at a later point.

This is how the system is connected up:

This is what sudo lspci -vv returns on both Xavier-RP and Xavier-EP. There’s plenty to see on Xavier-RP, but apparently not much besides the SATA controller on Xavier-EP (Expected or unexpected?)
lspci_rp.txt (22.2 KB) lspci_ep.txt (6.5 KB)

For the Xavier-RP, there are three "PCI bridge: PLX Technology" entries, which I’m assuming are the Xavier-RP’s view of the OSS-PCIE-HIB38-X8-DUAL.

Expected. lspci only lists the controllers in the system that are operating in root port mode, plus the hierarchy that gets enumerated under those root ports. Controllers operating in endpoint mode in the same system (which are not hierarchically under one of its own root ports) are not listed.

As I see from the root port’s lspci output, the PCIe switch has two downstream ports (0005:02:00.0 and 0005:02:08.0), and both of them show DLActive-, which means that the link did not come up at a later point in time either.
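The DLActive check described above could be scripted along these lines. This is only a sketch: `link_state` is a made-up helper name, and it assumes the `LnkSta:` line format that `lspci -vv` prints for PCIe link status.

```shell
#!/bin/sh
# Hypothetical helper: for `lspci -vv` text on stdin, print each device address
# together with the DLActive flag from its LnkSta line. Devices whose LnkSta
# line has no DLActive field are skipped.
link_state() {
    awk '/^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-9a-f]/ { dev = $1 }
         /LnkSta:/ {
             if (match($0, /DLActive[+-]/))
                 print dev, substr($0, RSTART, RLENGTH)
         }'
}

# Usage on the root Xavier (root is needed to read the link status registers):
#   sudo lspci -vv | link_state
```

A downstream port that reports DLActive+ has an active data link layer towards whatever sits below it; DLActive- on both switch downstream ports matches the situation described here.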

BTW, how are the OSS-PCIE-HIB38-X8-DUAL and the Xavier-EP connected? Since the OSS-PCIE-HIB38-X8-DUAL is a PCIe switch with downstream ports, I expect it to have female ports (just like Xavier AGX systems have), so how are that female port and the female port of the Xavier-EP connected to each other? The reason I ask is that if the cable used there doesn’t swap Tx and Rx (i.e. the switch female port’s Tx should be connected to the Xavier-EP’s Rx and vice versa), then the link doesn’t come up.

The OSS-PCIE-HIB38-X8-DUAL is installed in the PCIe slot of the Xavier-EP (as it is a dev kit). Here’s a product photo, as well as an annotated diagram:
[image: product photo and annotated diagram]

Would you suggest swapping the TX and RX pins between the OSS-PCIE-HIB38-X8-DUAL and Xavier-EP, perhaps by using an R88NF?

Worth a try. I can’t think of any other cause for the PCIe link-up issue at this point.