Issue with PCIe communication between two Xavier boards (endpoint & root port system)

Hi,
I’m testing PCIe communication between two Xavier boards (one endpoint and one root port system), and I’m facing an issue.

Issue: Unable to access the root port system; the following error message is printed continuously on the root port side:

[ 83.280350] pcieport 0005:00:00.0: AER: Corrected error received: id=0000
[ 83.280375] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 83.280595] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00000001/0000e000
[ 83.280773] pcieport 0005:00:00.0: [ 0] Receiver Error (First)
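
To see what the `error status/mask` pair in the log means, the bits can be decoded by hand; the bit positions below are from the PCIe spec's Correctable Error Status register (a sketch, not tied to any Jetson-specific tool):

```shell
# Decode the correctable AER status/mask pair from the log above.
status=0x00000001
mask=0x0000e000
(( status & (1 << 0)  )) && echo "bit 0:  Receiver Error"
(( status & (1 << 6)  )) && echo "bit 6:  Bad TLP"
(( status & (1 << 7)  )) && echo "bit 7:  Bad DLLP"
(( status & (1 << 8)  )) && echo "bit 8:  REPLAY_NUM Rollover"
(( status & (1 << 12) )) && echo "bit 12: Replay Timer Timeout"
printf 'masked (not reported) bits: 0x%x\n' "$mask"
```

Only bit 0 (Receiver Error) is set, i.e. the link is seeing physical-layer receiver errors, which points at signal integrity.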

I followed the procedure from the link below to bring up PCIe endpoint mode on the Xavier board:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fxavier_PCIe_endpoint_mode.html%23wwpID0E0WD0HA

Step1: I flashed R32.4.3 with ODMDATA=0x09191000 on one Xavier board for the PCIe endpoint system
Step2: On the other Xavier board, I flashed the same R32.4.3 with ODMDATA=0x09190000 for the PCIe root port system
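
As a side note, the two ODMDATA values above differ in exactly one bit, which is what distinguishes the endpoint configuration from the root-port configuration in the guide linked above:

```shell
# Show which bit differs between the endpoint and root-port ODMDATA values.
ep=0x09191000
rp=0x09190000
printf 'differing bits: 0x%x\n' $(( ep ^ rp ))   # bit 12 only
```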

Step3: Connected the PCIe cable between the two Xavier boards. In this cable TX & RX are swapped, and we removed 12V and 3.3V at one end.

Step4: Booted the endpoint Jetson system and checked that the clock mux selects NVHS_SLVS_REFCLK_P/N on the endpoint system

root@nvep-desktop:/home/nvep# grep 253 /sys/kernel/debug/gpio
gpio-253 ( |pex-refclk-sel-high ) out hi

Step5: Ran the following commands to enable PCIe endpoint mode

root@nvep-desktop:/home/nvep# cd /sys/kernel/config/pci_ep/
root@nvep-desktop:/sys/kernel/config/pci_ep# mkdir functions/pci_epf_nv_test/func1
root@nvep-desktop:/sys/kernel/config/pci_ep# echo 0x10de > functions/pci_epf_nv_test/func1/vendorid
root@nvep-desktop:/sys/kernel/config/pci_ep# echo 0x0001 > functions/pci_epf_nv_test/func1/deviceid
root@nvep-desktop:/sys/kernel/config/pci_ep# ln -s functions/pci_epf_nv_test/func1 controllers/141a0000.pcie_ep/
root@nvep-desktop:/sys/kernel/config/pci_ep# echo 1 > controllers/141a0000.pcie_ep/start

Step6: Booted the root port Jetson system; the following error is printed continuously, and only sometimes are we able to access the device through ssh

[ 83.280350] pcieport 0005:00:00.0: AER: Corrected error received: id=0000
[ 83.280375] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 83.280595] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00000001/0000e000
[ 83.280773] pcieport 0005:00:00.0: [ 0] Receiver Error (First)
[ 83.371942] pcieport 0005:00:00.0: AER: Corrected error received: id=0000
[ 83.371963] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 83.372183] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00000001/0000e000
[ 83.372362] pcieport 0005:00:00.0: [ 0] Receiver Error (First)
[ 83.382454] pcieport 0005:00:00.0: AER: Corrected error received: id=0000
[ 83.382473] pcieport 0005:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0000(Receiver ID)
[ 83.382685] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00000001/0000e000
[ 83.382848] pcieport 0005:00:00.0: [ 0] Receiver Error (First)

root@nvrp-desktop:/home/nvrp# setpci -s 0005:01:00.0 COMMAND=0x02
root@nvrp-desktop:/home/nvrp# lspci -v

0001:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad2 (rev a1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 35
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 00000000-00000fff
Memory behind bridge: 40000000-400fffff
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Root Port (Slot-), MSI 00
Capabilities: [b0] MSI-X: Enable- Count=1 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] #19
Capabilities: [158] #26
Capabilities: [17c] #27
Capabilities: [190] L1 PM Substates
Capabilities: [1a0] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2a0] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [2d8] #25
Capabilities: [2e4] Precision Time Measurement
Capabilities: [2f0] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport

0001:01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9171 (rev 13) (prog-if 01 [AHCI 1.0])
Subsystem: Marvell Technology Group Ltd. Device 9171
Flags: bus master, fast devsel, latency 0, IRQ 564
I/O ports at 100010 [size=8]
I/O ports at 100020 [size=4]
I/O ports at 100018 [size=8]
I/O ports at 100024 [size=4]
I/O ports at 100000 [size=16]
Memory at 1230010000 (32-bit, non-prefetchable) [size=512]
Expansion ROM at 1230000000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
Capabilities: [70] Express Legacy Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Kernel driver in use: ahci

0005:00:00.0 PCI bridge: NVIDIA Corporation Device 1ad0 (rev a1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 39
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
Memory behind bridge: 40000000-401fffff
Prefetchable memory behind bridge: 0000001c00000000-0000001c000fffff
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] Express Root Port (Slot-), MSI 00
Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] #19
Capabilities: [168] #26
Capabilities: [190] #27
Capabilities: [1c0] L1 PM Substates
Capabilities: [1d0] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2d0] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [308] #25
Capabilities: [314] Precision Time Measurement
Capabilities: [320] Vendor Specific Information: ID=0004 Rev=1 Len=054 <?>
Kernel driver in use: pcieport

0005:01:00.0 RAM memory: NVIDIA Corporation Device 0001
Flags: fast devsel, IRQ 255
Memory at 1f40100000 (32-bit, non-prefetchable) [disabled] [size=64K]
Memory at 1c00000000 (64-bit, prefetchable) [disabled] [size=128K]
Memory at 1f40000000 (64-bit, non-prefetchable) [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit-
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable- Count=8 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [148] #19
Capabilities: [168] #26
Capabilities: [190] #27
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [1c0] L1 PM Substates
Capabilities: [1d0] Vendor Specific Information: ID=0002 Rev=4 Len=100 <?>
Capabilities: [2d0] Vendor Specific Information: ID=0001 Rev=1 Len=038 <?>
Capabilities: [308] #25
Capabilities: [314] Precision Time Measurement
Capabilities: [320] Vendor Specific Information: ID=0003 Rev=1 Len=054 <?>
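
The `setpci -s 0005:01:00.0 COMMAND=0x02` command above writes only one bit of the endpoint function's PCI command register; the bit meanings below are from the PCI spec:

```shell
# Decode the value written to the PCI COMMAND register with setpci above.
cmd=0x0002
(( cmd & 0x1 )) && echo "bit 0: I/O Space Enable"
(( cmd & 0x2 )) && echo "bit 1: Memory Space Enable"
(( cmd & 0x4 )) && echo "bit 2: Bus Master Enable"
# To read the register back on the root port (hardware-dependent):
#   setpci -s 0005:01:00.0 COMMAND
true
```

So this enables memory-space decoding for the endpoint's BARs; note that Bus Master Enable (bit 2) is left clear, which is fine here because only the root port initiates the accesses.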

Step7: Writing to shared memory on the endpoint & reading it on the root port system, and vice versa

root@nvep-desktop:~# dmesg|grep pci_epf_nv_test

[ 46.499526] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x436614000
[ 46.499535] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[ 46.499559] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0xffffff8008007000

[Writing on the endpoint]
root@nvep-desktop:~# busybox devmem 0x436614000 32 0x12345678

[Reading on the root port]
root@nvrp-desktop:~# busybox devmem 0x1f40100000
0x12345678

[Writing on the root port]
root@nvrp-desktop:~# busybox devmem 0x1f40100004 32 0x09876543

[Reading on the endpoint]
root@nvep-desktop:~# busybox devmem 0x436614004
0x09876543

Note:

  • Enabled CONFIG_PCIEASPM_PERFORMANCE=y on both the Jetson endpoint & root port systems; the issue still occurs.

  • Only sometimes can we access the root port. The rest of the time we are unable to access it, and the error message just keeps printing continuously.

  • Tested two different PCIe cables and got the same result with both:
    1. removed only the 12V power
    2. removed both the 3.3V and 12V power

Hoping for your guidance.

Regards,
Bala

I think it is because of bad electricals. We can’t do much at this point other than using a better cable. Just to understand whether the bad electricals cause this issue only at higher speeds, you can reduce the speed one step at a time, i.e. try Gen-3, then Gen-2, then Gen-1.
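
One way to try this without reflashing (a hedged sketch; the register offsets are from the PCIe spec and the `VALUE:MASK` syntax is a standard pciutils feature, so only the named bits are touched) is to cap the target link speed in Link Control 2 on the root port and retrain:

```shell
# Hardware commands (run as root on the root port; commented out here):
#   setpci -s 0005:00:00.0 CAP_EXP+0x30.W=0x1:0xF    # target link speed: Gen-1
#   setpci -s 0005:00:00.0 CAP_EXP+0x10.W=0x20:0x20  # retrain link
# Then check the negotiated speed:
#   lspci -s 0005:00:00.0 -vv | grep LnkSta
# Gen <-> rate mapping from the PCIe spec:
for gen in 1 2 3 4; do
    case $gen in
        1) rate="2.5 GT/s" ;;
        2) rate="5 GT/s"   ;;
        3) rate="8 GT/s"   ;;
        4) rate="16 GT/s"  ;;
    esac
    echo "Gen-$gen = $rate"
done
```

If the Receiver Errors disappear at Gen-1 or Gen-2 but return at Gen-3/Gen-4, that would confirm the cable electricals as the cause.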

Hi,
Sorry for the late reply.

I can communicate between the two Xaviers only if I disable PCIe ASPM on both the EP and RP systems. I don’t know the reason; can you help me?
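
For reference, ASPM can usually be disabled globally with `pcie_aspm=off` on the kernel command line; at runtime the active policy (if the kernel exposes it) lives in sysfs. A read-only sketch for checking which policy is in effect:

```shell
# Inspect the current ASPM policy without modifying anything.
policy=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy" ]; then
    cat "$policy"   # the active policy is shown in [brackets]
else
    echo "pcie_aspm policy knob not available on this kernel"
fi
```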

Also, I can read/write data only up to 4 KB of memory from the root port side. If I write data beyond the 4 KB memory offset on the root port side, I get the following error log on the endpoint side:

[ 239.550580] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xffff2000, fsynr=0x80003, cb=0, sid=91(0x5b - PCIE5), pgd=43c40e003, pud=43c40e003, pmd=436f97003, pte=0

[ 304.996251] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu1, iova=0xffff1000, fsynr=0x80003, cb=0, sid=91(0x5b - PCIE5), pgd=43c40e003, pud=43c40e003, pmd=436f97003, pte=0

[ 316.147582] t19x-arm-smmu 12000000.iommu: Unhandled context fault: smmu0, iova=0xffff1000, fsynr=0x10013, cb=0, sid=91(0x5b - PCIE5), pgd=43c40e003, pud=43c40e003, pmd=436f97003, pte=0

Regards,
Bala

Ok. I didn’t expect that ASPM would be up. Any bad effects of the bad electricals are multiplied with ASPM enabled. You can keep ASPM disabled.
I didn’t understand the part below.

I think you must be allocating only 4K-sized buffers on the root port for the endpoint to DMA into. Since the SMMU is enabled on the root port side, any access beyond what was allocated results in SMMU errors. Please increase the size of the allocations; that should solve the issue.