PCIe doesn't work on my custom board with a TX2 4GB

Hi,
We designed our own carrier board for the TX2 4GB, with one USB 3.0 interface and one M.2 (M key) interface. The M.2 interface uses all 4 lanes of PCIe 0. The design follows the default PCIe and USB lane configuration (config 2) of your EV kit P2771. We connected an M.2 (M key) device, but the system does not detect it at all.
Some dmesg output is included below for your reference.
Please give me some advice.
Thanks.

nvidia@nvidia-desktop:~$ dmesg |grep pci
[ 0.462619] iommu: Adding device 10003000.pcie-controller to group 53
[ 0.462635] arm-smmu: forcing sodev map for 10003000.pcie-controller
[ 1.037274] tegra-pcie 10003000.pcie-controller: 4x1, 1x1 configuration
[ 1.038372] tegra-pcie 10003000.pcie-controller: PCIE: Enable power rails
[ 1.038793] tegra-pcie 10003000.pcie-controller: probing port 0, using 4 lanes
[ 1.042087] tegra-pcie 10003000.pcie-controller: probing port 2, using 1 lanes
[ 1.192160] ehci-pci: EHCI PCI platform driver
[ 1.192207] ohci-pci: OHCI PCI platform driver
[ 1.479879] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 1.888991] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 2.296245] tegra-pcie 10003000.pcie-controller: link 0 down, retrying
[ 2.298288] tegra-pcie 10003000.pcie-controller: link 0 down, ignoring
[ 2.703446] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 3.110496] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 3.523907] tegra-pcie 10003000.pcie-controller: link 2 down, retrying
[ 3.525946] tegra-pcie 10003000.pcie-controller: link 2 down, ignoring
[ 3.730490] tegra-pcie 10003000.pcie-controller: PCIE: no end points detected
[ 3.730820] tegra-pcie 10003000.pcie-controller: PCIE: Disable power rails
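
For anyone comparing several boots, the link-up attempts in a log like the one above can be tallied with a small script. This is only a minimal sketch in Python: the function name `summarize_links` is hypothetical, and the patterns assume the exact message wording shown in this log ("probing port N, using M lanes", "link N down, retrying/ignoring"), which other L4T releases may phrase differently.

```python
import re
import sys
from collections import defaultdict

# Patterns assume the tegra-pcie message wording seen in the log above.
PROBE_RE = re.compile(r"tegra-pcie \S+: probing port (\d+), using (\d+) lanes")
LINK_RE = re.compile(r"tegra-pcie \S+: link (\d+) down, (retrying|ignoring)")


def summarize_links(dmesg_text):
    """Return {port: {"lanes", "retries", "gave_up"}} parsed from dmesg text."""
    ports = defaultdict(lambda: {"lanes": None, "retries": 0, "gave_up": False})
    for line in dmesg_text.splitlines():
        m = PROBE_RE.search(line)
        if m:
            ports[int(m.group(1))]["lanes"] = int(m.group(2))
            continue
        m = LINK_RE.search(line)
        if m:
            port = ports[int(m.group(1))]
            if m.group(2) == "retrying":
                port["retries"] += 1      # driver will try this port again
            else:
                port["gave_up"] = True    # "ignoring": driver stopped retrying
    return dict(ports)


if __name__ == "__main__":
    # Example: dmesg | python3 summarize_links.py
    for port, info in sorted(summarize_links(sys.stdin.read()).items()):
        print(port, info)
```

A port that ends with `gave_up` set to True never trained its link during boot, which is the case for both port 0 and port 2 in the log above.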

There could be multiple reasons for this. Please check the following:

  • Check that the Tx/Rx lane routing is correct. You can confirm this by plugging in a PCIe loopback card and checking whether the link comes up.
  • Once the link is up in the above experiment, check with a scope whether you can observe a 100 MHz signal on the REFCLK lines.
  • Check whether you can observe a transition on the PERST# line during the link-up procedure.

Hi Vidyas,
Thanks for the quick reply.
I have checked the circuit over the past few days, and the routing looks the same as on the EV kit board P2597. Anyway, I will try to get a PCIe loopback card and double-check it. Do you have a recommendation for a PCIe loopback card? I searched online but could not find an M.2 (M key) PCIe loopback card.
Meanwhile, is it possible that the problem comes from a software setting? I found your solution to a similar problem in the post “PCIe-USB bridge can't be detected by TX2”, which said “delay applying reset to end point by 100ms after refclk started”. Where (in which file) should this change be made?

By the way, I measured the PCIe REFCLK on our board and could not see the 100 MHz signal. We then measured the same signal on a third-party EV kit board (with the same M.2 interface) and did see the 100 MHz REFCLK signal, with an amplitude of about 250 mV to 300 mV. Both tests were done with no PCIe device connected.
The REFCLK signal was not generated at all on our board.

Hi vidyas,
I checked the REFCLK and PERST# lines, and neither of them shows any signal during the link-up procedure. The project has already been delayed for quite a long time by this issue and the USB issue (which I raised in another post), so please help us.

I got some error messages during the boot process. Can these errors lead to the PCIe failure?


It seems the screenshots can’t be uploaded, so I am typing the major error messages below:
[ 0.548652] pca953x 0-0074: failed reading register
[ 1.309718] ina3221x 0-0042: ina3221 reset failure status:0xffffff87
[ 2.449963] tegra-xusb 3530000.xhci: ERROR: unexpected setup address command completion code 0x11.
[ 2.865670] usb 1-2: device not accepting address 2, error -22
cp: not writing through dangling symlink ‘etc/resolv.conf’
[ 3.909821] cgroup: cgroup2: unknown option “nsdelegate”

No, these errors are not related to PCIe. I guess you would still see them when no PCIe device is connected.
Please reboot a few times to see whether you can get past them.

Also, are you able to dump the log from the serial console instead of using the screen?

Hi WayneWWW,
Could you tell me how to use the serial console to dump the log? Thanks.

https://elinux.org/Jetson/General_debug

Thanks WayneWWW.
Please find the uploaded log file for the serial console.
One more issue: the USB ports no longer work after I updated L4T from R32.2.1 to R32.3.1 on my custom board with the TX2 4GB. They worked on R32.2.1.
Boot Console.log (21.3 KB)

Hi,

Please read the elinux page I just posted and remove the “quiet” keyword from the kernel command line. Your log is not the full boot log.

If you don’t want to do that, you could also log in, run the command “dmesg”, and share the result with me.
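
As a rough illustration of removing the “quiet” keyword mentioned above: the sketch below assumes the kernel arguments live on an `APPEND` line of an extlinux-style config (on L4T R32 this is commonly /boot/extlinux/extlinux.conf, but please confirm against the elinux page for your release). The function name `drop_quiet` is hypothetical, and the script only transforms text; back up the real file before writing anything to it.

```python
def drop_quiet(conf_text):
    """Remove the standalone 'quiet' token from every APPEND line.

    Other lines are returned unchanged. Runs of spaces inside an APPEND
    line are collapsed to single spaces as a side effect of re-joining.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip()
        if stripped.upper().startswith("APPEND"):
            indent = line[: len(line) - len(stripped)]
            tokens = [t for t in stripped.split() if t != "quiet"]
            line = indent + " ".join(tokens)
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if conf_text.endswith("\n") else "")


if __name__ == "__main__":
    # Hypothetical APPEND line, for illustration only.
    print(drop_quiet("      APPEND ${cbootargs} quiet\n"), end="")
```

With “quiet” removed, the kernel prints its full boot messages to the serial console, which is what makes the captured log complete.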

For your USB issue, I guess the device tree has some problem, but I cannot tell which part is wrong just by reading those logs. You had better read the TX2 adaptation guide and modify the device tree accordingly.

Also, if you still have the USB issue, please file a separate topic, because this topic is for PCIe.

Hi,
OK, I will read the elinux page again.
The point is that my USB doesn’t work now, so I can’t type the dmesg command, and I can’t even log in. For the PCIe issue, I did capture the dmesg output earlier, when the board was running R32.2.1. Please check whether it’s helpful.
Custom Board_TX2 4GB_dmesg.log (60.5 KB)

If no device is detected, the software powers down the controller. So REFCLK can be observed only for a short time (the time during which the root port tries to enumerate the endpoint device).
To keep the root port alive even when no endpoint device is detected, please remove the following entry from the respective PCIe controller node:
nvidia,enable-power-down;
This should let you see REFCLK at any time after boot completes (assuming there are no routing issues with the REFCLK signal).

Hi vidyas,
I didn’t find “nvidia,enable-power-down” in the PCIe controller nodes. To be exact, I didn’t find this property in any of the dts files. Could you clarify which file you mean?

Thanks.

My apologies. I thought this thread was about Jetson AGX Xavier.
Since this is for TX2, you can apply the change below to keep the PCIe controller enabled even when no PCIe link comes up.

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 63c0e343d388..52c46b68981e 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -2194,7 +2194,7 @@ static bool tegra_pcie_port_check_link(struct tegra_pcie_port *port)
                tegra_pcie_port_reset(port);
        } while (--retries);
  
-       return false;
+       return true;
 }
  
 static void tegra_pcie_apply_sw_war(struct tegra_pcie_port *port,

Hi,
After several days of debugging, we finally found that it was a hardware issue: the RX and TX signals were inverted on our custom carrier board. We fixed that, and the PCIe device (a video capture card) is now detected.
However, when we ran the same verification test with another PCIe device (a video capture card from another brand), it was not detected, even though this card works fine on the third-party eval carrier board.
We then installed the failing card on the NVIDIA carrier board through a PCIe to M.2 converter, and it still didn’t work. As a double check, the card that works on our board also works fine on the NVIDIA carrier board with this converter.
Could you give me some hints on this problem?
Thanks.

Hi,
After I modified pci-tegra.c to “return true” in the tegra_pcie_port_check_link function, to keep the PCIe controller enabled, the lspci and dmesg commands gave the result shown in the picture.

So, what to do next?
Thanks

Along with the change mentioned above, could you please experiment with the delay in the change below and see if that helps (for example, increasing the 100 ms to 1000 ms)?
Also, if the link still does not come up, please share the output of ‘sudo lspci -vvvv’.
I would like to take a look at “DLActive”. If it shows up as “DLActive+” instead of “DLActive-” even though the link was reported down at boot, then the link with the device comes up at a later point in time, which means the delay has to be increased even further.
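
The DLActive check described above can also be scripted when trying several delay values. This is a minimal sketch in Python, not an official tool: the function name `dl_active_by_device` is hypothetical, and it assumes the usual `lspci -vvvv` layout where each device starts with an unindented header line and the flag appears on an indented `LnkSta:` line.

```python
import re
import sys

DL_RE = re.compile(r"DLActive([+-])")


def dl_active_by_device(lspci_text):
    """Map each device address to True (DLActive+) or False (DLActive-)."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented line starts a new device, e.g. "00:01.0 PCI bridge: ..."
            current = line.split()[0]
        elif current and "LnkSta:" in line:
            m = DL_RE.search(line)
            if m:
                result[current] = (m.group(1) == "+")
    return result


if __name__ == "__main__":
    # Example: sudo lspci -vvvv | python3 dl_active.py
    for dev, active in dl_active_by_device(sys.stdin.read()).items():
        print(dev, "DLActive+" if active else "DLActive-")
```

If the root port reports DLActive+ here while dmesg still said the link was down during boot, that matches the case described above, where the device finishes link training only after the driver has stopped waiting.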

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 63c0e343d388..a15964e78ffe 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -2194,7 +2194,7 @@ static bool tegra_pcie_port_check_link(struct tegra_pcie_port *port)
                tegra_pcie_port_reset(port);
        } while (--retries);
 
-       return false;
+       return true;
 }
 
 static void tegra_pcie_apply_sw_war(struct tegra_pcie_port *port,
@@ -2515,7 +2515,7 @@ static void tegra_pcie_check_ports(struct tegra_pcie *pcie)
        }
 
        /* Wait for clock to latch (min of 100us) */
-       udelay(100);
+       msleep(100);
        reset_control_deassert(pcie->pciex_rst);
        /* at this point in time, there is no end point which would
         * take more than 20 msec for root port to detect receiver and