Why does PCIe stall?

Hi,
I am running a TX1 (24.2.1) as a PCIe root with a Zynq endpoint. The Zynq uses VDMA to stream video to the TX1.
It works most of the time, but after many minutes (sometimes hours) PCIe stalls.
I compared TX1 register values and see only two differences between normal operation and a stall:

AFI_CONFIGURATION_0 bit INITIATOR_WRITE_IDLE
sudo busybox devmem 0x010038AC w
0x00FE8E41 - before stall
0x00FC8E41 - after stall

AFI_REQ_PENDING_0 bits TMS0C02SM_NONISO_PENDING and TMS0C02SM_COH_REQUEST_PEND
sudo busybox devmem 0x010038F4 w
0x00000000 - before stall
0x00000009 - after stall
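
The two comparisons above can be reduced to bit positions with an XOR of the before and after values. The following is only a minimal sketch; `changed_bits` is a hypothetical helper name, and the values are the devmem readings quoted in this post.

```python
def changed_bits(before, after):
    """Return the positions of the bits that differ between two 32-bit values."""
    diff = (before ^ after) & 0xFFFFFFFF
    return [bit for bit in range(32) if diff & (1 << bit)]


# Values copied from the devmem readings in this post.
afi_configuration = changed_bits(0x00FE8E41, 0x00FC8E41)
afi_req_pending = changed_bits(0x00000000, 0x00000009)

print("AFI_CONFIGURATION_0 changed bits:", afi_configuration)
print("AFI_REQ_PENDING_0 changed bits:", afi_req_pending)
```

With these readings, bit 17 of AFI_CONFIGURATION_0 goes from 1 to 0, and bits 0 and 3 of AFI_REQ_PENDING_0 go from 0 to 1.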

Could you please explain what this “request pending” means, what could have caused it, and what my next steps should be to troubleshoot the problem?

Thank you

Can you please try the following patch in the driver and let us know if it solves the issue?

diff --git a/drivers/pci/host/pci-tegra.c b/drivers/pci/host/pci-tegra.c
index 055da45160e2..e7ec0820a37d 100644
--- a/drivers/pci/host/pci-tegra.c
+++ b/drivers/pci/host/pci-tegra.c
@@ -119,7 +119,8 @@
 #define AFI_MSI_EN_VEC7_0                                              0xa8

 #define AFI_CONFIGURATION                                              0xac
-#define AFI_CONFIGURATION_EN_FPCI                              (1 << 0)
+#define AFI_CONFIGURATION_EN_FPCI                              BIT(0)
+#define AFI_CONFIGURATION_CLKEN_OVERRIDE               BIT(31)

 #define AFI_FPCI_ERROR_MASKS                                           0xb0

@@ -1541,7 +1542,8 @@ static int tegra_pcie_enable_controller(struct tegra_pcie *pcie)

        /* Finally enable PCIe */
        val = afi_readl(pcie, AFI_CONFIGURATION);
-       val |=  AFI_CONFIGURATION_EN_FPCI;
+       val |=  (AFI_CONFIGURATION_EN_FPCI |
+                       AFI_CONFIGURATION_CLKEN_OVERRIDE);
        afi_writel(pcie, val, AFI_CONFIGURATION);

        val = (AFI_INTR_EN_INI_SLVERR | AFI_INTR_EN_INI_DECERR |

Thank you for your reply.
Your suggestion “+ AFI_CONFIGURATION_CLKEN_OVERRIDE”
is essentially the same as
sudo busybox devmem 0x010038ac w 0x80FE8E41
from
https://devtalk.nvidia.com/default/topic/996441/jetson-tx1/bad-mode-in-error-handler-detected-code-0xbf000002/
It sets bit 31: CLKEN_OVERRIDE: This can override the clock enable in case of malfunction.
"
Setting this bit to one would set Tegra to always output reference clock to the endpoint.
Clear this bit would mean reference clock output is controlled by CLKREQ# sideband signal that is driven by the endpoints.
It appears this signal could be deasserted when the link is running, resulting Tegra not output the reference clock to the endpoint, which would essentially hang the endpoint.
"
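
As a sanity check on the equivalence claimed above, the value written with devmem is the normal-operation reading with bit 31 set. This is a small sketch only; `with_clken_override` is a hypothetical name, and the mask mirrors `BIT(31)` from the patch.

```python
CLKEN_OVERRIDE = 1 << 31  # same mask as BIT(31) in the patch


def with_clken_override(value):
    """Return a 32-bit register value with the CLKEN_OVERRIDE bit set."""
    return (value | CLKEN_OVERRIDE) & 0xFFFFFFFF


normal = 0x00FE8E41  # AFI_CONFIGURATION reading before a stall
patched = with_clken_override(normal)
print(hex(patched))
```

One difference is worth keeping in mind: the patch ORs the bit into whatever the register currently holds, while writing the fixed value 0x80FE8E41 also forces every other bit to this particular snapshot.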

Yes, setting this bit does work for me: I no longer see stalls,
but I would like to understand better how this works.
I do have CLKREQ# driven low on my board.
I also see that the TX1 drives CLKREQ# low when PCIe is enabled,
and that the PCIe clock never disappears, even when a stall happens.
So why does the CLKEN_OVERRIDE bit make a difference in my case?

The name ‘CLKEN_OVERRIDE’ is easy to confuse with PCIe’s CLKREQ sideband signal, but the two are different.
CLKEN_OVERRIDE here controls an internal clock. Part of the IP was being opportunistically clock gated, which led to stalls in high-bandwidth use cases; setting this bit disables that opportunistic clock gating. To reiterate, this has nothing to do with the CLKREQ sideband signal. Hope this explains it.

"To reiterate, this has nothing to do with the CLKREQ sideband signal."

Interesting. I was confused by this post by AastaLLL:
https://devtalk.nvidia.com/default/topic/996441/jetson-tx1/bad-mode-in-error-handler-detected-code-0xbf000002/
"Clear this bit would mean reference clock output is controlled by CLKREQ# sideband signal that is driven by the endpoints. It appears this signal could be deasserted when the link is running, resulting Tegra not output the reference clock to the endpoint, which would essentially hang the endpoint. "

But you are saying that even if my CLKREQ# is always asserted, I still need to set the CLKEN_OVERRIDE bit, right?

One last question: is it better to set this bit with the busybox approach or by rebuilding the kernel?
Adding “busybox devmem 0x010038ac w 0x80FE8E41” to a boot script would be easier than rebuilding the kernel,
but are there any advantages to rebuilding the kernel?

Thank you

"But you are saying that even if my CLKREQ# is always asserted, I still need to set the CLKEN_OVERRIDE bit, right?"
Yes, that’s right. I’ll ask AastaLLL to correct the post.

"The last question: is it better to use busybox approach or rebuilding kernel approach to set this bit?"
It is better to rebuild the kernel, but if for whatever reason you are not able to do that, adding the busybox command to a boot script should also work.
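
If the boot-script route is taken, one option is to read the register first and OR in bit 31, as the patch does, rather than writing a fixed value. The following is a rough sketch under several assumptions: it must run as root on the target, it assumes `busybox devmem` prints the value as a hex string (as in the readings earlier in this thread), and `devmem_read`, `devmem_write` and `set_clken_override` are hypothetical names.

```python
import subprocess

AFI_CONFIGURATION_ADDR = 0x010038AC  # address used with devmem in this thread
CLKEN_OVERRIDE = 1 << 31             # same mask as BIT(31) in the patch


def devmem_read(addr):
    """Read a register via busybox devmem (assumes hex output)."""
    out = subprocess.check_output(["busybox", "devmem", hex(addr), "w"])
    return int(out.decode().strip(), 16)


def devmem_write(addr, value):
    """Write a register via busybox devmem."""
    subprocess.check_call(["busybox", "devmem", hex(addr), "w", hex(value)])


def set_clken_override(read=devmem_read, write=devmem_write):
    """Read-modify-write: set bit 31 and leave every other bit untouched."""
    current = read(AFI_CONFIGURATION_ADDR)
    updated = (current | CLKEN_OVERRIDE) & 0xFFFFFFFF
    if updated != current:
        write(AFI_CONFIGURATION_ADDR, updated)
    return updated

# On the target, as root, a boot script would call: set_clken_override()
```

The `read` and `write` parameters exist only so the logic can be exercised off-target with stand-in functions.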