Linux dies with Microchip PCIe switch

We have a design in which NVMe boards can be connected through a Microchip PCIe switch to either a Jetson AGX Xavier SOM or x86 COM Express boards. With the COM Express boards we don’t have any problems.

With the Jetson Xavier, the kernel dies in the middle of booting. After some printk() debugging I found that the kernel dies in the tegra_pcie_dw_link_up() routine when it’s about to read the PCI_EXP_LNKSTA DBI register. Before that the kernel reads a lot of other DBI registers, but only this read fails. The kernel seems to hang completely at that point, and the watchdog restarts the board some time later.
In another design we connect NVMe SSDs directly and everything works. The SSDs are connected to pcie@141a0000, Gen4, x8.
On the SOM interface board we also have a Broadcom NIC connected to PCIe x4, and it works perfectly. If we don’t power up the Microchip PCIe switch, Linux boots properly on the Jetson SOM. So it seems that only the PCIe switch makes the kernel hang during boot.

I tested both the latest L4T from JetPack 5.0.2 and the Yocto-based build from OE4T/meta-tegra.
Thanks for any advice.

Any log to share?

Basically, we fixed the original problem. It was the spread-spectrum clocking (SSC). To disable it I did roughly the same as described here: Tegra NX PCIe Reference Clock Design - #11 by krisrst
The only difference is that I’m using the latest meta-tegra Yocto build, so I modified L4T-tegra-35.1.0-r0/Linux_for_Tegra/bootloader/t186ref/tegra194-a02-bpmp-p2888-a04.dtb, rebuilt “tegra-bootfiles” package and then “core-image-minimal”. After the usual “sudo ./doflash.sh” step both UEFI and Linux started properly, the PCIe switch and some NVME SSDs were recognized properly.

Anyway, I still think there is a HW issue in the PCIe peripheral. As I wrote earlier, Linux dies when SSC is enabled and tegra_pcie_dw_link_up() gets called. The exact instruction that makes the CPU hang is a plain memory read of the PCI_EXP_LNKSTA register. It’s normal that the PCIe link won’t be established if the clock settings at the PCIe switch and at the Xavier are not compatible, but a memory read instruction should not make the CPU hang.
But I just don’t care anymore, as disabling SSC on PLLNVHS solved our original problem.
Kudos to krisrst for summarizing the steps required to modify the “bpmp” DTB file.

