We have connected a LAN7431/LAN8870 controller board to the PCIe bus of a custom AGX Orin platform running kernel 6.6.29 with JetPack 6.0.
The 4 LAN7431/LAN8870 controllers are detected on the PCIe bus, but a loopback test fails unless we set the kernel option “pci=nomsi”. The same LAN7431/LAN8870 controller board runs fine with MSI enabled on an x86_64 (Intel Elkhart Lake) platform.
With the kernel option pci=nomsi, communication (iperf3) starts but stops after a few minutes without any error message.
Any idea why MSI must be disabled on the AGX Orin?
Connecting one interface to an external network also works without the “pci=nomsi” setting and does not produce any error message.
Thank you for your help.
[ 0.000000] Linux version 6.6.29-tegra (root@syslogic-desktop) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Mon May 6 13:37:47 CEST 2024
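For anyone comparing the MSI and pci=nomsi cases described above, it can help to confirm which interrupt mode each controller actually ended up with. The following is a minimal sketch, not taken from the original post: the helper name msi_state is made up for illustration, and it assumes the usual `lspci -vv` capability lines of the form "MSI: Enable+" / "MSI-X: Enable-".

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -vv` output on stdin and print only
# the MSI / MSI-X enable flags ("+" = enabled, "-" = disabled).
# Uses grep -o, which GNU and BusyBox grep both provide.
msi_state() {
    grep -oE 'MSI(-X)?: Enable[+-]'
}
```

A possible invocation is `sudo lspci -vv -s <BDF> | msi_state`, where <BDF> is the bus address of one LAN743x function as listed by plain `lspci`. With pci=nomsi on the command line, both flags would be expected to show "Enable-".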
Could you fall back to K5.15 in JP6.0GA?
We don’t debug the issue in K6.6 directly.
Yes we are in contact with the vendor.
However, we also have x86 platforms where we use the same LAN743x controllers, and they are working just fine also with MSI-X interrupts activated.
We came across the following topic:
Is it possible that we face a similar issue as mentioned there?
Thank you.
K5.15 is not an option at the moment as we do not have a working driver for that kernel version.
We did further tests, and the LAN743x controllers now work for some time without the pci=nomsi argument. However, communication still stops after some time with the message:
When we add pci=noaer to the kernel arguments, we are able to recover the communication by doing:
ifconfig poe1 down
ifconfig poe1 up
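The same down/up recovery can be written with iproute2 on images that do not ship ifconfig. This is only a sketch: the function name bounce_iface and the IP_CMD / BOUNCE_DELAY variables are made up for illustration, and the one-second pause is an arbitrary choice, not something the driver is known to need.

```shell
#!/bin/sh
# Hypothetical helper: take an interface down and bring it back up,
# equivalent to the ifconfig down/up pair above.
# IP_CMD can be set to "echo" for a dry run; BOUNCE_DELAY is in seconds.
bounce_iface() {
    iface=$1
    ${IP_CMD:-ip} link set dev "$iface" down &&
    sleep "${BOUNCE_DELAY:-1}" &&
    ${IP_CMD:-ip} link set dev "$iface" up
}
```

Running `bounce_iface poe1` as root would then restart the interface; with IP_CMD=echo it only prints the two commands it would run.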
We test the communication with iperf3. When we only ping the interfaces, we do not see any connection losses, so the problem seems to be related to bandwidth or interrupt rate.
Any help is really appreciated.
Thank you.
As the error is a “CmpltTO”, is there any possibility to increase this timeout? Is it part of the PCIe driver or specific to each PCIe device?
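Some background on the question above: in the PCIe specification the completion timeout is a per-function setting in the Device Control 2 register (offset 0x28 in the PCI Express capability), with the supported ranges advertised in Device Capabilities 2. The sketch below decodes a raw Device Control 2 value into the programmed range. The function name cpl_timeout_range is made up for illustration, and whether the Orin root port or the LAN743x actually supports programming other ranges is an assumption that has to be checked in DevCap2.

```shell
#!/bin/sh
# Hypothetical helper: decode a Device Control 2 value (hex, with or
# without a 0x prefix) into the Completion Timeout range defined by the
# PCIe specification. Bits 3:0 = timeout value, bit 4 = timeout disable.
cpl_timeout_range() {
    val=$(( 0x${1#0x} ))
    if [ $(( val & 0x10 )) -ne 0 ]; then
        echo "disabled"
        return 0
    fi
    case $(( val & 0xf )) in
        0)  echo "default (50 us to 50 ms)" ;;
        1)  echo "50 us to 100 us" ;;
        2)  echo "1 ms to 10 ms" ;;
        5)  echo "16 ms to 55 ms" ;;
        6)  echo "65 ms to 210 ms" ;;
        9)  echo "260 ms to 900 ms" ;;
        10) echo "1 s to 3.5 s" ;;
        13) echo "4 s to 13 s" ;;
        14) echo "17 s to 64 s" ;;
        *)  echo "reserved" ;;
    esac
}
```

The raw value can be read as root with `setpci -s <BDF> CAP_EXP+28.w` and passed to the function; `lspci -vv` prints the same field as "DevCtl2: Completion Timeout: ...". The timeout that matters is the one on the requester side, so it may be worth checking both the root port and the endpoint.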
Thank you.
Guessing you’re using an out-of-tree driver for your board? There does appear to be a PHY and controller driver for your hardware in the Nvidia kernel 5.15 in R36.3; have you tried that? You’ll need to enable the module, as it isn’t configured by default.
I would also check whether your driver is enabling RTD3 runtime power management, because this doesn’t work with the Nvidia PCIe root port driver as it stands. Look for calls to pm_runtime_allow and remove them to test without.
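A quick way to see whether runtime PM has been allowed for a device, without reading the driver source, is the standard power/control attribute in sysfs: "auto" means runtime PM is allowed (the state pm_runtime_allow puts the device in), "on" means it is forbidden. The sketch below lists the PCI functions currently set to "auto". The function name runtime_pm_auto and the SYSFS_PCI override are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the bus address of every PCI function whose
# power/control attribute reads "auto" (runtime PM allowed).
# SYSFS_PCI can point at another directory, e.g. a fake tree for testing.
runtime_pm_auto() {
    for ctl in "${SYSFS_PCI:-/sys/bus/pci/devices}"/*/power/control; do
        [ -r "$ctl" ] || continue
        if [ "$(cat "$ctl")" = "auto" ]; then
            dev=${ctl%/power/control}
            echo "${dev##*/}"
        fi
    done
}
```

If one of the LAN743x functions shows up in the list, writing "on" to its power/control file as root forbids runtime PM for that device until the next reboot, which allows a test before patching the driver.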
I see your controller board has multiple 1GbE interfaces. If you do get it working you might find you’re not seeing very good performance, in which case the patch I wrote mentioned by KevinFFFF could be useful for you, but it doesn’t explain your current issues.
It seems we have finally found the solution.
The error we saw was caused by an erratum of the PCIe switch used in combination with the LAN743x controllers. Changing the settings of the switch solves the issue.
Thank you for your help in this case.