"pci=nomsi" necessary for loopback test of LAN743x controllers

Dear NVidia Team

We have connected a LAN7431/LAN8870 controller board to the PCIe bus of a custom AGX Orin platform running kernel 6.6.29 with JetPack 6.0.
The 4 LAN7431/LAN8870 controllers are successfully detected on the PCIe bus, but a loopback test fails unless we set the kernel option “pci=nomsi”. The same LAN7431/LAN8870 controller board runs fine with MSI enabled on an x86_64 (Intel Elkhart Lake) platform.

When we use the kernel option pci=nomsi, communication (iperf3) starts but suddenly stops after a few minutes without any error message.

Any idea why MSI must be disabled on the AGX Orin?

The error we get without pci=nomsi:

[  124.046071] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error message received from 0005:03:00.0
[  124.046548] lan743x 0005:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[  124.057993] lan743x 0005:03:00.0:   device [1055:7431] error status/mask=00004000/00400000
[  124.066525] lan743x 0005:03:00.0:    [14] CmpltTO                (First)
[  124.073475] lan743x 0005:03:00.0: AER: can't recover (no error_detected callback)
[  124.073523] pcieport 0005:02:01.0: AER: device recovery failed
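
For readers decoding the log above: the `error status/mask=00004000/00400000` pair is a bitmask over the AER Uncorrectable Error Status and Mask registers. A minimal Python sketch of the decoding follows. The bit names mirror the short strings the Linux AER driver prints in dmesg; the function names `parse_status_mask` and `decode_aer_uncorrectable` are made up for illustration and are not part of any tool.

```python
import re

# Bit positions of the PCIe AER Uncorrectable Error Status register,
# labelled with the short names the Linux AER driver prints in dmesg.
AER_UNCORRECTABLE_BITS = {
    4: "DLP",            # Data Link Protocol error
    5: "SDES",           # Surprise Down error
    12: "TLP",           # Poisoned TLP
    13: "FCP",           # Flow Control Protocol error
    14: "CmpltTO",       # Completion Timeout
    15: "CmpltAbrt",     # Completer Abort
    16: "UnxCmplt",      # Unexpected Completion
    17: "RxOF",          # Receiver Overflow
    18: "MalfTLP",       # Malformed TLP
    19: "ECRC",          # ECRC error
    20: "UnsupReq",      # Unsupported Request
    21: "ACSViol",       # ACS Violation
    22: "UncorrIntErr",  # Uncorrectable Internal Error
}


def parse_status_mask(line):
    """Return (status, mask) hex strings from an AER dmesg line, or None."""
    match = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    return (match.group(1), match.group(2)) if match else None


def decode_aer_uncorrectable(value_hex):
    """List the names of all bits set in a 32-bit hex register value."""
    value = int(value_hex, 16)
    return [
        AER_UNCORRECTABLE_BITS.get(bit, f"bit{bit}")
        for bit in range(32)
        if value & (1 << bit)
    ]


line = "lan743x 0005:03:00.0:   device [1055:7431] error status/mask=00004000/00400000"
status, mask = parse_status_mask(line)
print(decode_aer_uncorrectable(status))  # ['CmpltTO']
print(decode_aer_uncorrectable(mask))    # ['UncorrIntErr']
```

Applied to the line above, the status decodes to a single set bit, 14 (Completion Timeout), which matches the `[14] CmpltTO` line, while the mask shows that only bit 22 is masked.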

Connecting one interface to an external network also works without the “pci=nomsi” setting and does not produce any error message.
Thank you for your help.

Kind regards

Hi sevm89,

Please share the full dmesg for further check.

Hi KevinFFF

Here is the dmesg output with and without the argument “pci=nomsi”:
dmesg_nomsi.txt (89.2 KB)
dmesg.txt (102.0 KB)

Thank you.

Hi KevinFFF

Do you see anything in the log files?
Thank you.

[    0.000000] Linux version 6.6.29-tegra (root@syslogic-desktop) (aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38) #1 SMP PREEMPT Mon May  6 13:37:47 CEST 2024

Could you fall back to K5.15 in JP6.0GA?
We don’t debug issues on K6.6 directly.

Hi KevinFFF

We cannot fall back to K5.15 as the driver for the LAN7431/LAN8870 is only available for kernel 6.6.

Kind regards

We don’t have the module or K6.6 with JP6.0GA, so we cannot help you debug this further.
Have you also asked your vendor for help with this issue?

Hi KevinFFF

Yes, we are in contact with the vendor.
However, we also have x86 platforms where we use the same LAN743x controllers, and they work fine there, also with MSI-X interrupts enabled.

We came across the following topic:

Is it possible that we face a similar issue as mentioned there?
Thank you.

I cannot comment on whether you hit the same issue, since you are using a different kernel (K6.6).
Does that patch help in your case?

Hi KevinFFF

No, it does not help. However, as the patch was written for a different kernel, we are not sure whether we ported it correctly.

Maybe you can try verifying on K5.15 first since we don’t have the environment and the equipment to verify it on K6.6.

K5.15 is not an option at the moment as we do not have a working driver for that kernel version.

We did further tests, and the test of the LAN743x controllers now works for some time without the pci=nomsi argument. However, communication still stops after some time with the message:

lan743x 0005:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)

When we add pci=noaer to the kernel arguments, we are able to recover the communication by doing:

ifconfig poe1 down
ifconfig poe1 up

We test the communication with iperf3. When we just ping the interfaces, we do not see any connection losses, so the problem seems to be related to the bandwidth or the interrupt rate.
Any help is really appreciated.
Thank you.

Hi KevinFFF

As the error is a “CmpltTO”, is there any way to increase this timeout? Is it part of the PCIe driver, or is it specific to each PCIe device?
Thank you.
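
On the timeout question: as far as the PCIe specification goes, the completion timeout is a per-function setting in the requester's Device Control 2 register (offset 0x28 in the PCI Express capability), not something owned by the root port driver. Bits 3:0 select a timeout range and bit 4 disables the mechanism. Which ranges a device accepts is advertised in its Device Capabilities 2 register, so a given value may not be supported by the LAN7431. The Python sketch below only decodes a raw register value, which could be read with something like `setpci -s 0005:03:00.0 CAP_EXP+28.w` (`lspci -vv` shows the same field on its `DevCtl2` line). The function name and the sample values are illustrative and are not taken from this thread.

```python
# Completion Timeout Value encodings (Device Control 2, bits 3:0)
# as defined by the PCI Express Base Specification.
COMPLETION_TIMEOUT_RANGES = {
    0x0: "default (50 us to 50 ms)",
    0x1: "50 us to 100 us",
    0x2: "1 ms to 10 ms",
    0x5: "16 ms to 55 ms",
    0x6: "65 ms to 210 ms",
    0x9: "260 ms to 900 ms",
    0xA: "1 s to 3.5 s",
    0xD: "4 s to 13 s",
    0xE: "17 s to 64 s",
}


def decode_devctl2_timeout(devctl2):
    """Decode the completion timeout fields of a 16-bit Device Control 2 value."""
    value = devctl2 & 0xF            # bits 3:0: Completion Timeout Value
    disabled = bool(devctl2 & 0x10)  # bit 4: Completion Timeout Disable
    return {
        "range": COMPLETION_TIMEOUT_RANGES.get(value, "reserved"),
        "disabled": disabled,
    }


# Hypothetical register values, for illustration only.
print(decode_devctl2_timeout(0x0000))  # default range, timeout enabled
print(decode_devctl2_timeout(0x0006))  # 65 ms to 210 ms, timeout enabled
print(decode_devctl2_timeout(0x0010))  # default range, timeout disabled
```

A longer timeout would only hide a completion that arrives late; it would not help if the completion never arrives at all.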

Hi sevm89,

Guessing you’re using an out of kernel driver for your board? There does appear to be a PHY and controller driver for your hardware in the Nvidia kernel 5.15 in R36.3, have you tried that? You’ll need to enable the module as it isn’t configured by default.

I would also check whether your driver is enabling RTD3 runtime power management, because this doesn’t work with the Nvidia PCIe root port driver as it stands. Look for calls to pm_runtime_allow and remove them to test without.
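
One way to check this without rebuilding the driver is through sysfs: each PCI device exposes `power/control` (`auto` when runtime PM is allowed, `on` when it is forbidden) and `power/runtime_status`. The Python sketch below reads both. The helper name `runtime_pm_state` is hypothetical, and the `sysfs_root` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def runtime_pm_state(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the runtime PM attributes of a PCI device from sysfs.

    bdf is the full device address, e.g. "0005:03:00.0".
    Attributes that do not exist are reported as None.
    """
    power_dir = Path(sysfs_root) / bdf / "power"
    state = {}
    for name in ("control", "runtime_status"):
        attr = power_dir / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    # "auto" means the kernel may runtime-suspend the device;
    # "on" keeps the device permanently powered.
    state["runtime_pm_allowed"] = state["control"] == "auto"
    return state
```

Calling `runtime_pm_state("0005:03:00.0")` on the target would report the state for the controller named in the AER log. Writing `on` to the device's `power/control` file as root forbids runtime suspend, which should have the same effect as removing the `pm_runtime_allow` call and makes for a quick test.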

I see your controller board has multiple 1GbE interfaces. If you do get it working, you might find you’re not seeing very good performance, in which case the patch I wrote, mentioned by KevinFFF, could be useful for you, but it doesn’t explain your current issues.

Hi KevinFFF

It seems we have finally found the solution.
The error we saw was caused by an erratum of the PCIe switch used in combination with the LAN743x controllers. Changing the settings of the switch solves the issue.
Thank you for your help in this case.

Kind regards
sevm89
