We have developed a custom board that is compatible with the Jetson Nano, Xavier NX and TX2 NX. There is a PCIe switch on the board with a 1 Gbit Ethernet PHY (RTL8111E) behind it.
This is the same chip as the internal PHY of the Jetson Nano, so both load the same driver (r8168). We have adjusted the file r8168_n.c so that the same kernel object is not created twice, which previously led to a kernel panic. Our PCIe 1 Gbit PHY also has no EEPROM, so we use the random_mac() function (r8168_n_custom.c.txt attached). We didn’t change anything in the device tree regarding the Ethernet PHY.
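For readers without the attachment, here is a minimal user-space sketch of what generating a random MAC address for an EEPROM-less NIC usually involves. It is not the code from r8168_n_custom.c.txt; the names make_random_mac, mac_is_unicast and mac_is_locally_administered are hypothetical, and rand() only stands in for a proper random source. The point is the two bit operations on the first octet: the multicast bit has to be cleared and the locally-administered bit set, otherwise the address is not a valid station address.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAC_LEN 6

/* Unicast addresses have bit 0 of the first octet cleared. */
int mac_is_unicast(const uint8_t mac[MAC_LEN])
{
    return (mac[0] & 0x01) == 0;
}

/* Locally administered addresses have bit 1 of the first octet set. */
int mac_is_locally_administered(const uint8_t mac[MAC_LEN])
{
    return (mac[0] & 0x02) != 0;
}

/* Hypothetical helper: fill mac with a random, valid station address.
 * rand() is a placeholder; a driver would use the kernel's random source. */
void make_random_mac(uint8_t mac[MAC_LEN])
{
    assert(mac != NULL);

    for (int i = 0; i < MAC_LEN; i++)
        mac[i] = (uint8_t)(rand() & 0xff);

    mac[0] &= 0xfe; /* clear the multicast bit */
    mac[0] |= 0x02; /* set the locally-administered bit */
}
```

Inside the kernel, eth_random_addr() and eth_hw_addr_random() from linux/etherdevice.h apply the same two bit operations, so a driver normally does not need its own helper. Note that a random address changes on every boot unless it is stored somewhere.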
We currently have the following problem:
When we boot with the Ethernet cable plugged into Ethernet PHY 1 (Jetson internal), we do not get kernel panics and both Ethernet PHY 1 and 2 have a valid MAC address. When we then plug the Ethernet cable into Ethernet PHY 2 (external PHY), the interface gets an IP address and can communicate on the network. → Works as designed (dmesg_works.txt)
When we boot with the Ethernet cable plugged into Ethernet PHY 2 (external PHY), we get a kernel panic after the link comes up. Most of the time Linux boots and then gets slower and slower (because the kernel panic output is being flushed) until it crashes (syslog_failed.txt attached). If Ubuntu does boot and you then unplug the LAN cable, it also goes straight into a kernel panic and the system freezes.
Hi, we’ve checked the PCIe pin mapping; we’re using PCIE0 with lane 0. At this point everything should work, since we are also able to get IP packets through the external Realtek chip.
But this happens after replugging an Ethernet cable on the external chip:
Dec 14 20:54:00 edgekit-desktop kernel: [ 8.644184] WARNING: CPU: 0 PID: 3961 at /data/Linux_for_Tegra/source/public/kernel/kernel-4.9/drivers/iommu/tegra-smmu.c:901 __smmu_iommu_map_pfn_default+0x21c/0x228
(as described in syslog_failed.txt)
/* disable ASPM completely as that cause random device stop working
* problems as well as full system hangs for some PCIe devices users */
pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1 |
PCIE_LINK_STATE_CLKPM);
It’s the original code from JetPack 4.6.1; we’ve only added the MAC address and kernel object handling. Lines 3519 - 3527 of the code are:
/*
* FIFO overrun errors are observed when ASPM is enabled
* and flow control is disabled. This is causing perf
* drop. So disable ASPM if flow control is disabled.
*/
if (aspm && ((RTL_R8(PHYstatus) & (TxFlowCtrl | RxFlowCtrl)) !=
(TxFlowCtrl | RxFlowCtrl)))
pci_disable_link_state(tp->pci_dev, PCIE_LINK_STATE_L0S |
PCIE_LINK_STATE_L1 | PCIE_LINK_STATE_CLKPM);
I could rebuild the Ethernet driver with debug prints to check whether this function is called. Or do I need to add something so that it gets called?
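Before adding prints, it may help to see when the quoted condition actually fires. The sketch below pulls the test out into a plain function that can be compiled and run in user space. The bit values for TxFlowCtrl (0x40) and RxFlowCtrl (0x20) are an assumption taken from the mainline r8169 register layout and should be checked against the enum in your r8168 sources; the names should_disable_aspm and check_aspm_examples are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PHYstatus bit values (mainline r8169 layout).
 * Verify them against the enum in your copy of the r8168 driver. */
enum {
    RX_FLOW_CTRL = 0x20,
    TX_FLOW_CTRL = 0x40
};

/* Mirrors the quoted condition: ASPM is disabled only if the aspm
 * option is set AND at least one of the two flow-control bits is
 * missing from the PHY status register. */
int should_disable_aspm(int aspm, uint8_t phy_status)
{
    const uint8_t both = TX_FLOW_CTRL | RX_FLOW_CTRL;

    return aspm && ((phy_status & both) != both);
}

/* A few cases showing when the branch is taken. */
void check_aspm_examples(void)
{
    assert(should_disable_aspm(1, 0x00));  /* no flow control: disable  */
    assert(should_disable_aspm(1, 0x40));  /* TX only: still disable    */
    assert(!should_disable_aspm(1, 0x60)); /* both bits set: ASPM stays */
    assert(!should_disable_aspm(0, 0x00)); /* aspm option off: no-op    */
}
```

So with flow control fully negotiated, or with the aspm option at 0, pci_disable_link_state() is never reached on this path. To confirm this on the target, a dev_info(&tp->pci_dev->dev, ...) call placed inside the if body would show up in dmesg whenever the branch is taken.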
Hi @WayneWWW, we’ve built a new board without the PCIe switch. Same error, UART log attached. SN_Eth2_boot.txt (37.8 KB)
I think the problem is that both Ethernet PHYs load the same driver, and the driver may have a fixed IOMMU definition and cannot distinguish between two different hardware PHYs?
Some really strange new information: when I try the same board (and driver) with the Xavier NX, the Realtek 8111F Ethernet PHY only starts working after 2-3 minutes, but there are no kernel panics. After those 2-3 minutes networking works, and unplugging the Ethernet cable also works without any problems.
So the r8168 driver shouldn’t be the cause of our issue. Any new ideas what it could be?
I have some updates on this topic. After days with the same problem, analyzing and trying everything, we saw that the r8168 driver with the Realtek 8111F chip running on Ubuntu 18.04 is highly unstable. So on the Xavier NX we are no longer building the r8168 driver and let the r8169 driver do its thing. Getting an IP address via DHCP works, and networking over longer periods also works. The only issue we have right now: it only works in reboot scenarios, not in shutdown scenarios (reboot = IP address and everything works; shutdown = just a MAC address and no IP address possible).