XavierNX EQOS LAN port sometimes doesn't link up

We are developing a common custom board for Jetson Nano and Xavier NX.

Only with the Xavier NX module, the Ethernet port sometimes fails to link up at startup.

The only difference between the Jetson Nano and Xavier NX setups is the CPU module;
the circuit from the CPU module socket to the RJ45 connector is exactly the same.

When the link fails to come up, the following lines do not appear in dmesg:

[ 34.616304] eqos 2490000.ether_qos eth1: Link is Up - 1Gbps/Full - flow control rx/tx
[ 34.618148] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

The reproduction rate is about 1/110.

The speed LED on the LAN port is also not lit, which I think means that auto-negotiation is not completing.
The link-up event does not seem to be raised from the kernel's phy_state_machine().

We have not made any changes to the EQOS driver except for the GPIO control of the LED.

Do we need to adjust any parameters of the EQOS PHY for stable auto-negotiation?

How many Xavier NX modules have you tried? Does the module work fine on the dev kit? For this 1/110 occasional issue, it looks more like a custom board design tolerance problem.

Hi, any update? Did you compare your design to the reference schematic? Did you follow the layout requirements listed in the OEM DG when doing the layout? Is there any possibility on test steps?

Hi,

How many Xavier NX modules have you tried?

Two custom boards.

Does the module work fine on dev kit?

We don’t have XavierNX Dev Kit yet.

For this 1/110 occasional issue, it looks more like a custom board design tolerance problem.

According to the Product Design Guide,
there is no difference between Jetson Nano and Xavier NX in the relevant circuit design constraints.

But in practice, we see the problem only with the Xavier NX.

Our circuit designer says the board meets the requirements of the design guide.

Just in case, we will run an Ethernet compliance test with the Xavier NX module.

Is there any possibility on test steps?

In our environment, we have confirmed that the problem occurs not only at boot time, but also when the EQOS interface is repeatedly brought down and up after boot.

while :
do
    ifconfig eth1 down
    sleep 5
    ifconfig eth1 up
    sleep 5
    ping -I eth1 -c 4 "$SERVERIP"
done

・PHY_AN reached (link up)

[ 1359.006360] phy_state_machine() PHY_UP
[ 1359.007015] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[ 1361.769380] phy_state_machine() PHY_AN
[ 1361.771345] eqos_adjust_link() set SPEED_1000
[ 1361.771699] eqos 2490000.ether_qos eth1: Link is Up - 1Gbps/Full - flow control rx/tx
[ 1361.772487] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

・PHY_AN never reached (link never comes up)

[ 3331.328016] phy_state_machine() PHY_UP
[ 3331.328646] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
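
The down/up loop above can be extended to count failures automatically instead of reading dmesg by hand. The following is only a minimal sketch, not the script used in this thread: the function names link_is_up and bounce_test and the SYSFS_NET override are hypothetical, and it assumes that the carrier attribute under /sys/class/net/<iface>/ is a good enough proxy for the "Link is Up" message.

```shell
#!/bin/sh
# SYSFS_NET can be overridden so the helpers can be exercised without hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Succeeds when the kernel reports carrier on the interface.
# The read fails while the interface is down; that is treated as "no link".
link_is_up() {
    [ "$(cat "$SYSFS_NET/$1/carrier" 2>/dev/null)" = "1" ]
}

# Bounce the interface <count> times and print "<failures>/<count>".
# <wait_s> is the settle time after each down and up (5 s in the thread).
bounce_test() {
    iface=$1; count=$2; wait_s=${3:-5}
    fails=0; i=0
    while [ "$i" -lt "$count" ]; do
        ifconfig "$iface" down
        sleep "$wait_s"
        ifconfig "$iface" up
        sleep "$wait_s"
        link_is_up "$iface" || fails=$((fails + 1))
        i=$((i + 1))
    done
    echo "$fails/$count"
}
```

For example, bounce_test eth1 2700 would run for roughly the same duration as the 12-hour test and print the number of link-up failures out of 2700.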

Is the failure rate 1/110 on both modules? Even if the design is the same for NX and Nano, your custom design could still be marginal. It is hard to tell the root cause of such an occasional failure. Did you check the board's SMT quality?

After a bit of digging, I found the following post.

I set “CONFIG_EQOS_DISABLE_EEE=y” and evaluated it in the continuous ifconfig down/up environment above.

I repeated down/up 15128 times over 3 days, and PHY_AN occurred every time.

After that, I disabled “CONFIG_EQOS_DISABLE_EEE” again and repeated down/up 2700 times over 12 hours, and PHY_AN was not raised once.

Are there any known issues with the EEE feature of EQOS on the Xavier NX?

PHY_AN was not raised once.

Do you mean that Ethernet failed to come up in all 2700 attempts, or that the issue did not happen at all?

We have not heard of other users reporting this disconnection problem with CONFIG_EQOS_DISABLE_EEE.

Do you mean that Ethernet failed to come up in all 2700 attempts, or that the issue did not happen at all?

Sorry, I meant that the reproduction rate is 1/2700.

We purchased an NVIDIA Jetson Xavier NX Developer Kit
and ran the same link-up test at board startup that we ran on the custom board.

As a result, the problem of EQOS (RTL8211F) not linking up at startup
was reproduced on the EVK (JetPack 4.4.1: L4T 32.4.4) after 90 minutes of continuous reboot testing.

The reproduction rate is about 1/107.

The reproduction procedure is as follows.

1. Download the SD Card Image of the NX DEVELOPER KIT from https://developer.nvidia.com/jetpack-sdk-441-archive

2. Extract jetson-nx-developer-kit-sd-card-image-441 and write sd-blob.img to the micro SDCard.

3. Boot the NX DEVELOPER KIT from the SD card, complete the setup, and log in.

4. Turn off Wi-Fi and Bluetooth from the GUI.

5. Set the eth0 IPv4 address to 10.0.0.77/8 from the GUI.

6. Connect the host PC to the EVK via a Gigabit Ethernet hub as follows:

NX DEVELOPER KIT <=> NETGEAR:GS-105E <=> HostPC
10.0.0.77/8                              10.0.0.36/8

7. Deploy the following shell script and service, and set up the automated test:

/root/pe_test.sh
/lib/systemd/system/pe_test.service

# chmod a+x /root/pe_test.sh
# systemctl enable pe_test

8. Insert a USB memory stick in the USB slot in case you want to interrupt the test intentionally.
   The reboot test stops when the USB flash drive is unplugged or the problem is reproduced.

9. Reboot from the UART console to start the continuous test.

The situation after reproduction is as follows.

[/root/pe_log/petest_log_20210603_151442.txt]
...
[2] EQOS [NG] date=20210603_151452
arg num=3
ITEM=EQOS
COMMAND=dmesg | grep eqos
EXPECTED=eqos 2490000.ether_qos eth0: Link is Up
RESULT=[    4.413594] eqos 2490000.ether_qos: Setting local MAC: 48 b0 2d 3d 78 86

The “Link is Up” event is not raised from the kernel and no IP address is assigned.
Perhaps the PHY_AN event is not even raised in phy_state_machine().

root@contec-desktop:~# ifconfig 
eth0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether 48:b0:2d:3d:78:86  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 37  

root@contec-desktop:~# dmesg | grep eqos
[    4.413594] eqos 2490000.ether_qos: Setting local MAC: 48 b0 2d 3d 78 86

mii-tool reports the link as ok, but ethtool reports “Link detected: no”.

root@contec-desktop:~# mii-tool eth0
eth0: negotiated 1000baseT-FD flow-control, link ok

root@contec-desktop:~# ethtool eth0
Settings for eth0:
        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Port: MII
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: no
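
The disagreement between the two tools shown above can be checked mechanically. This is a rough sketch under stated assumptions: the helper names ethtool_link, mii_link_ok, and phy_kernel_mismatch are hypothetical, and the parsing relies only on the "Link detected:" and "link ok" strings exactly as they appear in the output quoted in this post.

```shell
#!/bin/sh
# Reads `ethtool <iface>` output on stdin and prints the value of
# the "Link detected:" field ("yes" or "no").
ethtool_link() {
    sed -n 's/^[[:space:]]*Link detected:[[:space:]]*//p'
}

# Reads `mii-tool <iface>` output on stdin; succeeds if the PHY reports link.
mii_link_ok() {
    grep -q 'link ok'
}

# Succeeds when the PHY reports link but the kernel does not,
# which is the stuck state described in this post.
phy_kernel_mismatch() {
    iface=$1
    mii-tool "$iface" | mii_link_ok &&
        [ "$(ethtool "$iface" | ethtool_link)" = "no" ]
}
```

On the target, phy_kernel_mismatch eth0 && echo "PHY has link, kernel does not" would flag the condition without inspecting the full ethtool dump.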

The LAN Port LEDs are both lit.

Left:  Lights up Green
Right: Lights up Orange (sometimes blink)

This problem was reproduced not only on our custom board, but also on the EVK.
Could NVIDIA look into this problem? Thanks.

pe_test.sh (3.1 KB)
pe_test.service (317 Bytes)
dmesg.txt (63.3 KB)

Hi,

Can you test with JetPack 4.5.1? We always start debugging from the latest release rather than an older JetPack.

Can you test with JetPack 4.5.1? We always start debugging from the latest release rather than an older JetPack.

Hi, we rebuilt the environment with JetPack 4.5.1 and confirmed that the issue reproduces.

The reproduction rate is about 1/422.

The reproduction procedure is as follows.

1. Download the SD Card Image of the NX DEVELOPER KIT from https://developer.nvidia.com/jetpack-sdk-451-archive

2. Extract jetson-nx-jp451-sd-card-image.zip and write sd-blob.img to the micro SDCard.

3. Boot the NX DEVELOPER KIT from the SD card, complete the setup, and log in.

4. Turn off Wi-Fi and Bluetooth from the GUI.

5. Set the eth0 IPv4 address to 10.0.0.77/8 from the GUI.

6. Connect the host PC to the EVK via a Gigabit Ethernet hub as follows:

NX DEVELOPER KIT <=> NETGEAR:GS-105E <=> HostPC
10.0.0.77/8                              10.0.0.36/8

7. Deploy the following shell script and service to the NX DEVELOPER KIT, and set up the automated test:

/root/pe_test.sh
/lib/systemd/system/pe_test.service

pe_test.sh (3.1 KB)
pe_test.service (317 Bytes)

# chmod a+x /root/pe_test.sh
# systemctl enable pe_test

8. Insert a USB memory stick in the USB slot in case you want to interrupt the test intentionally.
   The reboot test stops when the USB flash drive is unplugged or the problem is reproduced.

9. Reboot from the UART console to start the continuous test.

The situation after reproduction is almost the same as with JetPack 4.4.1.
The “Link is Up” event is not raised from the kernel and no IP address is assigned.

This is not a custom board design tolerance problem.

Could NVIDIA look into this issue?

dmesg.txt (66.4 KB)

Just to confirm: if you remove CONFIG_EQOS_DISABLE_EEE from the kernel config, this issue does not happen?

First of all, CONFIG_EQOS_DISABLE_EEE is not set by default in JetPack 4.4.1/4.5.1.
So, to disable EEE, CONFIG_EQOS_DISABLE_EEE must be set to y.

# uname -a
Linux contec-desktop 4.9.201-tegra #1 SMP PREEMPT Fri Feb 19 08:42:04 PST 2021 aarch64 aarch64 aarch64 GNU/Linux

# zcat /proc/config.gz | grep EQOS
# CONFIG_EQOS_APE_HWDEP is not set
CONFIG_EQOS=y
# CONFIG_EQOS_DISABLE_EEE is not set
# CONFIG_DISABLE_EQOS_CTRL_TRISTATE is not set
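
Since the EEE setting changed between test runs in this thread, it may help to record the option's state alongside each test log. A small sketch, assuming the running kernel exposes /proc/config.gz as shown above; the helper name config_state is hypothetical.

```shell
#!/bin/sh
# Reads kernel config text on stdin and prints the value of the given
# option ("y", "m", ...) or "unset" if there is no "<option>=" line.
config_state() {
    opt=$1
    val=$(sed -n "s/^${opt}=//p")
    echo "${val:-unset}"
}
```

On the target this would be called as: zcat /proc/config.gz | config_state CONFIG_EQOS_DISABLE_EEE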

We tested on our custom board with CONFIG_EQOS_DISABLE_EEE=y, but the issue still occurred.

I haven’t tried it on the EVK yet,
because I thought I should first try it with NVIDIA’s original kernel image.

Hi,

But your earlier comment here seems to say that enabling CONFIG_EQOS_DISABLE_EEE helps with this issue. Which one is correct? Also, what are your pass criteria for this issue? How many reboots are required to pass your test?

But your earlier comment here seems to say that enabling CONFIG_EQOS_DISABLE_EEE helps with this issue. Which one is correct?

That’s because the test conditions are different.

I set “CONFIG_EQOS_DISABLE_EEE=y” and evaluated it in the continuous ifconfig down/up environment above.

That was a continuous ifconfig down/up test on a running board (no rebooting), and PHY_AN occurred every time for three days.

But in the reboot test above, even with CONFIG_EQOS_DISABLE_EEE=y, PHY_AN sometimes did not occur.

Also, what are your pass criteria for this issue? How many reboots are required to pass your test?

Our criterion is to keep rebooting for two days, about 1500 boots, with no occurrence of the issue.

There is a clear difference in the PHY_AN failure rate compared to the Jetson Nano.

Also, per the Gigabit Ethernet specification, auto-negotiation should not fail at such a high rate.
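
To put the observed rates next to the 1500-boot criterion, here is a back-of-the-envelope sketch. It assumes each boot fails independently with the same probability, which is a simplification and not something established in this thread; the helper name p_at_least_one is hypothetical.

```shell
#!/bin/sh
# Prints the probability of at least one failure in <n> trials when the
# per-trial failure rate is <k>/<m>, assuming independent trials:
#   P = 1 - (1 - k/m)^n
p_at_least_one() {
    awk -v k="$1" -v m="$2" -v n="$3" \
        'BEGIN { printf "%.3f\n", 1 - (1 - k / m) ^ n }'
}
```

For example, p_at_least_one 1 422 1500 estimates the chance of hitting the issue at least once during a 1500-boot run at the 1/422 rate seen on JetPack 4.5.1; under the independence assumption this comes out at roughly 97%, so the criterion would almost certainly fail.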

Got it. We will check this issue. Thanks for reporting and summary.

Hi,

I also want to ask one more question.

Do we have to use a host <-> devkit connection to reproduce this issue, or can we use a router here?

Hi,

Do we have to use a host ↔ devkit connection to reproduce this issue, or can we use a router here?

As described in the previous instructions, we first confirmed that
the problem was reproduced in the following configuration.

NX DEVELOPER KIT <=> NETGEAR:GS-105E <=> HostPC
10.0.0.77/8                              10.0.0.36/8

And yesterday, I confirmed the same reproduction in the following environment.

NX DEVELOPER KIT <=> JETSON NANO DEVELOPER KIT
10.0.0.77/8          10.0.0.36/8

If you want to rule out the influence of other nodes and packets on the router's network,
I think it is better to connect the two nodes directly to each other.

Hi,

One more question: does the interface return to a working state if you use ifconfig to bring it down and up?

One more question: does the interface return to a working state if you use ifconfig to bring it down and up?

Yes. The following manual operation restores normal operation.
Note that you need to wait about 5 seconds after down.

ifconfig eth0 down
sleep 5
ifconfig eth0 up

This is probably because the Auto Negotiation sequence will be run again.
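
As a stopgap until the root cause is found, the manual recovery above could be automated at boot. The following is only a sketch of that idea, not a fix and not something provided by NVIDIA: the names has_carrier and recover_link and the SYSFS_NET override are hypothetical, and it assumes that /sys/class/net/<iface>/carrier reads 1 once the link is up.

```shell
#!/bin/sh
# SYSFS_NET can be overridden so the helpers can be exercised without hardware.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Succeeds when the kernel reports carrier on the interface.
has_carrier() {
    [ "$(cat "$SYSFS_NET/$1/carrier" 2>/dev/null)" = "1" ]
}

# Bounce the interface until carrier appears, at most <tries> times.
# Returns non-zero if the link never comes up.
recover_link() {
    iface=$1; tries=${2:-3}; wait_s=${3:-5}
    n=0
    while ! has_carrier "$iface"; do
        [ "$n" -ge "$tries" ] && return 1
        ifconfig "$iface" down
        sleep "$wait_s"   # the thread reports about 5 s is needed after "down"
        ifconfig "$iface" up
        sleep "$wait_s"   # give auto-negotiation time to complete
        n=$((n + 1))
    done
}
```

A call such as recover_link eth0 3 5 || echo "eth0 link still down" could run from a boot-time service after the interface has had its normal chance to come up.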