AQR113C PHY firmware update corrupted

Hello,

We are using the DRIVE AGX kit's 10GBASE-T connection (mgbe1_0) to communicate with a 10G-T switch. The cable is CAT6 STP and less than 5 m long.

Recently we noticed that a few devices would fail autonegotiation and we would not get a link on this port. After some time the Orin would report the following PCS error.

[243088.778675] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][449][type:0x4][loga-0x0] Failed to get PCS block lock

In an attempt to remedy the situation, we decided to update the firmware of the AQR113C PHY included in the AGX dev kit using the NVIDIA-provided tooling:

/lib/firmware/marvell_ethernet/AQR113C$ ./flash_aqr113c

We first ran the --IsReady flag to ensure the PHY was ready:

$ ./flash_aqr113c --IsReady mgbe1_0
Device initialization done and is READY for flashing

Then we checked to make sure the PHY needed an update:

$ ./flash_aqr113c --GetCurrentVersion mgbe1_0
5.6
$ ./flash_aqr113c --VersionCompare mgbe1_0 AQR-G4_v5.6.1-AQR_Marvell_NoSwap_XFI_ID44874_VER1836.cld
Current FW version: 5.6
Input FW version is AQR-G4_v5.6.1-AQR_Marvell_NoSwap_XFI_ID44874_VER1836.cld
Input FW version is Lesser
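
The comparison output above can be cross-checked by hand. The following is a minimal Python sketch, not part of the NVIDIA tooling, that extracts the dotted version from the image file name and compares it with the version reported by the PHY. The helper names parse_fw_version and compare_versions are hypothetical, and how flash_aqr113c derives its own result is not documented in this thread.

```python
import re


def parse_fw_version(text):
    """Return the first dotted version found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no dotted version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def compare_versions(current, candidate):
    """Return 1 if candidate is newer than current, -1 if older, 0 if equal."""
    a = parse_fw_version(current)
    b = parse_fw_version(candidate)
    # Pad the shorter tuple with zeros so that 5.6 and 5.6.0 compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return (b > a) - (b < a)


IMAGE = "AQR-G4_v5.6.1-AQR_Marvell_NoSwap_XFI_ID44874_VER1836.cld"
print(parse_fw_version(IMAGE), compare_versions("5.6", IMAGE))
```

This only inspects the version string embedded in the file name, not the contents of the image, so it may differ from what the tool reads.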

Although the tool reported the input version as "Lesser", the file name indicated version 5.6.1, so a flash was attempted.

$ ./flash_aqr113c --Install mgbe1_0 AQR-G4_v5.6.1-AQR_Marvell_NoSwap_XFI_ID44874_VER1836.cld

The flash proceeded normally for a while, then began reporting mismatch errors and ultimately failed with return code 209.

  Bytes: 0x16F00
  Bytes: 0x17000
  Bytes: 0x17100
  Bytes: 0x17200
  Bytes: 0x17300
  Mismatch on byte 0x17358: Read 0xB1 - Should be: 0x90
  Mismatch on byte 0x17359: Read 0x69 - Should be: 0xA1
  Mismatch on byte 0x1735A: Read 0xA8 - Should be: 0x87
  Mismatch on byte 0x1735B: Read 0xC - Should be: 0xAC
  Mismatch on byte 0x1735C: Read 0xC - Should be: 0xB1
...
  Bytes: 0x5FE00
  Bytes: 0x5FF00
  Bytes: 0x60000
ret 209
Fail to Flash FW image with 209
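
When reporting a failure like this, it can help to condense the verify log into a few numbers. The sketch below is an illustration only: it assumes the log lines have exactly the "Mismatch on byte ..." shape shown above, and summarize_mismatches is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   Mismatch on byte 0x17358: Read 0xB1 - Should be: 0x90
MISMATCH = re.compile(
    r"Mismatch on byte (0x[0-9A-Fa-f]+): "
    r"Read (0x[0-9A-Fa-f]+) - Should be: (0x[0-9A-Fa-f]+)"
)


def summarize_mismatches(log_text):
    """Return count, first and last offset of verify mismatches, or None."""
    offsets = [int(offset, 16) for offset, _read, _want in MISMATCH.findall(log_text)]
    if not offsets:
        return None
    return {"count": len(offsets), "first": min(offsets), "last": max(offsets)}


SAMPLE_LOG = """\
  Bytes: 0x17300
  Mismatch on byte 0x17358: Read 0xB1 - Should be: 0x90
  Mismatch on byte 0x17359: Read 0x69 - Should be: 0xA1
  Mismatch on byte 0x1735A: Read 0xA8 - Should be: 0x87
  Mismatch on byte 0x1735B: Read 0xC - Should be: 0xAC
  Mismatch on byte 0x1735C: Read 0xC - Should be: 0xB1
"""

summary = summarize_mismatches(SAMPLE_LOG)
print(summary["count"], hex(summary["first"]), hex(summary["last"]))
```

Comparing the first failing offset across repeated attempts would show whether the verify step always breaks at the same place or at a different one each time.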

After a cold reboot (full AGX power cycle), the PHY firmware appears to be corrupted and the PHY no longer links up to any BASE-T link partner. When probed, the firmware version now reports 0.0.

$ ./flash_aqr113c --GetCurrentVersion mgbe1_0
0.0

What is the proper method to update / flash the PHY firmware on the Drive AGX Orin dev kit with DriveOS for the Aquantia AQR113C?

Dear @bmargosian2,
We are checking internally with our team on this issue. I will let you know once I have an update.

Dear @bmargosian2,
Can you try flashing the FW again with the same command?

Hi @SivaRamaKrishnaNV ,

We did this and got the following results:

  1. On a system that had been running for a few hours, the update failed every time (attempted at least 3 times), with the same return code of 209.

  2. On a system that was left to sit overnight, the initial update failed (return code 209), but a subsequent update passed with the following result:

OK (CRC 0x394E)
Flashing done sucessfully
New FW version: 0.0

However, after the AQR113C PHY was successfully flashed and the Orin AGX kit was power cycled, we still did not have a link on the 10GBASE-T mgbe1_0 port.

We ran the ./flash_aqr113c --GetCurrentVersion mgbe1_0 command to check the version, and it reported 5.6.

Running ethtool mgbe1_0 repeatedly showed the speed alternating between Unknown and 1000Mb/s, always with Link detected: no, and the link partner did not detect a link either. We tried multiple link partners and cables.

How can we fix the link issue?

@SivaRamaKrishnaNV, is there any fix for this issue? I have run into the same problem.

Dear @riteshg,
Did you try updating the AQR113C PHY and then notice issues? If you have not updated it, please hold off. An internal discussion on this is ongoing.

Yes, we tried updating the AQR113C PHY and are now stuck: the link does not come up.

The reason for the update was that I was not getting the 10 Gbps data rate. We only got 2 Gbps on a point-to-point connection and suspected the PHY firmware.

Dear @riteshg,
Could you share the dmesg/boot logs and the ethtool log? I assume this mgbe1_0 interface was working earlier and stopped working after flashing the FW?

dmesg --syslog.rtf (74.9 KB)
journalctl -b -1.rtf (2.9 KB)

Dear @abelton ,
Could you share the ethtool <iface> output as well?

Dear @abelton,
The dmesg logs confirm that the link is up at 10G on the device side, so data transfers should work once the link is up.

[   20.256400] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][456][type:0x4][loga-0x0] PCS block lock SUCCESS
[   20.257274] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off
[   20.257431] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off

From ifconfig I see that some packets are also being received and transmitted. I also see that the MTU has been changed to about 9K, which suggests you are validating jumbo frames.

mgbe1_0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 8966
        inet 10.0.1.11  netmask 255.255.255.0  broadcast 10.0.1.255
        inet6 fe80::4ab0:2dff:fec1:1c1  prefixlen 64  scopeid 0x20<link>
        ether 48:b0:2d:c1:01:c1  txqueuelen 1000  (Ethernet)
        RX packets 66962763  bytes 65534615248 (65.5 GB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 64466  bytes 5984445 (5.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

  1. Are you able to ping mgbe1_0 at 10.0.1.11? Is this IP on the same network?
  2. In the ping failure case, can you check "ifconfig mgbe1_0" and see whether any of the TX/RX packet counters are incrementing?
  3. In the failure case, can you capture the output of "ethtool -S mgbe1_0" continuously while ping is running? Make sure ping keeps running in the background during the capture so that we can check the MAC counters.
  4. In the ping failure case, can you capture tcpdump on the device side and the host side for the two cases below?
  • Run ping from the host side and check the tcpdump on the device side.
  • Run ping on the device side and check the tcpdump on the host side.
  5. Can you validate with the default MTU of 1500?
  6. Just to rule out the host, please check whether the host works with another device in this experiment.
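
For the "ethtool -S" capture in step 3, one way to make the result easier to read is to diff two snapshots and keep only the counters that moved. This is a minimal sketch, assuming the usual "name: value" layout of ethtool -S output; the counter names in the sample are made up for illustration and are not taken from the nvethernet driver.

```python
def parse_stats(text):
    """Parse 'name: value' lines from ethtool -S output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip the 'NIC statistics:' header and any non-numeric values.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def changed_counters(before_text, after_text):
    """Return {counter: delta} for counters whose value differs between snapshots."""
    before = parse_stats(before_text)
    after = parse_stats(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Hypothetical snapshots taken a few seconds apart while ping is running.
BEFORE = "NIC statistics:\n     rx_packets: 100\n     tx_packets: 50\n     rx_crc_errors: 0\n"
AFTER = "NIC statistics:\n     rx_packets: 100\n     tx_packets: 54\n     rx_crc_errors: 0\n"

print(changed_counters(BEFORE, AFTER))
```

In this made-up sample only the transmit counter moves, which is the kind of pattern that would point at the receive path or the link partner rather than the local transmit side.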

Bad ethtool.cap (1.4 KB)
Bad syslog.cap (125.3 KB)

Bad ifconfig.rtf (4 KB)
Bad journalctl.rtf (2.0 KB)

Bad ethtool.rtf (2.1 KB)

Dear @abelton,
Could you try a hardware reset by writing 0x1 to register 1E.2681 on the AQR113C via MDIO, to see whether it recovers the PHY from the bad state?
Please see "Orin can not read the PHY reg data - #15 by WayneWWW" and check whether
phytool write mgbe1_0/0:0x1e/0x2681 0x1 helps.
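
The location string in that command packs four values: interface, PHY address, MMD (device) number and register. The sketch below shows how the pieces map onto register 1E.2681; it only builds the command and does not run it. The helper names are hypothetical, and the PHY address 0 is taken from the command above.

```python
def c45_location(iface, phy_addr, mmd, reg):
    """Build a phytool Clause 45 location: IFACE/PHYADDR:MMD/REG."""
    return f"{iface}/{phy_addr}:{mmd:#x}/{reg:#06x}"


def phy_reset_command(iface="mgbe1_0", phy_addr=0):
    """Return the argv that writes 0x1 to register 1E.2681 of the PHY."""
    return ["phytool", "write", c45_location(iface, phy_addr, 0x1E, 0x2681), "0x1"]


print(" ".join(phy_reset_command()))
```

Running the resulting command needs root privileges and a phytool binary on the target.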

Hello @SivaRamaKrishnaNV,

This method does seem to allow the PHY to link up. See the results below:

Prior to phy reset:

[  624.875195] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][449][type:0x4][loga-0x0] Failed to get PCS block lock

The PHY register read succeeds and returns 0x0000:

nvidia@tegra-ubuntu:~/phytool$ sudo ./phytool read mgbe1_0/0:0x1e/0x2681
0000

The ethtool output cycles between Unknown and 1000Mb/s, with the link down in both cases:

nvidia@tegra-ubuntu:~/phytool$ sudo ethtool mgbe1_0
Settings for mgbe1_0:
	Supported ports: [ ]
	Supported link modes:   100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: Unknown!
	Duplex: Unknown! (255)
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: Unknown
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000000 (0)

	Link detected: no
nvidia@tegra-ubuntu:~/phytool$ sudo ethtool mgbe1_0
Settings for mgbe1_0:
	Supported ports: [ ]
	Supported link modes:   100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  1000baseT/Full
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: No
	Link partner advertised FEC modes: Not reported
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: Unknown
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000000 (0)

	Link detected: no
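
The two captures above differ in only a few fields. A small parser makes that kind of comparison easier when ethtool is sampled repeatedly. This is a sketch assuming the "Key: value" layout shown above; parse_ethtool and link_state are hypothetical helper names.

```python
def parse_ethtool(text):
    """Collect 'Key: value' lines of ethtool output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Continuation lines (extra link modes) have no colon and are skipped,
        # as is the 'Settings for ...:' header, which has an empty value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def link_state(text):
    """Return (speed, link_detected) from ethtool output."""
    fields = parse_ethtool(text)
    return fields.get("Speed"), fields.get("Link detected") == "yes"


SAMPLE = "\n".join([
    "Settings for mgbe1_0:",
    "\tAdvertised link modes:  100baseT/Half 100baseT/Full",
    "\t                        1000baseT/Full",
    "\tSpeed: 1000Mb/s",
    "\tDuplex: Full",
    "\tLink detected: no",
])

print(link_state(SAMPLE))
```

A reported speed combined with no detected link, as in the second capture, is the state that distinguishes this failure from a simple unplugged cable.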

After issuing the phy reset via register 1E.2681:

nvidia@tegra-ubuntu:~/phytool$ sudo ./phytool write mgbe1_0/0:0x1e/0x2681 0x1

We see the following in dmesg:

[  624.862970] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[  624.864057] IPv6: ADDRCONF(NETDEV_CHANGE): mgbe1_0: link becomes ready
[  624.875195] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][449][type:0x4][loga-0x0] Failed to get PCS block lock
[  704.734969] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[  783.582981] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[  862.430970] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[  941.278968] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[ 1019.102976] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[ 1097.950973] Aquantia AQR113C 6910000.ethernet:00: Downshift occurred from negotiated speed 1Gbps to actual speed 100Mbps, check cabling!
[ 1163.488017] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][456][type:0x4][loga-0x0] PCS block lock SUCCESS
[ 1163.489625] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 1Gbps/Full - flow control off
[ 1584.350876] nvethernet 6910000.ethernet mgbe1_0: Link is Down
[ 1598.431223] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][456][type:0x4][loga-0x0] PCS block lock SUCCESS
[ 1598.432251] nvethernet 6910000.ethernet mgbe1_0: Link is Down
[ 1598.686898] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off
[ 1599.391220] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][456][type:0x4][loga-0x0] PCS block lock SUCCESS
[ 1599.392235] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off
[ 1599.999233] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][456][type:0x4][loga-0x0] PCS block lock SUCCESS
[ 1600.000257] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off
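
To follow the sequence in a log like this, the link-related lines can be reduced to a list of timestamped events. This is a minimal sketch that matches only the four message shapes seen in the log above; link_events is a hypothetical helper name.

```python
import re

# Timestamp in brackets, then one of the link-related messages on the same line.
EVENT = re.compile(
    r"\[\s*(\d+\.\d+)\]\s.*?"
    r"(Failed to get PCS block lock|PCS block lock SUCCESS"
    r"|Link is Up - \S+|Link is Down)"
)


def link_events(dmesg_text):
    """Return [(timestamp_seconds, event_text), ...] in log order."""
    return [(float(stamp), event) for stamp, event in EVENT.findall(dmesg_text)]


SAMPLE = """\
[  624.875195] nvethernet 6910000.ethernet: [xpcs_lane_bring_up][449][type:0x4][loga-0x0] Failed to get PCS block lock
[ 1163.489625] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 1Gbps/Full - flow control off
[ 1584.350876] nvethernet 6910000.ethernet mgbe1_0: Link is Down
[ 1598.686898] nvethernet 6910000.ethernet mgbe1_0: Link is Up - 10Gbps/Full - flow control off
"""

for stamp, event in link_events(SAMPLE):
    print(f"{stamp:10.3f}  {event}")
```

Lines that carry none of these messages, such as the downshift warnings, are ignored.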

The link comes up at 10Gb after an RJ45 cable unplug/replug (see dmesg above):

nvidia@tegra-ubuntu:~/phytool$ sudo ethtool mgbe1_0
Settings for mgbe1_0:
	Supported ports: [ ]
	Supported link modes:   100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  100baseT/Half 100baseT/Full
	                        1000baseT/Full
	                        1000baseKX/Full
	                        10000baseT/Full
	                        10000baseKX4/Full
	                        10000baseKR/Full
	                        2500baseT/Full
	                        5000baseT/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Full
	                                     10000baseT/Full
	                                     2500baseT/Full
	                                     5000baseT/Full
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported
	Speed: 10000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	MDI-X: Unknown
	Supports Wake-on: d
	Wake-on: d
	Current message level: 0x00000000 (0)

	Link detected: yes

Why does issuing a PHY reset cause the PCS to re-link here? Is there a race condition in the Orin PCS startup?

Issuing a PHY reset causes the PCS to re-link because the 10G PHY (AQR113C) is not being properly initialized upon power-on reset. Your feedback in identifying this issue is greatly appreciated, and we will now work on a solution to address it.

Why does it occur only on some AGX devices and not all? Can you provide any other details?
We also saw these devices run for hours without showing the issue and then show it continuously later. What would cause the failure mode to appear only after the device has been in use?

We will wait for your proposed solution (I assume it will be executing a call like this at boot or modifying the nvethernet driver to send this command at initialization for mgbe1_0).

Can you confirm if you began experiencing this failure after some time of using the system, and once the failure occurred, it was easily reproducible? Additionally, did the use of the phy-resetting workaround allow for a successful recovery? Is this summary accurate?

Yes, we saw the failure on some systems after some time of using them (probably 5 hours or more). It would only appear after a power cycle though, not at runtime, but after a system showed the issue it was fairly reproducible.

Some systems would show this issue on every subsequent boot cycle (cold or warm), and some would appear to work on the first power cycle after being left off overnight (cold). On the systems that seemed to work on the first boot after being left overnight, we could reproduce the issue by power cycling the system after it had been running for a bit.

I can confirm that on every system where I have tested the PHY reset method above (2 systems so far), it has allowed the link to come up for the current boot cycle. Soft reboots (sudo reboot or common_if_testapp -mcureset) would not show the issue again, but a full power cycle (unplugging power, waiting 90s, and plugging power back in) would require the PHY reset to be issued again before the link would come up.
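
Until an official fix is available, the manual steps described above could in principle be scripted to run once after each cold boot. The following is only a sketch of that idea under stated assumptions, not NVIDIA's solution: it assumes phytool and ethtool are on the PATH, that the script runs as root, and that waiting a fixed time for the link is an acceptable trigger. The function names, the 30 s wait and the 2 s poll interval are arbitrary choices for illustration.

```python
import subprocess
import time


def run_cmd(argv):
    """Run a command and return its stdout as text (errors are not raised)."""
    return subprocess.run(argv, capture_output=True, text=True, check=False).stdout


def link_detected(ethtool_output):
    """True if ethtool reports 'Link detected: yes'."""
    return "Link detected: yes" in ethtool_output


def ensure_link(iface="mgbe1_0", wait_s=30.0, poll_s=2.0, run=run_cmd, sleep=time.sleep):
    """Wait for the link; if it never comes up, issue the PHY reset once.

    Returns True if the reset was issued, False if the link came up on its own.
    """
    waited = 0.0
    while waited < wait_s:
        if link_detected(run(["ethtool", iface])):
            return False
        sleep(poll_s)
        waited += poll_s
    # Same register write as the manual workaround: 0x1 to 1E.2681.
    run(["phytool", "write", f"{iface}/0:0x1e/0x2681", "0x1"])
    return True
```

Calling ensure_link() from a boot-time service would reproduce the manual workaround. In the test above, the 10Gb link appeared only after the cable was also replugged, so a reset alone may not always be sufficient.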

Dear @riteshg ,
Could you file a new topic with the requested logs in case you are stuck with this issue? Thanks.