Summary
On a ConnectX-5 NIC, if the cable is connected and disconnected repeatedly in quick succession, the link eventually stays down and cannot be brought back up without resetting the NIC.
System Information
The server under test runs BC-Linux 8.2 with the latest MLNX_OFED driver and firmware (exact versions are listed below). Two dual-port ConnectX-5 NICs are installed, each with one port connected to a switch by a DAC cable. However, only one NIC is needed to reproduce the issue. The port ens1f0np0 is used here.
Miscellaneous Version Information
[root@localhost ~]# uname -a
Linux localhost.localdomain 4.19.0-240.23.11.el8_2.bclinux.x86_64 #1 SMP Wed Jun 2 16:11:31 CST 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# ofed_info -s
MLNX_OFED_LINUX-5.8-1.0.1.1:
[root@localhost ~]# lspci | grep -i mellanox
3b:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
3b:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
[root@localhost ~]# ibv_devinfo
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 16.35.1012
node_guid: 248a:0703:00a3:e3bc
sys_image_guid: 248a:0703:00a3:e3bc
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 16.35.1012
node_guid: 248a:0703:00a3:e3bd
sys_image_guid: 248a:0703:00a3:e3bc
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.35.1012
node_guid: b859:9f03:00c1:f1aa
sys_image_guid: b859:9f03:00c1:f1aa
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.35.1012
node_guid: b859:9f03:00c1:f1ab
sys_image_guid: b859:9f03:00c1:f1aa
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
[root@localhost ~]# ethtool ens1f0np0
Settings for ens1f0np0:
Supported ports: [ Backplane ]
Supported link modes: 1000baseKX/Full
10000baseKR/Full
40000baseKR4/Full
40000baseCR4/Full
40000baseSR4/Full
40000baseLR4/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
50000baseCR2/Full
50000baseKR2/Full
100000baseKR4/Full
100000baseSR4/Full
100000baseCR4/Full
100000baseLR4_ER4/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None BaseR RS
Advertised link modes: 1000baseKX/Full
10000baseKR/Full
40000baseKR4/Full
40000baseCR4/Full
40000baseSR4/Full
40000baseLR4/Full
25000baseCR/Full
25000baseKR/Full
25000baseSR/Full
50000baseCR2/Full
50000baseKR2/Full
100000baseKR4/Full
100000baseSR4/Full
100000baseCR4/Full
100000baseLR4_ER4/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: RS
Speed: 100000Mb/s
Duplex: Full
Port: Direct Attach Copper
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000004 (4)
link
Link detected: yes
Steps to Reproduce
1. Connect the server port to the switch and verify that the link is up.
2. Shut down the switch port. The link state on the server goes down shortly afterwards.
3. After 5 seconds, bring the switch port back up.
4. As soon as the server link comes up, shut down the switch port again.
5. Repeat steps 3 and 4 about 15 times.
6. Eventually the server link no longer comes up, even if the switch port is left enabled:
[root@localhost ~]# ip link show ens1f0np0
52: ens1f0np0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
link/ether b8:59:9f:c1:f1:aa brd ff:ff:ff:ff:ff:ff
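To time the flaps during reproduction, the carrier state can be polled on the server side. The following is only a minimal sketch, assuming the standard Linux sysfs file /sys/class/net/<iface>/carrier is readable; the helper names (read_carrier, link_events, watch) and the 0.5 second polling interval are illustrative choices, not part of any Mellanox tool.

```python
import time
from pathlib import Path


def read_carrier(iface: str) -> bool:
    """Return True if the kernel reports carrier on the interface."""
    try:
        return Path(f"/sys/class/net/{iface}/carrier").read_text().strip() == "1"
    except OSError:
        # The file is missing or unreadable, e.g. while the interface
        # is administratively down; treat that as "no carrier".
        return False


def link_events(samples):
    """Yield (sample_index, state) each time the carrier value changes.

    The first sample is always reported so the initial state is known.
    """
    previous = None
    for index, state in enumerate(samples):
        if state != previous:
            yield index, state
            previous = state


def watch(iface: str = "ens1f0np0", interval: float = 0.5) -> None:
    """Print a timestamped line for every carrier transition (runs forever)."""
    def samples():
        while True:
            yield read_carrier(iface)
            time.sleep(interval)

    for _, up in link_events(samples()):
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} {iface} carrier {'up' if up else 'down'}")
```

Running watch() in a second terminal while toggling the switch port gives a log of when each transition was seen, which makes it easier to tell on which iteration the link stops recovering.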
[root@localhost ~]# mlxlink -d /dev/mst/mt4119_pciconf0
Operational Info
----------------
State : Polling
Physical state : ETH_AN_FSM_ABILITY_DETECT
Speed : N/A
Width : N/A
FEC : N/A
Loopback Mode : No Loopback
Auto Negotiation : ON
Supported Info
--------------
Enabled Link Speed : 0xf8f1f0d3 (100G,50G,40G,25G,10G,1G)
Supported Cable Speed : 0x48101165 (100G,56G,50G,40G,25G,20G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 2
Group Opcode : PHY FW
Recommendation : Negotiation failure
Tool Information
----------------
Firmware Version : 16.35.1012
MFT Version : mft 4.22.0-96
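For repeated runs it may help to capture the mlxlink fields automatically. The sketch below assumes the plain "Key : Value" layout shown in the output above and only uses the `mlxlink -d <device>` invocation from this post; the function names and the rule in looks_stuck (state Polling with no negotiated speed) are my own illustrative assumptions, and a port that is still coming up normally can match that rule briefly.

```python
import subprocess


def parse_mlxlink(text: str) -> dict:
    """Collect 'Key : Value' lines from mlxlink output into a dict.

    Section titles and dashed separator lines contain no colon and are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck(fields: dict) -> bool:
    """Heuristic (assumption): port is polling and has no negotiated speed."""
    return fields.get("State") == "Polling" and fields.get("Speed") == "N/A"


def query_port(device: str = "/dev/mst/mt4119_pciconf0") -> dict:
    """Run mlxlink against the given MST device and parse its output."""
    result = subprocess.run(
        ["mlxlink", "-d", device],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_mlxlink(result.stdout)
```

Calling query_port() a few seconds after the switch port has been re-enabled and checking looks_stuck() on the result, together with the "Status Opcode" and "Recommendation" fields, gives a record of the state the port is left in.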
With auto-negotiation disabled on both the server and the switch, the reported recommendation is ‘Other issues’ instead:
[root@localhost ~]# mlxlink -d /dev/mst/mt4119_pciconf0
Operational Info
----------------
State : Polling
Physical state : ETH_AN_FSM_ABILITY_DETECT
Speed : N/A
Width : N/A
FEC : N/A
Loopback Mode : No Loopback
Auto Negotiation : FORCE - 100G
Supported Info
--------------
Enabled Link Speed : 0x00f00000 (100G)
Supported Cable Speed : 0x48101165 (100G,56G,50G,40G,25G,20G,10G,1G)
Troubleshooting Info
--------------------
Status Opcode : 36
Group Opcode : PHY FW
Recommendation : Other issues
Tool Information
----------------
Firmware Version : 16.35.1012
MFT Version : mft 4.22.0-96
Resetting the NIC with mlxfwreset is usually enough to bring the link back up. Any ideas on how to resolve this issue? Thanks!