MCX555A-ECAT card installed in Windows Server 2019. No link using InfiniBand

Hello.

I purchased a ConnectX-5 VPI adapter card (MCX555A-ECAT) and have it running in a PCIe 4.0 x16 slot on my new H12SSL-i motherboard. The OS is Windows Server 2019.

I’m using a Mellanox MSX6005F-2BFS switch and a new MC2207128-003 passive copper cable.

The subnet manager (opensm) is running on a CentOS machine with a ConnectX-2 VPI card.

@strsrv ~]$ sudo cat /etc/system-release
CentOS Linux release 7.9.2009 (Core)

@strsrv ~]$ sudo yum info opensm-3.3.21-4.el7_9.x86_64
Installed Packages
Name : opensm
Arch : x86_64
Version : 3.3.21
Release : 4.el7_9
Size : 1.4 M
Repo : installed
From repo : updates
Summary : OpenIB InfiniBand Subnet Manager and management utilities

@strsrv ~]$ lspci | grep Mellanox
82:00.0 Network controller: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In… (rev b0)

My problem is that the ConnectX-5 card never gets a link on the Windows side when connected to the SX6005 switch, despite trying different cables and using the current versions of WinOF-2 and the firmware.

Not working:

c:\Program Files\Mellanox\MLNX_WinOF2\Management Tools>mlx5cmd -dbg -pddrinfo
NIC 1:
Adapter: Mellanox ConnectX-5 Adapter
Location (PCI bus, device, function): (129,0,0)
Operational Info
State : Polling
Active protocol : InfiniBand
Physical state : 0
Active link width : Unknown
Enabled PHY manager link width : {1x}
Enabled core to PHY link width : {1x}
Active link speed : Unknown
Enabled PHY manager link speed : {SDR, DDR, QDR, FDR10, FDR, EDR}
Enabled core to PHY link speed : {SDR, DDR, QDR, FDR10, FDR}
Active negotiation mode : 4
Loopback mode : 0
FEC : 0
Cable supported speeds : {SDR, DDR, QDR, FDR10, FDR}

Troubleshoot Info
Status Opcode : 2
Group Opcode : 0
Message : Negotiation failure

c:\Program Files\Mellanox\WinMFT>flint -d mt4119_pciconf0 query
Image type: FS4
FW Version: 16.35.3006
FW Release Date: 6.7.2023
Product Version: 16.35.3006
Rom Info: type=UEFI version=14.29.15 cpu=AMD64
type=PXE version=3.6.902 cpu=AMD64
Description: UID GuidsNumber
Base GUID: 506b4b030043fb22 4
Base MAC: 506b4b43fb22 4
Image VSD: N/A
Device VSD: N/A
PSID: MT_0000000010
Security Attributes: N/A

c:\Program Files\Mellanox\WinMFT>mlxlink -d mt4119_pciconf0

Operational Info

State : Polling
Physical state : ETH_AN_FSM_ABILITY_DETECT
Speed : N/A
Width : N/A
FEC : N/A
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)
Supported Cable Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info

Status Opcode : 2
Group Opcode : PHY FW
Recommendation : Negotiation failure

Tool Information

Firmware Version : 16.35.3006
MFT Version : mft 4.26.1-3
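For readers puzzling over the `0x0000001f` value above, here is a minimal Python sketch of how such a speed mask can be decoded. The bit order (SDR as the lowest bit, then DDR, QDR, FDR10, FDR, EDR) is an assumption inferred from the names mlxlink prints next to the mask, and the function name is made up for illustration.

```python
# Assumed bit layout for the InfiniBand speed mask printed by mlxlink.
# Inferred from the output above (0x1f -> FDR,FDR10,QDR,DDR,SDR); verify
# against the MFT documentation before relying on it.
IB_SPEED_BITS = ["SDR", "DDR", "QDR", "FDR10", "FDR", "EDR"]


def decode_speed_mask(mask):
    """Return the speed names whose bit is set in mask, lowest bit first."""
    return [name for bit, name in enumerate(IB_SPEED_BITS) if mask & (1 << bit)]


if __name__ == "__main__":
    print(decode_speed_mask(0x1F))
```

Under that assumption the mask has no EDR bit set, which would be consistent with the cable EEPROM reporting FDR as its highest speed.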

C:\Users\Administrator.XTC41>mlxcables
Querying Cables …

Cable #1:

Cable name : mt4119_pciconf0_cable_0

No FW data to show
-------- Cable EEPROM --------
Identifier : QSFP+ (0dh)
Technology : Copper cable unequalized (a0h)
Compliance : 40GBASE-CR4, FDR,QDR,DDR,SDR
Attenuation: 2.5GHz : 6dB
5.0GHz : 10dB
7.0GHz : 13dB
12.9GHz : 0dB
25.78GHz : 0dB
OUI : 0x0002c9
Vendor : Mellanox
Serial number : MT2022VS01614
Part number : MC2207128-003
Revision : A3
Temperature [c] : N/A
Digital Diagnostic Monitoring : NO
Length [m] : 3 m

If I connect the ConnectX-2 card to the ConnectX-5 card back to back (no SX6005 switch) with the same FDR cable (MC2207128-003), both sides show link up.

If I remove the FDR cable from a working ConnectX-2 card and plug it into the ConnectX-5 card, leaving the other end in its port on the SX6005 switch, the ConnectX-5 comes up. If I then take one of the new FDR cables, plug it into another port on the switch, and attach it to the ConnectX-2 card, that link doesn’t come up.

ConnectX-5 card when it’s up:

c:\Program Files\Mellanox\MLNX_WinOF2\Management Tools>mlx5cmd -dbg -pddrinfo
NIC 1:
Adapter: Mellanox ConnectX-5 Adapter
Location (PCI bus, device, function): (129,0,0)
Operational Info
State : Active
Active protocol : InfiniBand
Physical state : 7
Active link width : {4x}
Enabled PHY manager link width : {1x}
Enabled core to PHY link width : {1x}
Active link speed : {FDR}
Enabled PHY manager link speed : {SDR, DDR, QDR, FDR10, FDR, EDR}
Enabled core to PHY link speed : {SDR, DDR, QDR, FDR10, FDR}
Active negotiation mode : 1
Loopback mode : 0
FEC : 0
Cable supported speeds : {SDR, DDR, QDR, FDR10, FDR}

    Troubleshoot Info
            Status Opcode                      : 0
            Group Opcode                       : 0
            Message                            : No issue was observed

C:\Program Files\Mellanox\WinMFT>mlxlink -d mt4119_pciconf0

Operational Info

State : Active
Physical state : LinkUp
Speed : IB-FDR
Width : 4x
FEC : No FEC
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)
Supported Cable Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 16.35.3502
MFT Version : mft 4.26.1-3
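To compare a failing capture with a working one field by field, something like the following Python sketch may help. It assumes the tool output is saved as text and that each field is a `Key : Value` line, as in the mlxlink output pasted here. The function names are hypothetical.

```python
def parse_fields(text):
    """Collect 'Key : Value' lines into a dict; other lines are ignored."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_fields(before, after):
    """Return {key: (before_value, after_value)} for fields that differ."""
    a, b = parse_fields(before), parse_fields(after)
    return {k: (a.get(k), b.get(k))
            for k in sorted(a.keys() | b.keys())
            if a.get(k) != b.get(k)}


if __name__ == "__main__":
    failing = "State : Polling\nPhysical state : ETH_AN_FSM_ABILITY_DETECT\n"
    working = "State : Active\nPhysical state : LinkUp\n"
    for key, (old, new) in diff_fields(failing, working).items():
        print(f"{key}: {old} -> {new}")
```

Run against the two full mlxlink captures in this post, a diff like this would also surface the firmware version line, which reads 16.35.3006 in the failing capture and 16.35.3502 in the working one.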

Can someone explain what’s going on here?

Thanks

After thinking about this overnight, I decided to reboot the SX6005 switch this morning. Well, what do you know, the card came up. But this switch has no built-in management.

C:\Users\Administrator.XTC41>mlxcables
Querying Cables …

Cable #1:

Cable name : mt4119_pciconf0_cable_0

No FW data to show
-------- Cable EEPROM --------
Identifier : QSFP+ (0dh)
Technology : Copper cable unequalized (a0h)
Compliance : 40GBASE-CR4, FDR,QDR,DDR,SDR
Attenuation: 2.5GHz : 6dB
5.0GHz : 10dB
7.0GHz : 13dB
12.9GHz : 0dB
25.78GHz : 0dB
OUI : 0x0002c9
Vendor : Mellanox
Serial number : MT2022VS01614
Part number : MC2207128-003
Revision : A3
Temperature [c] : N/A
Digital Diagnostic Monitoring : NO
Length [m] : 3 m

C:\Users\Administrator.XTC41>mlx5cmd -dbg -pddrinfo
NIC 1:
Adapter: Mellanox ConnectX-5 Adapter
Location (PCI bus, device, function): (129,0,0)
Operational Info
State : Active
Active protocol : InfiniBand
Physical state : 7
Active link width : {4x}
Enabled PHY manager link width : {1x}
Enabled core to PHY link width : {1x}
Active link speed : {FDR}
Enabled PHY manager link speed : {SDR, DDR, QDR, FDR10, FDR, EDR}
Enabled core to PHY link speed : {SDR, DDR, QDR, FDR10, FDR}
Active negotiation mode : 1
Loopback mode : 0
FEC : 0
Cable supported speeds : {SDR, DDR, QDR, FDR10, FDR}

    Troubleshoot Info
            Status Opcode                      : 0
            Group Opcode                       : 0
            Message                            : No issue was observed

C:\Users\Administrator.XTC41>mlxlink -d mt4119_pciconf0

Operational Info

State : Active
Physical state : LinkUp
Speed : IB-FDR
Width : 4x
FEC : No FEC
Loopback Mode : No Loopback
Auto Negotiation : ON

Supported Info

Enabled Link Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)
Supported Cable Speed : 0x0000001f (FDR,FDR10,QDR,DDR,SDR)

Troubleshooting Info

Status Opcode : 0
Group Opcode : N/A
Recommendation : No issue was observed

Tool Information

Firmware Version : 16.35.3502
MFT Version : mft 4.26.1-3

[@strsrv ~]$ sudo ibnetdiscover

Topology file: generated on Sun Feb 4 07:24:58 2024

Initiated from node 0002c903000a4de4 port 0002c903000a4de5

vendid=0x2c9
devid=0xc738
sysimgguid=0xf452140300044f80
switchguid=0xf452140300044f80(f452140300044f80)
Switch 12 “S-f452140300044f80” # “SwitchX - Mellanox Technologies” base port 0 lid 6 lmc 0
[1] "H-0002c9030059c866"1 # “wrkbox1.xtc41.com HCA-1” lid 8 4xQDR
[2] "H-0002c903000a4de4"1 # “strsrv mlx4_0” lid 2 4xQDR
[3] "H-0002c9030059c862"2 # “OLSEN” lid 1 4xQDR
[4] "H-0002c903000a4de4"2 # “strsrv mlx4_0” lid 10 4xQDR
[5] "H-0002c903000d7ee8"1 # “PEGASUS” lid 4 4xQDR
[7] "H-506b4b030043fb22"1 # “APPSRVR ibp129s0f0” lid 5 4xFDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000d7eeb
caguid=0x2c903000d7ee8
Ca 1 “H-0002c903000d7ee8” # “PEGASUS”
1 “S-f452140300044f80”[5] # lid 4 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xQDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c9030059c865
caguid=0x2c9030059c862
Ca 2 “H-0002c9030059c862” # “OLSEN”
2 “S-f452140300044f80”[3] # lid 1 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xQDR

vendid=0x2c9
devid=0x1017
sysimgguid=0x506b4b030043fb22
caguid=0x506b4b030043fb22
Ca 1 “H-506b4b030043fb22” # “APPSRVR ibp129s0f0”
1 “S-f452140300044f80”[7] # lid 5 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xFDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c9030059c869
caguid=0x2c9030059c866
Ca 2 “H-0002c9030059c866” # “wrkbox1.xtc41.com HCA-1”
1 “S-f452140300044f80”[1] # lid 8 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xQDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000a4de7
caguid=0x2c903000a4de4
Ca 2 “H-0002c903000a4de4” # “strsrv mlx4_0”
1 “S-f452140300044f80”[2] # lid 2 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xQDR
2 “S-f452140300044f80”[4] # lid 10 lmc 0 “SwitchX - Mellanox Technologies” lid 6 4xQDR
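To pull the per-port link rates out of a dump like the one above, a minimal Python sketch follows. It assumes the switch port lines look like the ones pasted here: a port number in brackets, a quoted `H-` GUID, and a trailing `lid N` plus rate. The regular expression and function name are my own, not part of any Mellanox tool.

```python
import re

# Matches switch port lines such as:
#   [7] "H-506b4b030043fb22"1 # "APPSRVR ibp129s0f0" lid 5 4xFDR
# The line layout is assumed from the pasted output.
PORT_LINE = re.compile(
    r'^\[(\d+)\]\s+"H-([0-9a-f]+)".*\blid\s+(\d+)\s+(\d+x[A-Z0-9]+)\s*$'
)


def switch_ports(text):
    """Return one dict per switch port line found in ibnetdiscover text."""
    ports = []
    for line in text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            ports.append({
                "port": int(match.group(1)),
                "guid": match.group(2),
                "lid": int(match.group(3)),
                "rate": match.group(4),
            })
    return ports
```

In the dump above, switch port 7 (the ConnectX-5) is the only link reported at 4xFDR. All the ConnectX-2 links are at 4xQDR.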

You can probably reset the link with a tool such as mlxlink instead of rebooting the switch, which should have the same effect. These tools won’t work on an older switch based on the SwitchX ASIC, but they should work on the ConnectX-5 side. Even then, resetting from the adapter side may not clear the hang you experienced (the link failing to negotiate) if the cause is a glitch on the switch side, for example a state machine stuck in a deadlock.
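As a rough sketch of what an adapter-side reset could look like, the Python snippet below builds an mlxlink invocation that toggles the port state. It assumes the `--port_state TG` option described in the MFT documentation is available in your MFT version (check `mlxlink --help` first). The device name is the one from this thread and the helper names are hypothetical.

```python
import subprocess


def build_toggle_command(device, mlxlink="mlxlink"):
    """Build the argument list for toggling a port with mlxlink.

    Assumption: '--port_state TG' toggles the port down and up again;
    confirm with 'mlxlink --help' before use.
    """
    return [mlxlink, "-d", device, "--port_state", "TG"]


def toggle_port(device):
    """Run the toggle command; needs administrator rights and MFT installed."""
    cmd = build_toggle_command(device)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Only prints the command; call toggle_port() to actually run it.
    print(" ".join(build_toggle_command("mt4119_pciconf0")))
```

As noted above, this only acts on the ConnectX-5 end of the link, so it may not help if the switch port itself is stuck.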
