ConnectX-4 CX456A does not work with opensm

I have two servers, each with a ConnectX-4 VPI 100Gb NIC (model: CX456A, two ports). The two ports are connected back to back using two copper cables. I have no problem when the two ports are set to Ethernet mode; the performance is quite close to 100Gb/s. To try InfiniBand mode, I switched port one to InfiniBand mode and restarted the servers.

ibv_devinfo shows the following:

hca_id: mlx5_0

transport: InfiniBand (0)

fw_ver: 12.14.2036

node_guid: 7cfe:9003:0032:797a

sys_image_guid: 7cfe:9003:0032:797a

vendor_id: 0x02c9

vendor_part_id: 4115

hw_ver: 0x0

board_id: MT_2190110032

phys_port_cnt: 1

Device ports:

port: 1

state: PORT_DOWN (1)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 0

port_lid: 65535

port_lmc: 0x00

link_layer: InfiniBand

Then I started the opensm daemon (service opensmd start) on one of the servers, but opensm seems to have a problem setting the LID of my card:

Mar 09 15:06:48 031794 [1D22700] 0x03 → OpenSM 4.6.1.MLNX20160112.774e977

Mar 09 15:06:48 031842 [1D22700] 0x80 → OpenSM 4.6.1.MLNX20160112.774e977

Mar 09 15:06:48 032470 [1D22700] 0x02 → osm_vendor_init: 1000 pending umads specified

Mar 09 15:06:48 032516 [1D22700] 0x02 → osm_vendor_init: 1000 pending umads specified

Mar 09 15:06:48 051285 [1D22700] 0x80 → Entering DISCOVERING state

Mar 09 15:06:48 051416 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 086916 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 121806 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 121939 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 122094 [1D22700] 0x02 → osm_opensm_bind: Setting IS_SM on port 0x7cfe90030032797a

Mar 09 15:06:48 123326 [FF0F1700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:06:48 123690 [EE6D0700] 0x80 → SM port is down

Mar 09 15:06:58 052236 [FE0EF700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:08 052293 [FC0EB700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:18 052465 [FB8EA700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:28 052535 [F88E4700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:38 052566 [FF8F2700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:48 052771 [FE8F0700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:58 052805 [FC8EC700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:08:08 125373 [1D22700] 0x80 → Exiting SM

I tried this several times and it is always like that. I googled around but couldn’t find any useful information. Could you please give me a hint about what else I should do to find the reason?

Thank you so much!
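For anyone comparing against a similar opensm log, here is a minimal Python sketch that tallies the error codes and checks for the "SM port is down" message. The sample lines are taken from the log above (with the arrow written as `->`); the `summarize` helper and the regular expressions are illustrative assumptions, not part of opensm. Note that 65535 is 0xFFFF, which is not an assigned base LID, so a repeated ERR 0F04 alongside "SM port is down" is consistent with a port that never came up.

```python
import re
from collections import Counter

# Sample lines from the log in this post (arrow written as "->").
SAMPLE_LOG = """\
Mar 09 15:06:48 122094 [1D22700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x7cfe90030032797a
Mar 09 15:06:48 123326 [FF0F1700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:06:48 123690 [EE6D0700] 0x80 -> SM port is down
Mar 09 15:06:58 052236 [FE0EF700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
"""

# Hypothetical patterns, written to match the lines shown above.
ERR_RE = re.compile(r"ERR ([0-9A-F]{4}):")
LID_RE = re.compile(r"invalid base LID (\d+)")


def summarize(log_text):
    """Return (error-code counts, set of invalid LIDs, SM-port-down flag)."""
    errors = Counter()
    bad_lids = set()
    sm_port_down = False
    for line in log_text.splitlines():
        err = ERR_RE.search(line)
        if err:
            errors[err.group(1)] += 1
        lid = LID_RE.search(line)
        if lid:
            bad_lids.add(int(lid.group(1)))
        if "SM port is down" in line:
            sm_port_down = True
    return errors, bad_lids, sm_port_down


errors, bad_lids, sm_port_down = summarize(SAMPLE_LOG)
print(dict(errors), sorted(hex(lid) for lid in bad_lids), sm_port_down)
```

To run it on a real log, read the file contents into a string and pass that to `summarize` instead of `SAMPLE_LOG`.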

Thank you Sophie, here is the result I get:

SB7700-IB-100Gb [standalone: master] # show interface ib 1/1 transceiver

IB1/1 state:

Unknown cable.

identifier : (0x11)

cable/ module type : -

infiniband speeds : -

vendor : -

cable length : -

part number : -

revision : -

serial number : -

SB7700-IB-100Gb [standalone: master] # show interface ib 1/2 transceiver

IB1/2 state:

Cable is not present.

identifier : -

cable/ module type : -

infiniband speeds : -

vendor : -

cable length : -

part number : -

revision : -

serial number : -

SB7700-IB-100Gb [standalone: master] # show interface ib 1/3 transceiver

IB1/3 state:

Unknown cable.

identifier : (0x11)

cable/ module type : -

infiniband speeds : -

vendor : -

cable length : -

part number : -

revision : -

serial number : -

============================================================

My cables are connected to SB7700 ports 1 and 3. Port 2 is empty.

I also tried a back-to-back loop connection with both ports configured to IB mode. The link won’t come up either:

CA ‘mlx5_0’

CA type: MT4115

Number of ports: 1

Firmware version: 12.14.2036

Hardware version: 0

Node GUID: 0x7cfe90030032797a

System image GUID: 0x7cfe90030032797a

Port 1:

State: Down

Physical state: Disabled

Rate: 10

Base lid: 65535

LMC: 0

SM lid: 0

Capability mask: 0x2651e84a

Port GUID: 0x7cfe90030032797a

Link layer: InfiniBand

CA ‘mlx5_1’

CA type: MT4115

Number of ports: 1

Firmware version: 12.14.2036

Hardware version: 0

Node GUID: 0x7cfe90030032797b

System image GUID: 0x7cfe90030032797a

Port 1:

State: Down

Physical state: Disabled

Rate: 10

Base lid: 65535

LMC: 0

SM lid: 0

Capability mask: 0x2651e848

Port GUID: 0x7cfe90030032797b

Link layer: InfiniBand

It seems that the SB7700 switch complains about the cable model; I’m using MCP1600. Should I use a different cable for IB?
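As a quick way to check the same thing on several hosts, the sketch below parses ibstat-style text and lists the adapters whose physical state is not `LinkUp`. It is only an illustration: `parse_ibstat` and `ports_not_linked` are hypothetical helper names, the sample is an abbreviated copy of the output above with straight quotes, and the parser assumes one port per CA as in this setup.

```python
# Abbreviated copy of the ibstat output from this post.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Disabled
        Base lid: 65535
        Link layer: InfiniBand
CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Disabled
        Base lid: 65535
        Link layer: InfiniBand
"""


def parse_ibstat(text):
    """Parse ibstat-style text into {ca_name: {field: value}}.

    Assumes one port per CA, so port fields are stored flat.
    """
    cas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("CA '"):
            # Header line such as: CA 'mlx5_0'
            current = line.split("'")[1]
            cas[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            cas[current][key.strip()] = value.strip()
    return cas


def ports_not_linked(cas):
    """Return the CA names whose physical state is anything but LinkUp."""
    return [name for name, fields in cas.items()
            if fields.get("Physical state") != "LinkUp"]


cas = parse_ibstat(SAMPLE_IBSTAT)
print(ports_not_linked(cas))
```

With the output above, both `mlx5_0` and `mlx5_1` are reported, since each shows `Physical state: Disabled`.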

Thank you Eddie for the thoughts. I’m sure the physical cabling is correct, because Ethernet mode works without touching the hardware.

Sophie, what we have are MCP1600-C002 cables:

MCP1600-C002 Mellanox® Passive Copper cable, ETH1 100GbE, 100Gb/s, QSFP, LSZH, 2m

So we actually need MCP1600-E002 cables to support both IB and ETH. Is that correct?

MCP1600-E002 Mellanox® Passive Copper cable, VPI2 , up to 100Gb/s, QSFP, LSZH, 2m

I think you have an Ethernet cable that can’t support IB mode. Could you check the cable model in the CLI?

I also tried it with an SB7700 IB switch. The configuration shows that the subnet manager is enabled:

=================================================================

SB7700-IB-100Gb [standalone: master] (config) # show ib sm subnet-prefix

FE:80:00:00:00:00:00:00

SB7700-IB-100Gb [standalone: master] (config) # show ib sm sweep-interval

10 seconds

SB7700-IB-100Gb [standalone: master] (config) # show ib sm sweep-on-trap

enable

SB7700-IB-100Gb [standalone: master] (config) # show ib sm

enable

=================================================================

However, it did not detect the connections on ports 1 and 3:

=================================================================

SB7700-IB-100Gb [standalone: master] (config) # show interface ib status

Interface Description Speed Current line rate Logical port state Physical port state


IB1/1 - - Down Polling

IB1/2 - - Down Polling

IB1/3 - - Down Polling

IB1/4 - - Down Polling

IB1/5 - - Down Polling

IB1/6 - - Down Polling

IB1/7 - - Down Polling

IB1/8 - - Down Polling

IB1/9 - - Down Polling

IB1/10 - - Down Polling

IB1/11 - - Down Polling

IB1/12 - - Down Polling

IB1/13 - - Down Polling

IB1/14 - - Down Polling

IB1/15 - - Down Polling

IB1/16 - - Down Polling

IB1/17 - - Down Polling

IB1/18 - - Down Polling

IB1/19 - - Down Polling

IB1/20 - - Down Polling

IB1/21 - - Down Polling

IB1/22 - - Down Polling

IB1/23 - - Down Polling

IB1/24 - - Down Polling

IB1/25 - - Down Polling

IB1/26 - - Down Polling

IB1/27 - - Down Polling

IB1/28 - - Down Polling

IB1/29 - - Down Polling

IB1/30 - - Down Polling

IB1/31 - - Down Polling

IB1/32 - - Down Polling

IB1/33 - - Down Polling

IB1/34 - - Down Polling

IB1/35 - - Down Polling

IB1/36 - - Down Polling

===========================================================================

my 2c,

The issue is not with the subnet manager. The issue is that the physical link between the two servers (in the b2b setup) or between the servers and the switch (in the switch setup) is not coming up. The subnet manager is responsible for the logical side of things, but the physical links must be up first.

Hi Weijia,

Can you please provide from the switch the following outputs:

show interface ib 1/1 transceiver

show interface ib 1/2 transceiver

show images

Can you also change the second port to IB, do a loopback test, and check whether the link comes online?

If so, try to do a back to back test between the servers using port 2 this time as IB.

Thank you,

Sophie.

Thanks Jae-Hoon,

The cable model is MCP1600-C002; it seems that it can only support Ethernet.

MCP1600-C002 Mellanox® Passive Copper cable, ETH1 100GbE, 100Gb/s, QSFP, LSZH, 2m

I guess we need the MCP1600-E002 (Mellanox® Passive Copper cable, VPI2 , up to 100Gb/s, QSFP, LSZH, 2m) so that it can support both Ethernet and InfiniBand.
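To make the conclusion easy to reuse, here is a small Python sketch of a part-number check. The lookup table contains only the two cables quoted in this thread, with the protocol support as described above; it is not a complete catalog, and `CABLE_PROTOCOLS` and `supports` are hypothetical names chosen for illustration.

```python
# Only the two cables quoted in this thread; NOT a complete catalog.
CABLE_PROTOCOLS = {
    "MCP1600-C002": {"Ethernet"},                # described as ETH 100GbE
    "MCP1600-E002": {"Ethernet", "InfiniBand"},  # described as VPI
}


def supports(part_number, protocol):
    """Return True or False for a known part number, None if it is unknown."""
    protocols = CABLE_PROTOCOLS.get(part_number.strip().upper())
    if protocols is None:
        return None
    return protocol in protocols


print(supports("MCP1600-C002", "InfiniBand"))
print(supports("MCP1600-E002", "InfiniBand"))
```

Returning `None` for an unknown part number keeps "not in the table" distinct from "known not to support the protocol"; for anything not listed here, the vendor's cable documentation is the place to check.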