I have two servers, each with a ConnectX-4 VPI 100Gb NIC installed (model: CX456A, two ports). The two ports are connected back to back using two copper cables. I have no problem when the two ports are set to Ethernet mode; the performance is quite close to 100Gb/s. To try InfiniBand mode, I switched port 1 to InfiniBand mode and restarted the servers.
ibv_devinfo shows the following:
…
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.14.2036
node_guid: 7cfe:9003:0032:797a
sys_image_guid: 7cfe:9003:0032:797a
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2190110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 65535
port_lmc: 0x00
link_layer: InfiniBand
…
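For anyone who wants to check the same fields from a script, here is a minimal sketch that reads text in the format shown above and lists what looks wrong for each port. The function names (`parse_ports`, `port_problems`) and the embedded sample are my own for illustration, not part of any Mellanox tool, and the checks only cover the three fields that stand out in my output: the port state, `sm_lid`, and `port_lid`.

```python
# Hypothetical helper: parse "key: value" lines from ibv_devinfo-style text.
# SAMPLE is copied from the port section of the output above.
SAMPLE = """\
Device ports:
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 65535
port_lmc: 0x00
link_layer: InfiniBand
"""


def parse_ports(text):
    """Return one dict per 'port:' block, holding that block's fields."""
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # skip elisions and blank lines
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "port":
            current = {"port": int(value)}
            ports.append(current)
        elif current is not None:
            current[key] = value  # fields before the first port are ignored
    return ports


def port_problems(port):
    """List human-readable observations for one parsed port."""
    problems = []
    state = port.get("state", "")
    if not state.startswith("PORT_ACTIVE"):
        problems.append(f"state is {state or 'unknown'}, not PORT_ACTIVE")
    if port.get("sm_lid") == "0":
        problems.append("sm_lid is 0 (no subnet manager recorded)")
    if port.get("port_lid") == "65535":
        problems.append("port_lid is 65535 (0xFFFF), no LID assigned")
    return problems


if __name__ == "__main__":
    for p in parse_ports(SAMPLE):
        for msg in port_problems(p):
            print(f"port {p['port']}: {msg}")
```

On the sample above this flags all three fields, which matches what I see by eye: the port never leaves PORT_DOWN, so no LID is assigned.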
Then I started the opensm daemon (service opensmd start) on one of the servers, but opensm seems to have a problem setting the LID of my card:
Mar 09 15:06:48 031794 [1D22700] 0x03 → OpenSM 4.6.1.MLNX20160112.774e977
Mar 09 15:06:48 031842 [1D22700] 0x80 → OpenSM 4.6.1.MLNX20160112.774e977
Mar 09 15:06:48 032470 [1D22700] 0x02 → osm_vendor_init: 1000 pending umads specified
Mar 09 15:06:48 032516 [1D22700] 0x02 → osm_vendor_init: 1000 pending umads specified
Mar 09 15:06:48 051285 [1D22700] 0x80 → Entering DISCOVERING state
Mar 09 15:06:48 051416 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x7cfe90030032797a
Mar 09 15:06:48 086916 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x7cfe90030032797a
Mar 09 15:06:48 121806 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x7cfe90030032797a
Mar 09 15:06:48 121939 [1D22700] 0x02 → osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x7cfe90030032797a
Mar 09 15:06:48 122094 [1D22700] 0x02 → osm_opensm_bind: Setting IS_SM on port 0x7cfe90030032797a
Mar 09 15:06:48 123326 [FF0F1700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:06:48 123690 [EE6D0700] 0x80 → SM port is down
Mar 09 15:06:58 052236 [FE0EF700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:08 052293 [FC0EB700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:18 052465 [FB8EA700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:28 052535 [F88E4700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:38 052566 [FF8F2700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:48 052771 [FE8F0700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:07:58 052805 [FC8EC700] 0x01 → pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:08:08 125373 [1D22700] 0x80 → Exiting SM
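To make the log easier to scan, here is a small sketch that counts the `ERR` codes per reporting function and notes whether the "SM port is down" message appears. It assumes log lines shaped like the ones above; the `summarize` function and the regular expression are my own illustration, not part of OpenSM.

```python
import re
from collections import Counter

# Matches e.g. "pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID ..."
ERR_RE = re.compile(r"(\w+): ERR ([0-9A-Fa-f]{4}):")


def summarize(log_text):
    """Return (Counter keyed by (function, error code), saw_port_down)."""
    errors = Counter()
    port_down = False
    for line in log_text.splitlines():
        match = ERR_RE.search(line)
        if match:
            errors[(match.group(1), match.group(2))] += 1
        if "SM port is down" in line:
            port_down = True
    return errors, port_down


if __name__ == "__main__":
    import sys

    # Usage: python summarize_opensm.py < opensm.log
    errors, port_down = summarize(sys.stdin.read())
    for (func, code), count in errors.most_common():
        print(f"{func} ERR {code}: {count} occurrence(s)")
    print("SM port reported down:", port_down)
```

For the log above, the only error is 0F04 from pi_rcv_check_and_fix_lid, repeated roughly every ten seconds, together with "SM port is down".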
I tried this several times and the result is always the same. I searched around but couldn't find any useful information. Could you please give me a hint about what else I should do to find the cause?
Thank you so much!