Dell M1000e blade server, InfiniBand QDR subnet issue, OFED 4.4, opensm initialization error!

I made good progress following the answers here, thank you! I created an opensm conf file as suggested, and the firmware is now updated to the latest version, 2.36.5000.

The latest MLNX OFED 4.4 had issues: it seemed to install OK, but no ib commands worked. I uninstalled it and reinstalled MLNX OFED 4.2-1.2.0.0, the last version compatible with RHEL/CentOS 7.4. Version 3.4 is incompatible with my version of CentOS 7 on Rocks Cluster 7.

I have to start opensm from a terminal. Is there a way to start it at boot, perhaps from the conf file? Another question is about the GUID: when I replace the default GUID, should I use the active port GUID or the node GUID? I tried both. My output is below, and I appreciate the help! I also notice that ib0 is not green in the output of # nmcli connection show. Is this perhaps a network issue now?

[root@headnode ~]# mlxfwmanager --online -u -d 07:00.0

Querying Mellanox devices firmware …

Device #1:


Device Type: ConnectX3

Part Number: 0J05YT_Bx

Description: MCX380A-QCAA ConnectX-3 Dual-port QDR Mezzanine I/O Card

PSID: DEL0A10210018

PCI Device Name: 07:00.0

Port1 GUID: 0002c90300f932f1

Port2 GUID: 0002c90300f932f2

Versions: Current Available

FW 2.36.5000 N/A

PXE 3.4.0718 N/A

Status: No matching image found

[root@headnode ~]# /etc/init.d/opensmd status

opensm is stopped

[root@headnode ~]# /etc/init.d/opensmd start

Starting opensmd (via systemctl): [ OK ]

[root@headnode ~]# ibstat

CA 'mlx4_0'

CA type: MT4099

Number of ports: 2

Firmware version: 2.36.5000

Hardware version: 1

Node GUID: 0x0002c90300f932f0

System image GUID: 0x0002c90300f932f3

Port 1:

State: Active

Physical state: LinkUp

Rate: 40

Base lid: 1

LMC: 0

SM lid: 1

Capability mask: 0x0251486a

Port GUID: 0x0002c90300f932f1

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f932f2

Link layer: InfiniBand

[root@headnode ~]# ibhosts

Ca : 0x0002c90300f932f0 ports 2 "headnode HCA-1"

[root@headnode ~]# hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0): 3.10.0-693.el7.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 HCA … v2.36.5000

Host Driver Initialization … PASS

Number of CA Ports Active … 1

Port State of Port #1 on CA #0 (HCA)… UP 4X QDR (InfiniBand)

Port State of Port #2 on CA #0 (HCA)… DOWN (InfiniBand)

Error Counter Check on CA #0 (HCA)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (HCA) … 00:02:c9:03:00:f9:32:f0

------------------ DONE ---------------------

[root@headnode ~]# ibv_devinfo

hca_id: mlx4_0

transport: InfiniBand (0)

fw_ver: 2.36.5000

node_guid: 0002:c903:00f9:32f0

sys_image_guid: 0002:c903:00f9:32f3

vendor_id: 0x02c9

vendor_part_id: 4099

hw_ver: 0x1

board_id: DEL0A10210018

phys_port_cnt: 2

Device ports:

port: 1

state: PORT_ACTIVE (4)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 1

port_lid: 1

port_lmc: 0x00

link_layer: InfiniBand

port: 2

state: PORT_DOWN (1)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 0

port_lid: 0

port_lmc: 0x00

link_layer: InfiniBand

[root@headnode ~]# nmcli connection show

NAME UUID TYPE DEVICE

Bridge em1 1dad842d-1912-ef5a-a43a-bc238fb267e7 bridge em1

Bridge em2 0578038a-64e9-a2fd-0a28-e4cd0b553930 bridge em2

System pem1 c19149d5-4e53-4636-b52a-81d213a8a3cb 802-3-ethernet pem1

Wired connection 1 13bddd27-08a5-45b5-bd3d-82081536eedd 802-3-ethernet pem2

virbr0 dc113ed9-ff0e-45ae-85e1-3cd724eea69f bridge virbr0

System pem2 7379072d-ea75-335e-2486-0afa3cd10c77 802-3-ethernet --

ib0 6b15b69c-4a0b-4457-9db3-183140b4cbe4 infiniband --

ib1 a1fe6e6b-9dc1-4e47-9478-2f0c7ea6b1d3 infiniband --

See section 3.2.2 of the Mellanox OFED user manual for additional details about the Subnet Manager. It is a service that can be enabled or disabled, and by default it uses the /etc/opensm/opensm.conf file.
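As a rough sketch of what enabling the service at boot could look like on a CentOS 7 head node: this assumes the opensmd init script from the output above can be registered with chkconfig, which has not been verified on Rocks 7. The run helper, the enable_opensm_at_boot function, and the DRY_RUN variable are made up for illustration so the commands can be previewed before they touch the system.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: opensmd is the SysV init script installed by MLNX_OFED,
# as seen in the /etc/init.d/opensmd calls in the original post.
enable_opensm_at_boot() {
    run chkconfig opensmd on        # register the init script for boot
    run /etc/init.d/opensmd start   # start it now as well
    run /etc/init.d/opensmd status  # confirm it is running
}
```

Running DRY_RUN=1 enable_opensm_at_boot only prints the three commands. Calling enable_opensm_at_boot as root without DRY_RUN executes them.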

Regarding the GUID: it should be the Port GUID. If you have only one connected port, the application will detect it and start the SM on it, so keep it simple.
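If you do want to set the GUID explicitly, one way to pick the right value is to filter the ibstat output for ports in the Active state. The snippet below is only a sketch: the function name active_port_guids is made up, and it assumes the ibstat layout shown in the original post, where each port has a State line followed later by a Port GUID line.

```shell
# Print the Port GUID of every port that ibstat reports as Active.
# Reads ibstat-style text on stdin, so it also works on saved output.
active_port_guids() {
    awk '
        # "State: Active" or "State: Down" (not the "Physical state:" line)
        /^[[:space:]]*State:/     { active = ($2 == "Active") }
        # "Port GUID: 0x..." belongs to the most recent State line
        /^[[:space:]]*Port GUID:/ { if (active) print $3 }
    '
}

# Usage on a live system (assumes ibstat is installed, as in the post):
#   ibstat | active_port_guids
```

With the output posted above, this should print 0x0002c90300f932f1, the Port 1 GUID. That value is what would go on the guid line of /etc/opensm/opensm.conf, or be passed to opensm with the -g option, if the automatic detection is not used.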

Thank you, the Mellanox user manual has a wealth of information on OpenSM. I’ll check the settings and create/check the log files. I’ll revert to the active port GUID.