Dell M1000e blade server, InfiniBand QDR subnet issue, OFED 4.4, opensm initialization error!

Need help, I’m running out of ideas!

I have a Dell M1000e blade chassis with M3601Q 40 Gbps Mellanox InfiniBand switches in I/O slots B1/C1, connected to the midplane on C1. I have M910 PowerEdge blades with the J05YT ConnectX-3 mezzanine card installed. I have installed the latest MLNX OFED 4.4. The OS is based on CentOS 7.4 within a Rocks Manzanita cluster. Since these are blades, the connection is via the midplane. Switch lights are steady and good.

After following prior posts and running commands such as ibhosts, ibstat, lspci | grep Mell, lspci -Qvvs 07:00.0, ifconfig -a, hca_self_test.ofed, and mstflint -d 07:00.0 q, the best I can tell is that my ports are Down/Initializing and I have a subnet manager issue. I cannot get a port Active or get an IP to show. Can you please help me diagnose? I’ll post some of the relevant output; let me know what else is required.

Thank you much!

[root@headnode /]# hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 2

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7): 3.10.0-693.el7.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 HCA … v2.10.2132

Firmware on CA #1 HCA … v2.10.2132

Host Driver Initialization … PASS

Number of CA Ports Active … 0

Port State of Port #1 on CA #0 (HCA)… DOWN (InfiniBand)

Port State of Port #2 on CA #0 (HCA)… DOWN (InfiniBand)

Port State of Port #1 on CA #1 (HCA)… INIT (InfiniBand)

Port State of Port #2 on CA #1 (HCA)… DOWN (InfiniBand)

Error Counter Check on CA #0 (HCA)… FAIL

REASON: found errors in the following counters

Errors in /sys/class/infiniband/mlx4_0/ports/1/counters

link_error_recovery: 93

symbol_error: 65535

Error Counter Check on CA #1 (HCA)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (HCA) … 00:02:c9:03:00:f9:2e:80

Node GUID on CA #1 (HCA) … 00:02:c9:03:00:f9:32:f0

------------------ DONE ---------------------

[root@headnode /]# ibhosts

Ca : 0x0002c90300f92e80 ports 2 "headnode HCA-1"

[root@headnode /]# ibstat

CA 'mlx4_0'

CA type: MT4099

Number of ports: 2

Firmware version: 2.10.2132

Hardware version: 0

Node GUID: 0x0002c90300f92e80

System image GUID: 0x0002c90300f92e83

Port 1:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f92e81

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f92e82

Link layer: InfiniBand

CA 'mlx4_1'

CA type: MT4099

Number of ports: 2

Firmware version: 2.10.2132

Hardware version: 0

Node GUID: 0x0002c90300f932f0

System image GUID: 0x0002c90300f932f3

Port 1:

State: Initializing

Physical state: LinkUp

Rate: 40

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f932f1

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f932f2

Link layer: InfiniBand

[root@headnode /]#

[root@headnode /]# mstflint -d 05:00.0 q

Image type: FS2

FW Version: 2.10.2132

Device ID: 4099

Description: Node Port1 Port2 Sys image

GUIDs: 0002c90300f92e80 0002c90300f92e81 0002c90300f92e82 0002c90300f92e83

MACs: 000000000000 000000000000

VSD:

PSID: DEL0A10210018

[root@headnode /]# lspci -Qvvs 05:00.0

05:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Subsystem: Mellanox Technologies ConnectX-3 IB QDR Dual Port Mezzanine Card

Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-

Latency: 0, Cache Line Size: 64 bytes

Interrupt: pin A routed to IRQ 34

Region 0: Memory at fb100000 (64-bit, non-prefetchable) [size=1M]

Region 2: Memory at f4800000 (64-bit, prefetchable) [size=8M]

Capabilities: [40] Power Management version 3

Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

Capabilities: [48] Vital Product Data

Product Name: DELL ConnectX-3 Mezz

Read-only fields:

[PN] Part number: 0J05YT

[EC] Engineering changes: A00

[SN] Serial number: IL0J05YT7403125S000Q

[V0] Vendor specific: DDR/QDR SFF mezz

[RV] Reserved: checksum good, 0 byte(s) reserved

Read/write fields:

[V1] Vendor specific: N/A

[YA] Asset tag: N/A

[RW] Read-write area: 107 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 253 byte(s) free

[RW] Read-write area: 252 byte(s) free

End

Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-

Vector table: BAR=0 offset=0007c000

PBA: BAR=0 offset=0007d000

Capabilities: [60] Express (v2) Endpoint, MSI 00

DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited

ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 116.000W

DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+

RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-

MaxPayload 256 bytes, MaxReadReq 512 bytes

DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

LnkCap: Port #8, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s unlimited, L1 unlimited

ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+

LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+

ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported

DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled

LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-

Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

Compliance De-emphasis: -6dB

LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-

EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)

ARICap: MFVC- ACS-, Next Function: 0

ARICtl: MFVC- ACS-, Function Group: 0

Capabilities: [148 v1] Device Serial Number 00-02-c9-03-00-f9-2e-80

Capabilities: [154 v2] Advanced Error Reporting

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-

CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-

CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+

AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-

Capabilities: [18c v1] #19

Kernel driver in use: mlx4_core

Kernel modules: mlx4_core

[root@headnode ~]# sminfo -p 1

ibwarn: [8670] _do_madrpc: recv failed: Connection timed out

ibwarn: [8670] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)

sminfo: iberror: failed: query

[root@headnode ~]# sminfo -p 2

ibwarn: [8684] _do_madrpc: recv failed: Connection timed out

ibwarn: [8684] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0)

sminfo: iberror: failed: query

[root@headnode ~]#

Opensm


****************** ERRORS DURING INITIALIZATION ******************


Sep 12 11:18:51 735239 [A3E15700] 0x01 → state_mgr_check_tbl_consistency: ERR 3322: lid 1 is wrongly assigned to port 0x0002c90300f92e81 (‘headnode HCA-1’ port 1) in port_lid_tbl

Sep 12 11:18:51 735367 [A3E15700] 0x02 → state_mgr_check_tbl_consistency: Clearing Lid for port 0x0002c90300f92e81

Sep 12 11:18:51 735375 [A3E15700] 0x01 → state_mgr_check_tbl_consistency: ERR 3322: lid 3 is wrongly assigned to port 0x0002c90300f932f1 (‘headnode HCA-2’ port 1) in port_lid_tbl

Sep 12 11:18:51 735392 [A3E15700] 0x02 → state_mgr_check_tbl_consistency: Clearing Lid for port 0x0002c90300f932f1

Sep 12 11:18:51 735430 [A3E15700] 0x01 → osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f932f1 (headnode HCA-2 port 1) has LID 0. An initialization error occurred. Ignoring port

Sep 12 11:18:51 735449 [A3E15700] 0x01 → osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f92e81 (headnode HCA-1 port 1) has LID 0. An initialization error occurred. Ignoring port

Sep 12 11:18:51 735462 [A3E15700] 0x01 → osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f92e81 (headnode HCA-1 port 1) has LID 0. An initialization error occurred. Ignoring port

Sep 12 11:18:51 735468 [A3E15700] 0x01 → osm_ucast_port_is_zero_lid: ERR 3A04: Port 0x2c90300f932f1 (headnode HCA-2 port 1) has LID 0. An initialization error occurred. Ignoring port

Sep 12 11:18:51 735480 [A3E15700] 0x02 → osm_ucast_mgr_process: minhop tables configured on all switches

Sep 12 11:18:51 740351 [A3E15700] 0x80 → Errors during initialization

Sep 12 11:18:51 740385 [A3E15700] 0x01 → do_sweep:

[root@headnode ~]# nmcli connection show ib0

connection.id: ib0

connection.uuid: 65aec7ac-2335-44aa-b9c2-0945379d8111

connection.stable-id: –

connection.interface-name: ib0

connection.type: infiniband

connection.autoconnect: yes

connection.autoconnect-priority: 0

connection.autoconnect-retries: -1 (default)

connection.timestamp: 0

connection.read-only: no

connection.permissions: –

connection.zone: –

connection.master: –

connection.slave-type: –

connection.autoconnect-slaves: -1 (default)

connection.secondaries: –

connection.gateway-ping-timeout: 0

connection.metered: unknown

connection.lldp: -1 (default)

ipv4.method: auto

ipv4.dns: –

ipv4.dns-search: –

ipv4.dns-options: (default)

ipv4.dns-priority: 0

ipv4.addresses: –

ipv4.gateway: –

ipv4.routes: –

ipv4.route-metric: -1

ipv4.ignore-auto-routes: no

ipv4.ignore-auto-dns: no

ipv4.dhcp-client-id: –

ipv4.dhcp-timeout: 0

ipv4.dhcp-send-hostname: yes

ipv4.dhcp-hostname: –

ipv4.dhcp-fqdn: –

ipv4.never-default: yes

ipv4.may-fail: yes

ipv4.dad-timeout: -1 (default)

ipv6.method: link-local

ipv6.dns: –

ipv6.dns-search: –

ipv6.dns-options: (default)

ipv6.dns-priority: 0

ipv6.addresses: –

ipv6.gateway: –

ipv6.routes: –

ipv6.route-metric: -1

ipv6.ignore-auto-routes: no

ipv6.ignore-auto-dns: no

ipv6.never-default: no

ipv6.may-fail: yes

ipv6.ip6-privacy: 0 (disabled)

ipv6.addr-gen-mode: stable-privacy

ipv6.dhcp-send-hostname: yes

ipv6.dhcp-hostname: –

ipv6.token: –

infiniband.mac-address: 80:00:02:08:FE:80:00:00:00:00:00:00:00:02:C9:03:00:F9:32:F1

infiniband.mtu: auto

infiniband.transport-mode: connected

infiniband.p-key: default

infiniband.parent: –

proxy.method: none

proxy.browser-only: no

proxy.pac-url: –

proxy.pac-script: –

Re: Dell M1000e blade server, InfiniBand QDR subnet issue, OFED 4.4, opensm initialization error!

Thank you, appreciate the help! I’ll work on this today and report back. I’m using the older InfiniBand QDR switch, the M3601Q, rather than the newer M4001Q. I do know the standard firmware update failed when I was installing OFED; I had to do a force install.

I made good progress following the answers here! Thank you. I created an opensm.conf file as suggested. The firmware is now updated to the latest, 2.36.5000.

The latest MLNX OFED 4.4 had issues: it seemed to install OK, but no ib commands worked. I uninstalled it and reinstalled MLNX OFED 4.2-1.2.0.0, the last version compatible with RHEL/CentOS 7.4. Version 3.4 is incompatible with my version of CentOS 7 on Rocks Cluster 7.

I have to start opensm from the terminal. Is there a way to start it on boot, perhaps from the conf file? Another question is regarding the GUID: when I replace the default GUID, should I use the active port GUID or the node GUID? I tried both. My output is below; I appreciate the help! I also notice ib0 is not green in the output of # nmcli connection show. Is this perhaps a network issue now?

[root@headnode ~]# mlxfwmanager --online -u -d 07:00.0

Querying Mellanox devices firmware …

Device #1:


Device Type: ConnectX3

Part Number: 0J05YT_Bx

Description: MCX380A-QCAA ConnectX-3 Dual-port QDR Mezzanine I/O Card

PSID: DEL0A10210018

PCI Device Name: 07:00.0

Port1 GUID: 0002c90300f932f1

Port2 GUID: 0002c90300f932f2

Versions: Current Available

FW 2.36.5000 N/A

PXE 3.4.0718 N/A

Status: No matching image found

[root@headnode ~]# /etc/init.d/opensmd status

opensm is stopped

[root@headnode ~]# /etc/init.d/opensmd start

Starting opensmd (via systemctl): [ OK ]

[root@headnode ~]# ibstat

CA 'mlx4_0'

CA type: MT4099

Number of ports: 2

Firmware version: 2.36.5000

Hardware version: 1

Node GUID: 0x0002c90300f932f0

System image GUID: 0x0002c90300f932f3

Port 1:

State: Active

Physical state: LinkUp

Rate: 40

Base lid: 1

LMC: 0

SM lid: 1

Capability mask: 0x0251486a

Port GUID: 0x0002c90300f932f1

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x02514868

Port GUID: 0x0002c90300f932f2

Link layer: InfiniBand

[root@headnode ~]# ibhosts

Ca : 0x0002c90300f932f0 ports 2 "headnode HCA-1"

[root@headnode ~]# hca_self_test.ofed

---- Performing Adapter Device Self Test ----

Number of CAs Detected … 1

PCI Device Check … PASS

Kernel Arch … x86_64

Host Driver Version … MLNX_OFED_LINUX-4.2-1.2.0.0 (OFED-4.2-1.2.0): 3.10.0-693.el7.x86_64

Host Driver RPM Check … PASS

Firmware on CA #0 HCA … v2.36.5000

Host Driver Initialization … PASS

Number of CA Ports Active … 1

Port State of Port #1 on CA #0 (HCA)… UP 4X QDR (InfiniBand)

Port State of Port #2 on CA #0 (HCA)… DOWN (InfiniBand)

Error Counter Check on CA #0 (HCA)… PASS

Kernel Syslog Check … PASS

Node GUID on CA #0 (HCA) … 00:02:c9:03:00:f9:32:f0

------------------ DONE ---------------------

[root@headnode ~]# ibv_devinfo

hca_id: mlx4_0

transport: InfiniBand (0)

fw_ver: 2.36.5000

node_guid: 0002:c903:00f9:32f0

sys_image_guid: 0002:c903:00f9:32f3

vendor_id: 0x02c9

vendor_part_id: 4099

hw_ver: 0x1

board_id: DEL0A10210018

phys_port_cnt: 2

Device ports:

port: 1

state: PORT_ACTIVE (4)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 1

port_lid: 1

port_lmc: 0x00

link_layer: InfiniBand

port: 2

state: PORT_DOWN (1)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 0

port_lid: 0

port_lmc: 0x00

link_layer: InfiniBand

[root@headnode ~]# nmcli connection show

NAME UUID TYPE DEVICE

Bridge em1 1dad842d-1912-ef5a-a43a-bc238fb267e7 bridge em1

Bridge em2 0578038a-64e9-a2fd-0a28-e4cd0b553930 bridge em2

System pem1 c19149d5-4e53-4636-b52a-81d213a8a3cb 802-3-ethernet pem1

Wired connection 1 13bddd27-08a5-45b5-bd3d-82081536eedd 802-3-ethernet pem2

virbr0 dc113ed9-ff0e-45ae-85e1-3cd724eea69f bridge virbr0

System pem2 7379072d-ea75-335e-2486-0afa3cd10c77 802-3-ethernet –

ib0 6b15b69c-4a0b-4457-9db3-183140b4cbe4 infiniband –

ib1 a1fe6e6b-9dc1-4e47-9478-2f0c7ea6b1d3 infiniband –

Re: Dell M1000e blade server, InfiniBand QDR subnet issue, OFED 4.4, opensm initialization error!

Makes sense, thank you. I have downloaded everything I might need. The PSID DEL0A10210018 doesn’t have a match since the switch is older, so I picked a couple to try. The firmware link for the M3601Q switch leads to the newest OEM link you shared as well.

I’ll give it all a good shot and report through weekend. Hoping this does the magic!

Re: Dell M1000e blade server, InfiniBand QDR subnet issue, OFED 4.4, opensm initialization error!

Well, it was one busy weekend of troubleshooting and a lot of work. I may have solved a few issues, but it is not perfect yet!

The OEM updates (I tried a few) would not work because of a PSID mismatch. If there is a workaround, please let me know. I’m not able to find any firmware online for the PSID of the M3601Q switch.

[root@headnode Infini Switch firmware]# ls

fw-sx-9_2_8000-0269NG_B1.bin

[root@headnode Infini Switch firmware]# lspci | grep Mellanox

07:00.0 Infiniband controller: Mellanox Technologies MT27500 Family [ConnectX-3]

[root@headnode Infini Switch firmware]# mstflint -d 07:00.0 -i fw-sx-9_2_8000-0269NG_B1.bin b

Current FW version on flash: 2.10.2132

New FW version: 9.2.8000

-E- PSID mismatch. The PSID on flash (DEL0A10210018) differs from the PSID in the given image (DEL09E0210003).

[root@headnode Infini Switch firmware]#

I tried forcing the GUID on the command line as suggested, as I don’t have an opensm.conf file anywhere.

Then I went ahead and uninstalled Mellanox OFED and started over with OpenFabrics OFED. There were a few missing-dependency errors (cmake, libnl3-devel, numactl-devel, devel-grind); after getting those RPMs and their dependencies sorted, it did install. The port GUID was recognized and InfiniBand is active. DHCP didn’t work, so I set the address manually; it may not be perfect yet. The lingering issues are now OFED related: I can’t seem to get opensm to start automatically, so it has to be started with #/etc/init.d/opensmd start. After starting it, ibv_devinfo and nmcli connection show give:

[root@headnode ~]# ibv_devinfo

hca_id: mlx4_0

transport: InfiniBand (0)

fw_ver: 2.10.2132

node_guid: 0002:c903:00f9:32f0

sys_image_guid: 0002:c903:00f9:32f3

vendor_id: 0x02c9

vendor_part_id: 4099

hw_ver: 0x0

board_id: DEL0A10210018

phys_port_cnt: 2

port: 1

state: PORT_ACTIVE (4)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 1

port_lid: 1

port_lmc: 0x00

link_layer: InfiniBand

port: 2

state: PORT_DOWN (1)

max_mtu: 4096 (5)

active_mtu: 4096 (5)

sm_lid: 0

port_lid: 0

port_lmc: 0x00

link_layer: InfiniBand

[root@headnode ~]# nmcli connection show

NAME UUID TYPE DEVICE

Wired connection 2 a40b3b41-66e7-3d87-a77c-e79ccd002698 802-3-ethernet em1

Wired connection 3 7b5a96ce-3df4-3534-8a35-b430f3f1e3e5 802-3-ethernet em2

ib0 b4fdfa83-45ba-4904-a8ec-377234b898ee infiniband ib0

virbr0 d36acaba-3663-4199-ae03-0b2a39aa75df bridge virbr0

Bridge em1 1dad842d-1912-ef5a-a43a-bc238fb267e7 bridge –

Bridge em2 0578038a-64e9-a2fd-0a28-e4cd0b553930 bridge –

System ib0 2ab4abde-b8a5-6cbc-19b1-2bfb193e4e89 infiniband –

System pem1 c19149d5-4e53-4636-b52a-81d213a8a3cb 802-3-ethernet –

System pem2 7379072d-ea75-335e-2486-0afa3cd10c77 802-3-ethernet –

Wired connection 1 d4070b38-e850-4a48-83a7-223ecca993f7 802-3-ethernet –

ib0 4e22b1f1-3e0c-4b84-b0d9-85b0755728ac infiniband –

ib0 152321c5-8ba1-4865-9eca-5a18a889ffb7 infiniband –

ib1 9fd439a6-da5e-4928-9265-47a636b3aaea infiniband –

#ifconfig -a ib0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65520

inet 10.1.27.7 netmask 255.0.0.0 broadcast 10.1.77.77

inet6 fe80::202:c903:f9:32f1 prefixlen 64 scopeid 0x20

Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).

infiniband 80:00:02:08:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand)

RX packets 0 bytes 0 (0.0 B)

RX errors 0 dropped 0 overruns 0 frame 0

TX packets 289 bytes 19652 (19.1 KiB)

TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Next steps: I’m waiting on two Dell flash SD cards for the CMC so I can get all drivers updated on the chassis and nodes. It is a lot slower through UEFI, and some drivers are too big anyway. Hopefully the I/O update will help! Next, I may do a fresh install of Rocks Cluster 7 (Manzanita) and try prior versions of Mellanox OFED such as 4.1 or 3.xx. I can come back to OFED as well.

Issues persisting: The OFED commands ibstat, ibhosts, etc. do not work, perhaps a failure on the OFED side. ib0 still shows a hardware error, perhaps a firmware issue. The HCA self-test command does not work, but things seem good since the port is active. I have a separate issue with the Rocks Cluster command “insert-ethers” not responding when I try to connect the switch and compute nodes, hence the reinstall.

Sorry, this seems like a mess; thank you for your time! I know I’ll get around it one way or the other, and I may even have to buy a newer M4001 switch that has current drivers. I wonder if Mellanox would share archived M3601Q firmware?

Thank you! I totally missed checking the adapters! I did find the correct file for the switch PSID. I also appreciate the explanation of the opensm config. I believe my flash cards for the CMC will be delivered tomorrow. I will start with the updates (BIOS, I/O, etc.) first, install the Rocks frontend, and then tackle the MOFED install (3.4 works). I think this will get things moving!

Thank you, the Mellanox user manual has a wealth of information on OpenSM. I’ll check the settings and create/check the log files. I’ll revert to the active port GUID.

Hi Joel,

When running an OEM HCA, using Mellanox firmware is not supported, and Mellanox OFED has no firmware images for OEM cards. Hopefully you didn’t burn it; that’s why there is a link to the Dell archive.

Check whether using the latest firmware for your device, 2.36.5000, helps. 2.10.XXXX is extremely outdated. http://www.mellanox.com/page/firmware_table_dell_archive

As an additional step, after installing the 2.36 firmware, check whether using MOFED 4.0 (or even 3.4, or the inbox InfiniBand package) makes the issue go away. Try to explicitly specify the GUID in the opensm.conf configuration file or on the command line (opensm --guid )
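To make the GUID suggestion concrete, here is a minimal sketch of pulling the port GUID of a linked-up port out of ibstat output so it can be passed to opensm. The helper name linked_port_guids is made up for illustration, and it assumes the ibstat field layout shown earlier in this thread ("Physical state:" followed later by "Port GUID:" for each port).

```shell
#!/bin/sh
# Hypothetical helper (not part of OFED): print the Port GUID of every port
# whose "Physical state" is LinkUp. Reads ibstat-style text on stdin, so it
# can also be tried on saved output.
linked_port_guids() {
    awk '
        /^[ \t]*Physical state:/ { up = ($3 == "LinkUp") }   # remember link state of the current port
        /^[ \t]*Port GUID:/      { if (up) print $3 }        # emit GUID only for linked ports
    '
}

# Possible usage on the head node (commented out; needs the IB tools installed):
#   guid=$(ibstat | linked_port_guids | head -n 1)
#   opensm --guid "$guid"
```

On the first ibstat output in this thread, this should pick mlx4_1 port 1 (0x0002c90300f932f1), the only port reporting LinkUp.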

http://www.mellanox.com/page/firmware_table_dell_archive

Look under adapters for your PSID; 2.36.5000 is the latest. The MOFED version that should work with it is v3.4. The latest may work too, but it requires much newer firmware.
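Before burning anything, it may help to compare the PSID on the card with the PSID inside the downloaded image, since a mismatch is exactly what mstflint rejects. A rough sketch follows; the helper names psid_from_query and psid_matches are invented for illustration, and fw.bin stands in for whatever image file you downloaded.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mstflint).

# Extract the PSID value from `mstflint ... q` output read on stdin.
psid_from_query() {
    awk '$1 == "PSID:" { print $2; exit }'
}

# Succeed only when both PSIDs are non-empty and identical.
psid_matches() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Possible usage (commented out; needs mstflint and a real image file):
#   flash=$(mstflint -d 07:00.0 q | psid_from_query)
#   image=$(mstflint -i fw.bin q | psid_from_query)
#   psid_matches "$flash" "$image" && echo "PSID matches, image is for this card"
```

If the two values differ, the image is for a different board, and the safer route is to find the image listed under your own PSID rather than forcing the burn.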

You can generate an opensm.conf file by changing to the /etc/opensm directory and executing

#opensm -c opensm.conf

The default configuration file has the same options that opensm uses when it runs with no configuration file. At the beginning of the file you’ll find a ‘guid’ option
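As a sketch of editing that option without opening an editor: the function below rewrites the guid line in place and keeps a backup. The name set_opensm_guid is made up, and it assumes the generated file contains a line starting with "guid " followed by the value; check your own file first.

```shell
#!/bin/sh
# Hypothetical helper: set the "guid" option in an opensm.conf-style file.
# $1 = path to the config file, $2 = port GUID in hex (e.g. 0x0002c90300f932f1).
set_opensm_guid() {
    conf=$1
    guid=$2
    cp "$conf" "$conf.bak" || return 1                      # keep a backup of the original
    sed "s/^guid .*/guid $guid/" "$conf.bak" > "$conf"      # replace only the guid line
}

# Possible usage (commented out; path taken from the post above):
#   set_opensm_guid /etc/opensm/opensm.conf 0x0002c90300f932f1
```

Use the port GUID of the port that is physically linked up, then restart opensm so it rereads the file.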

Makes sense, thank you. I have downloaded everything I might need. The PSID DEL0A10210018 doesn’t have a match since the switch is older, so I picked a couple to try. The firmware link for the M3601Q switch, http://www.mellanox.com/page/firmware_burning_dell/#IS4_FW_burn, leads to the newest OEM link you shared as well.

I’ll give it all a good shot and report through weekend. Hoping this does the magic!
