Need help, I’m running out of ideas!
I have a Dell M1000e blade chassis with M3601Q 40 Gbps Mellanox InfiniBand switches in I/O slots B1/C1, connecting to the midplane on C1. The blades are PowerEdge M910s with J05yt ConnectX-3 mezzanine cards installed. I have installed the latest MLNX_OFED 4.4. The OS is based on CentOS 7.4 within a Rocks Manzanita cluster. Since these are blades, the connection goes through the midplane. The switch lights are steady and look good.
Following prior posts, I ran ibhosts, ibstat, lspci | grep Mell, lspci -Qvvs 07:00.0, ifconfig -a, hca_self_test.ofed, and mstflint -d 07:00.0 q. As best I can tell, my ports are either Down or stuck in Initializing, and I have a subnet manager issue. I cannot get a port to go Active or get an IP address to show up. Can you please help me diagnose this? I'm posting some of the relevant output below; let me know what else is required.
Thank you very much!
[root@headnode /]# hca_self_test.ofed
---- Performing Adapter Device Self Test ----
Number of CAs Detected … 2
PCI Device Check … PASS
Kernel Arch … x86_64
Host Driver Version … MLNX_OFED_LINUX-4.4-2.0.7.0 (OFED-4.4-2.0.7): 3.10.0-693.el7.x86_64
Host Driver RPM Check … PASS
Firmware on CA #0 HCA … v2.10.2132
Firmware on CA #1 HCA … v2.10.2132
Host Driver Initialization … PASS
Number of CA Ports Active … 0
Port State of Port #1 on CA #0 (HCA)… DOWN (InfiniBand)
Port State of Port #2 on CA #0 (HCA)… DOWN (InfiniBand)
Port State of Port #1 on CA #1 (HCA)… INIT (InfiniBand)
Port State of Port #2 on CA #1 (HCA)… DOWN (InfiniBand)
Error Counter Check on CA #0 (HCA)… FAIL
REASON: found errors in the following counters
Errors in /sys/class/infiniband/mlx4_0/ports/1/counters
link_error_recovery: 93
symbol_error: 65535
Error Counter Check on CA #1 (HCA)… PASS
Kernel Syslog Check … PASS
Node GUID on CA #0 (HCA) … 00:02:c9:03:00:f9:2e:80
Node GUID on CA #1 (HCA) … 00:02:c9:03:00:f9:32:f0
------------------ DONE ---------------------
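The counter failure above points at /sys/class/infiniband/mlx4_0/ports/1/counters, so here is a rough sketch of how those files can be read directly to double-check the values. The function name show_nonzero_counters is just something made up for this post, not an OFED tool. I believe symbol_error is a 16-bit counter, so 65535 would simply mean it has saturated, but that is an assumption on my part.

```shell
# Hypothetical helper (not part of MLNX_OFED): print every non-zero
# counter found in a port's sysfs counters directory.
# The default path is the one reported by hca_self_test.ofed above.
show_nonzero_counters() {
  dir="${1:-/sys/class/infiniband/mlx4_0/ports/1/counters}"
  for f in "$dir"/*; do
    # Skip anything that is not a readable regular file.
    [ -f "$f" ] && [ -r "$f" ] || continue
    val=$(cat "$f")
    if [ "$val" != "0" ]; then
      echo "$(basename "$f"): $val"
    fi
  done
  return 0
}

# Usage on the affected host:
#   show_nonzero_counters
#   show_nonzero_counters /sys/class/infiniband/mlx4_1/ports/1/counters
```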
[root@headnode /]# ibhosts
Ca : 0x0002c90300f92e80 ports 2 "headnode HCA-1"
[root@headnode /]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.2132
Hardware version: 0
Node GUID: 0x0002c90300f92e80
System image GUID: 0x0002c90300f92e83
Port 1:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f92e81
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f92e82
Link layer: InfiniBand
CA 'mlx4_1'
CA type: MT4099
Number of ports: 2
Firmware version: 2.10.2132
Hardware version: 0
Node GUID: 0x0002c90300f932f0
System image GUID: 0x0002c90300f932f3
Port 1:
State: Initializing
Physical state: LinkUp
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f932f1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0x0002c90300f932f2
Link layer: InfiniBand
[root@headnode /]#
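To make the ibstat output above easier to scan, it can be condensed to one line per port with a small awk filter. This is only a sketch; summarize_ibstat is a made-up name for this post and it assumes the field layout shown above. A port that reports Physical state LinkUp but stays in Initializing with SM lid 0, like port 1 on mlx4_1, usually means the link is up but no subnet manager has configured it yet, which is why I suspect the subnet manager.

```shell
# Hypothetical helper (not part of MLNX_OFED): read `ibstat` output on
# stdin and print one summary line per port.
summarize_ibstat() {
  awk '
    # "CA <name>" starts a new adapter; skip the "CA type:" line.
    $1 == "CA" && $2 != "type:" { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
    # "Port N:" starts a new port; "Port GUID:" does not match.
    $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:/, "", port) }
    $1 == "State:" { state = $2 }
    $1 == "Physical" && $2 == "state:" { phys = $3 }
    # "SM lid:" comes after both state lines, so print the summary here.
    $1 == "SM" && $2 == "lid:" {
      print ca, "port", port, "state=" state, "phys=" phys, "sm_lid=" $3
    }
  '
}

# Usage on the affected host:
#   ibstat | summarize_ibstat
```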