Hello,
We have a bunch of old hardware that still needs to keep running (a common scenario ;)
Last week the HPC cluster was rebuilt, i.e. all machines came out of the rack, some old ones were scrapped, and some old ones were reused in the rack.
Here are the main facts in brief; I’m eager to hear your tips and hints:
Before:
A compute node with an MT25408A0 adapter was connected to an SX6012 IB switch. This worked fine.
After:
We now have a 40-port HDR switch (MQM8700), and the MT25408A0 device remains in physical state ‘Polling’.
Technical Details:
[root@w6 ~]# lsb_release -d
Description: CentOS Linux release 7.9.2009 (Core)
[root@w6 ~]# rpm -qf /usr/sbin/ibstatus
infiniband-diags-2.1.0-1.el7.x86_64
[root@w6 ~]# lspci -v
…
06:00.0 InfiniBand: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s In… (rev b0)
Subsystem: Mellanox Technologies MT25408A0-FCC-QI ConnectX, Dual Port 40Gb/s InfiniBand / 10GigE Adapter IC with PCIe 2.0 x8 5.0GT/s Interface
Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
Memory at cf300000 (64-bit, non-prefetchable) [size=1M]
Memory at c2800000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [100] Alternative Routing-ID Interpretation (ARI)
Capabilities: [148] Device Serial Number 00-02-c9-03-00-28-86-f6
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
…
[root@w6 ~]# ibstatus | egrep 'device|state|rate'; ibstat | egrep 'CA|Firmware|Hardware|State|Rate'
Infiniband device 'mlx4_0' port 1 status:
state: 1: DOWN
phys state: 2: Polling
rate: 10 Gb/sec (4X)
CA 'mlx4_0'
CA type: MT26428
Firmware version: 2.9.1000
Hardware version: b0
State: Down
Rate: 10
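In case it helps anyone checking several nodes at once, here is a rough sketch of a filter that lists ports stuck in Polling. It assumes the `ibstatus` output format shown above; `polling_ports` is just a hypothetical helper name, not part of infiniband-diags.

```shell
# Hypothetical helper: read ibstatus-style text on stdin and print
# every device/port whose physical state is "Polling".
polling_ports() {
    awk '
        # Header line, e.g.: Infiniband device mlx4_0 port 1 status:
        /^Infiniband device/ {
            dev = $3
            gsub(/[^A-Za-z0-9_]/, "", dev)   # strip the quotes around the name
            port = $5
        }
        # Physical-state line, e.g.: phys state: 2: Polling
        /phys state:/ && /Polling/ {
            print dev " port " port
        }
    '
}
```

Usage would be something like `ibstatus | polling_ports`, which prints one line per affected port and nothing if no port is in Polling.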
If we connect the node to the old SX6012 switch, which is in turn connected to the new switch, it works. But we want to get rid of the old switch.
So how could we proceed?
Best regards
Joe