Hello, thanks for your replies! Sorry, I forgot about this topic.
Here is some IB adapter info for one of the problematic nodes; the other nodes report the same (4 are problematic with identical symptoms, the other 6 are good).
-bash-4.1$ lspci | grep Mellanox
02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
-bash-4.1$ ibstat
CA 'mlx4_0'
CA type: MT26428
Number of ports: 1
Firmware version: 2.9.1000
Hardware version: b0
Node GUID: 0x0010dc56000030bc
System image GUID: 0x0010dc56000030bf
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 12
LMC: 0
SM lid: 13
Capability mask: 0x02510868
Port GUID: 0x0010dc56000030bd
Link layer: InfiniBand
-bash-4.1$ ibportstate 13 1
CA PortInfo:
Port info: Lid 13 port 1
LinkState:…Active
PhysLinkState:…LinkUp
Lid:…13
SMLid:…13
LMC:…0
LinkWidthSupported:…4X (IBA extension)
LinkWidthEnabled:…4X
LinkWidthActive:…4X
LinkSpeedSupported:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:…2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:…10.0 Gbps
LinkSpeedExtSupported:…14.0625 Gbps
LinkSpeedExtEnabled:…14.0625 Gbps
LinkSpeedExtActive:…No Extended Speed
Extended Port info: Lid 13 port 1
StateChangeEnable:…0x00
LinkSpeedSupported:…0x01
LinkSpeedEnabled:…0x01
LinkSpeedActive:…0x00
Driver version from ofed_info -s: MLNX_OFED_LINUX-1.5.3-3.1.0
We tried another driver and an external InfiniBand card (with somewhat older firmware); the result is the same.
Fresh errors from last night: the calculation crashed at around 4:00.
Here is the output from the script I had running; it captures exactly the moment the error appeared:
[=] Checking for errors, control period is [60 s], started at [03:49:25 18/09/2019]
(Here is the normal situation)
[!] Errors reported for the switch port [18] connected to node [master]
PortRcvSwitchRelayErrors:…42
[!] Errors reported for the entire switch (summary)
PortRcvSwitchRelayErrors:…42
[=] Checking for errors, control period is [60 s], started at [04:03:40 18/09/2019]
(Here is an error)
[!] Errors reported for the switch port [1] connected to node [n01]
SymbolErrorCounter:…65535
LinkDownedCounter:…1
PortXmitDiscards:…1
[!] Errors reported for the switch port [2] connected to node [n02]
PortRcvSwitchRelayErrors:…7
[!] Errors reported for the switch port [18] connected to node [master]
PortRcvSwitchRelayErrors:…48
[!] Errors reported for the entire switch (summary)
SymbolErrorCounter:…65535
LinkDownedCounter:…1
PortRcvSwitchRelayErrors:…55
PortXmitDiscards:…1
Please note that the PortRcvSwitchRelayErrors on master are normal: master is not in the Torque calculation queue, so they don't affect the calculation. Node n01 is problematic; n02 had not been associated with any errors until now (and when only good nodes take part in a calculation, no errors are displayed except those for master).
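In case it helps anyone reading these logs, here is a rough Python sketch of how a report like the one above could be post-processed: it groups counters by node, skips master, and marks counters that have hit their maximum. It assumes the report format is exactly as pasted; the names `parse_report` and `flag` are made up for illustration and are not part of my script. The counter widths (16 bits for SymbolErrorCounter, PortRcvSwitchRelayErrors and PortXmitDiscards, 8 bits for LinkDownedCounter) are those of the classic PortCounters attribute, so a SymbolErrorCounter of 65535 means the counter is saturated and the real number of symbol errors may be higher.

```python
import re

# Maximum values of the classic (non-extended) PortCounters fields.
SATURATION = {
    "SymbolErrorCounter": 0xFFFF,
    "LinkDownedCounter": 0xFF,
    "PortRcvSwitchRelayErrors": 0xFFFF,
    "PortXmitDiscards": 0xFFFF,
}

# Assumed line formats, taken from the pasted report.
HEADER = re.compile(
    r"^\[!\] Errors reported for the switch port \[(\d+)\] connected to node \[([^\]]+)\]"
)
COUNTER = re.compile(r"^(\w+):[.\u2026\s]*(\d+)\s*$")

SAMPLE = """\
[=] Checking for errors, control period is [60 s], started at [04:03:40 18/09/2019]
[!] Errors reported for the switch port [1] connected to node [n01]
SymbolErrorCounter:…65535
LinkDownedCounter:…1
PortXmitDiscards:…1
[!] Errors reported for the switch port [2] connected to node [n02]
PortRcvSwitchRelayErrors:…7
[!] Errors reported for the switch port [18] connected to node [master]
PortRcvSwitchRelayErrors:…48
[!] Errors reported for the entire switch (summary)
SymbolErrorCounter:…65535
LinkDownedCounter:…1
PortRcvSwitchRelayErrors:…55
PortXmitDiscards:…1
"""


def parse_report(text):
    """Return {node: {counter: value}}; the whole-switch summary is skipped."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(2)
            nodes.setdefault(current, {})
            continue
        if line.startswith("["):
            # Any other bracketed line (period start, summary) ends the port section.
            current = None
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            nodes[current][counter.group(1)] = int(counter.group(2))
    return nodes


def flag(nodes, ignore=("master",)):
    """List (node, counter, value, saturated) for every node not in `ignore`."""
    findings = []
    for node, counters in nodes.items():
        if node in ignore:
            continue
        for name, value in counters.items():
            limit = SATURATION.get(name)
            saturated = limit is not None and value >= limit
            findings.append((node, name, value, saturated))
    return findings


findings = flag(parse_report(SAMPLE))
for node, name, value, saturated in findings:
    note = " (saturated, real count may be higher)" if saturated else ""
    print(f"{node}: {name} = {value}{note}")
```

Ignoring master by default only reflects what I wrote above about its relay errors being expected; pass `ignore=()` to see every port.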
================================
Sophie, I think it's very unlikely that we have a support contract. We bought the cluster, with all InfiniBand chips on-board, from an HPC developer company in our country, not as discrete controllers directly from Mellanox.