Hello. We are having problems with our old HPC cluster.

The cluster runs CentOS 6.2 and is equipped with an InfiniBand switch and ConnectX adapters. The network is very simple: a number of nodes, each connected to the switch with an optical cable. At first sight everything works fine, but at some moment an error occurs and the calculation fails; at that moment the "Link downed counter" becomes 1 for one of the nodes. With no InfiniBand load, or with only simulated load (a prolonged ibv_rc_pingpong run), there are no errors (I have a script to monitor them). Errors appear only on particular nodes. I checked everything I could with standard software like ibdiagnet. We tried changing cables and switch ports and swapped node HDDs, with no effect. A memory test (memtest86+) shows no errors. I bought a similar discrete PCIe ConnectX card (the nodes have on-board IB chips with the same IDs in ibportstate), we connected it to one of the problematic nodes, and the same error appeared after several hours of calculation.
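
(For reference, the simulated load was just a pair of long ibv_rc_pingpong runs between two nodes, roughly like the following; the message size and iteration count here are only illustrative.)

# on one node (server side):
ibv_rc_pingpong -s 1048576 -n 1000000
# on the node under test (client side), pointing at the server node:
ibv_rc_pingpong -s 1048576 -n 1000000 <server_hostname>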

How can I track down the error? As I understand it, transmission errors occur at the port, so the switch brings the link down. But how can I find the root cause? Thanks.

Hi Andrew,

A few questions:

What exact HCA type are you using (e.g. ConnectX, -2, -3, -4, -5, -6)?

What is the FW version?

What is the driver release (i.e. ofed_info -s)?

What error or errors are you referring to (e.g. ibdiagnet port counters: link down, link recovery, port receive errors)?

Can you elaborate on "the same error appears after several hours of calculation"?

Does the link actually go down and recover itself?

Any warnings or errors within the messages file?

Sophie.

Hi Andrew,

For ease, and if you have a support contract with Mellanox, you can open a case via SupportAdmin@mellanox.com. We will then collect all the relevant data and analyse your issue further.

Sophie.

Hello, thanks for your replies! Sorry, I forgot about this topic for a while.

Here is some IB adapter info for one of the problematic nodes; the other nodes are the same (4 are problematic with identical symptoms, the other 6 are good).

-bash-4.1$ lspci | grep Mellanox

02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

-bash-4.1$ ibstat

CA 'mlx4_0'

CA type: MT26428

Number of ports: 1

Firmware version: 2.9.1000

Hardware version: b0

Node GUID: 0x0010dc56000030bc

System image GUID: 0x0010dc56000030bf

Port 1:

State: Active

Physical state: LinkUp

Rate: 10

Base lid: 12

LMC: 0

SM lid: 13

Capability mask: 0x02510868

Port GUID: 0x0010dc56000030bd

Link layer: InfiniBand

-bash-4.1$ ibportstate 13 1

CA PortInfo:

Port info: Lid 13 port 1

LinkState: Active

PhysLinkState: LinkUp

Lid: 13

SMLid: 13

LMC: 0

LinkWidthSupported: 4X (IBA extension)

LinkWidthEnabled: 4X

LinkWidthActive: 4X

LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps or 10.0 Gbps

LinkSpeedActive: 10.0 Gbps

LinkSpeedExtSupported: 14.0625 Gbps

LinkSpeedExtEnabled: 14.0625 Gbps

LinkSpeedExtActive: No Extended Speed

Extended Port info: Lid 13 port 1

StateChangeEnable: 0x00

LinkSpeedSupported: 0x01

LinkSpeedEnabled: 0x01

LinkSpeedActive: 0x00

Driver version from ofed_info -s: MLNX_OFED_LINUX-1.5.3-3.1.0

We tried another driver and an external InfiniBand card (with somewhat older FW); the result was the same.

Here are fresh errors from last night; the calculation crashed at around 4:00.

Here is the output from my monitoring script; this is exactly the moment the error appeared:

[=] Checking for errors, control period is [60 s], started at [03:49:25 18/09/2019]

(Here is the normal situation)

[!] Errors reported for the switch port [18] connected to node [master]

PortRcvSwitchRelayErrors: 42

[!] Errors reported for the entire switch (summary)

PortRcvSwitchRelayErrors: 42

[=] Checking for errors, control period is [60 s], started at [04:03:40 18/09/2019]

(Here is an error)

[!] Errors reported for the switch port [1] connected to node [n01]

SymbolErrorCounter: 65535

LinkDownedCounter: 1

PortXmitDiscards: 1

[!] Errors reported for the switch port [2] connected to node [n02]

PortRcvSwitchRelayErrors: 7

[!] Errors reported for the switch port [18] connected to node [master]

PortRcvSwitchRelayErrors: 48

[!] Errors reported for the entire switch (summary)

SymbolErrorCounter: 65535

LinkDownedCounter: 1

PortRcvSwitchRelayErrors: 55

PortXmitDiscards: 1

Please note that the PortRcvSwitchRelayErrors on master are normal; master is not in the Torque calculation queue, so this doesn't affect the calculation. Node n01 is problematic; n02 had not been associated with any errors until now (and when only good nodes take part in a calculation, no errors except those on master are displayed).
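
For reference, the script is conceptually just a minute-by-minute poll of the switch port counters. A stripped-down sketch of that check (the switch LID here is hypothetical; the real script also maps switch ports to node names via ibnetdiscover):

SWITCH_LID=2    # hypothetical switch LID; the real one comes from ibnetdiscover
while true; do
    for p in $(seq 1 18); do
        # print only the non-zero error counters for this switch port
        perfquery $SWITCH_LID $p \
            | egrep 'SymbolErrorCounter|LinkDownedCounter|PortRcvErrors|PortRcvSwitchRelayErrors|PortXmitDiscards' \
            | grep -vE ':\.+0$'
    done
    sleep 60
done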

================================

Sophie, I don't think we have a support contract. We bought the cluster, with its on-board InfiniBand chips, from an HPC integrator in our country; we didn't buy discrete controllers directly from Mellanox.

Hi Andrew,

This is extremely old, not to mention EOL & EOS…

When did you first notice this issue?

Any changes within this environment?

Does the link actually go down and recover itself?

What is the status of the link when the issue occurs? (ibstat/ifconfig)

Does it cause traffic interruption?

Any warnings or errors within the messages file at that time?

Any warnings or errors within the switch logs?

Any warnings or errors within the application logs?

Is the issue easily reproducible, and does it always point to the same nodes?

Did you notice a pattern when the issue occurred (e.g. time of day, traffic load, etc.)?

Can you reproduce the issue if you for example run ib_write_bw in an infinite loop?
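
Roughly something like this, left running for several hours against the suspect node (just a sketch; the exact perftest options available depend on your OFED release):

# on a known-good node (server side):
while true; do ib_write_bw -s 65536 -n 100000; done
# on the suspect node (client side), pointing at the server node:
while true; do ib_write_bw -s 65536 -n 100000 <server_hostname>; done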

Sophie.

Hello, thanks for your attention.

This is extremely old

Yes, I know :) The cluster itself is from 2012. We are thinking about a complete upgrade, but it's very time-consuming (there is no automated setup script for CentOS 7 plus our needs) and there is absolutely no guarantee that it will improve anything. We have a comfortable environment on this cluster now, so the only serious thing we would fix is this InfiniBand issue. Our company is not large enough to afford a completely new system.

The issue first appeared in the early days, when we had just started to use the machine, but it was very rare. After 3–4 years it became much more frequent; two nodes caused calculation crashes too often, so we excluded them from Torque. Then, after some time, another two nodes began to glitch and were also excluded, so for about 1.5 years now we have been on a stable 6-node configuration. Recently we tweaked some BIOS settings and performed tests again with 10 nodes. The same thing happened with the 2 nodes that had this issue, and it's clear that the other 2 problematic nodes remain problematic as well, so we will return to the 6-node configuration…

Any changes within this environment?

There have been no significant environment changes in all these years. Yes, we added some libraries needed to run the software, added Samba for file transfer, etc., but it's unlikely that this could affect such a low level.

The issue follows particular nodes. If we move HDDs from "good" nodes to "bad" ones, the glitch appears on the same hardware (under another hostname, because it boots from a different HDD), so it doesn't depend on a particular OS/driver installation.

I googled for issues like this and found nothing, so I don't think it's an InfiniBand chip problem: if 40% of these chips were problematic, there would be lots of posts about it on the net, but I found nothing. Maybe it's some northbridge (PCIe) problem in these nodes (like cracked BGA soldering), but other devices connected to PCIe (including the CPU bridge and SATA controller) don't cause obvious problems. These nodes don't hang, etc. So we just don't have any ideas…

Does the link actually go down and recover itself?

Yes. According to the log of my script, which checks the error counters every minute, yesterday evening the link went down on one of the problematic nodes and then recovered by the next check interval (within one minute or less). Only the crashed calculation and some PortXmitDiscards errors on that node's port point to that moment; if I run ibnetdiscover -p now, it shows that all nodes are up.

What would you suggest in our situation? Is there a reason to update the firmware? What could physically cause this issue?

What about the other pointers provided:

What is the status of the link when the issue occurs? (ibstat/ifconfig)

Does it cause traffic interruption?

Any warnings or errors within the messages file at that time?

Any warnings or errors within the switch logs?

Any warnings or errors within the application logs?

Did you notice a pattern when the issue occurred (e.g. time of day, traffic load, etc.)?

Can you reproduce the issue if you for example run ib_write_bw in an infinite loop?

These HCA cards are EOL & EOS, and the latest FW is 2.9.1000. Though why would you think this is a FW issue, as you have 6 stable nodes?

You have already mentioned ruling out cables and switch ports.

Did you try to move the HCA to a different slot on these particular servers?

Maybe perform some PCIe diagnostics.
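
For example, something along these lines could be a starting point (a rough sketch; the bus address 02:00.0 is taken from the lspci output earlier in the thread and may differ per node, and the grep keywords are only guesses at what might appear):

# check the negotiated PCIe link width/speed of the HCA (may need root)
lspci -s 02:00.0 -vvv | egrep 'LnkCap:|LnkSta:'
# scan kernel and system logs around the crash time for PCIe/AER or mlx4 complaints
dmesg | egrep -i 'pcie|aer|mlx4' | tail -n 50
egrep -i 'pcie|aer|mlx4' /var/log/messages | tail -n 50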

I would still review all the logs mentioned above for potential additional hints the next time the issue reproduces.

Hello.

What is the status of the link when the issue occurs?

I believe the status is down, but I can't check it easily because it recovers within one minute and we can't watch the terminal during hours or days of calculation… But because the node name cannot be determined at the exact moment the error occurs, and MPI cannot communicate with the problematic node (I checked the calculation logs), I think the link really does go down for some strange reason. That is also the answer to your question about the application logs; nothing else found there is relevant to this issue.

The main fact is that the problematic node really does become inaccessible to MPI, and its hostname cannot be determined with ibnetdiscover (my script shows an empty name; in the output above I filled the name in to make it more informative. Actually, the problematic node's name shows up empty when the error occurs and is displayed correctly again on the next check interval).
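
One simple way to capture the link status at the moment of failure might be to leave a background loop like this running on the problematic node before a calculation starts (a sketch; the log path and interval are arbitrary):

# log the local HCA port state every 10 s
while true; do
    echo "=== $(date '+%F %T') ===" >> /tmp/ib_state.log
    ibstat mlx4_0 1 | egrep 'State:|Physical state:|Rate:' >> /tmp/ib_state.log
    sleep 10
done &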

Any warnings or errors within the switch logs?

Could you please suggest how to check that? In summary, the link temporarily goes down, which causes the calculation to fail. Are there specific InfiniBand logs I should check to track down the problem? We are not administrators or hardware specialists; our regular administrators only work with Windows machines, so the CFD engineers have to solve cluster problems themselves.

Why would you think this is a FW issue as you have 6 stable nodes?

There may be some parameter that is close to its critical value. For example, in overclocking (I don't do it, but it's a known fact) some CPUs are stable and some are not under identical conditions (voltage and clock speed). Maybe it's something similar here.

Did you try to move the HCA to a different slot on these particular servers?

These are blades with one PCIe slot, and the InfiniBand chips are soldered onto the motherboards. We tried plugging the PCIe InfiniBand card into that slot; it worked, but it didn't help with the issue. There was no difference in error behavior between the built-in and the external InfiniBand controller on the same node.

Did you notice a pattern when the issue occurred?

I would say there is no particular pattern. There was a time when calculations often (but not always!) crashed within minutes of being started, but now they can run for hours or days and then crash.

But there were definitely no errors when I ran synthetic load like ibv_rc_pingpong for a long time. I don't remember the details now, but I recall that, surprisingly, there were no errors under synthetic load; I don't know why. The calculation software is Ansys CFX/Fluent (official versions with an official license); it's used by many engineers, so such an issue would be known. Maybe the type of load matters.