MLNX_OFED_LINUX on CentOS 7.2

When I run on 2 nodes I get this error. Any help would be appreciated.

The InfiniBand retry count between two MPI processes has been exceeded. “Retry count” is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI’s behavior with respect to the retry count:

  • btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value).
  • btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 20). The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.
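As a quick sanity check on that formula, the default exponent of 20 works out to roughly 4.3 seconds (a sketch; `btl_openib_ib_timeout` and `btl_openib_ib_retry_count` are the Open MPI openib BTL MCA parameters named above):

```shell
# Effective local ACK timeout = 4.096 us * 2^btl_openib_ib_timeout.
# Default exponent of 20 -> about 4.3 seconds:
awk 'BEGIN { printf "%.0f\n", 4.096 * 2^20 }'   # prints 4294967 (microseconds)
# Raising the exponent to 23 multiplies the timeout by 8 (~34 seconds):
awk 'BEGIN { printf "%.0f\n", 4.096 * 2^23 }'   # prints 34359738 (microseconds)
```

Both parameters can be overridden at launch, e.g. `mpirun --mca btl_openib_ib_timeout 23 --mca btl_openib_ib_retry_count 7 ...`, which is sometimes used to ride out transient fabric problems while the root cause is investigated.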

Below is some information about the host that raised the error and the peer to which it was connected:

  Local host:   node1
  Local device: mlx5_0
  Peer host:    node2ib

You may need to consult with your system administrator to get this problem fixed.

Primary job terminated normally, but 1 process returned a non-zero exit code… Per user-direction, the job has been aborted.

forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[1590,1],4]
  Exit code:    255

Hi Rasmus,

In general, a retry count error when running MPI jobs may indicate a fabric health issue.

You should check and confirm that the firmware and driver levels on the nodes are the same. I also recommend running the ibdiagnet diagnostic tool and sending the output so that the fabric health can be confirmed, by invoking the following:

    ibdiagnet -r -pc -P all=1 --pm_pause_time 600
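Spelling out that checklist a bit (a sketch; `ofed_info` and `ibstat` ship with MLNX_OFED / infiniband-diags, and the commands below assume you run them on each node, typically as root):

```shell
# Compare driver and firmware levels across nodes -- they should match:
ofed_info -s      # installed MLNX_OFED version
ibstat            # per-HCA state, including the "Firmware version" line

# Then sweep the fabric from one node; -P all=1 flags any port counter
# that increments during the run:
ibdiagnet -r -pc -P all=1 --pm_pause_time 600
```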