MLNX_OFED_LINUX on CentOS 7.2

When I run on 2 nodes I get this error. Any help would be appreciated.

The InfiniBand retry count between two MPI processes has been exceeded. “Retry count” is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the InfiniBand fabric itself. You should note the hosts on which this error has occurred; it has been observed that rebooting or removing a particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI’s behavior with respect to the retry count:

  • btl_openib_ib_retry_count - The number of times the sender will attempt to retry (defaulted to 7, the maximum value).
  • btl_openib_ib_timeout - The local ACK timeout parameter (defaulted to 20). The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.
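As a quick sanity check on that formula, the default exponent of 20 works out to roughly 4.3 seconds (a sketch; `btl_openib_ib_timeout` and `btl_openib_ib_retry_count` are the Open MPI openib BTL MCA parameters named above):

```shell
# Effective local ACK timeout = 4.096 us * 2^btl_openib_ib_timeout.
# Default exponent of 20 -> about 4.3 seconds:
awk 'BEGIN { printf "%.0f\n", 4.096 * 2^20 }'   # prints 4294967 (microseconds)
# Raising the exponent to 23 multiplies the timeout by 8 (~34 seconds):
awk 'BEGIN { printf "%.0f\n", 4.096 * 2^23 }'   # prints 34359738 (microseconds)
```

Both parameters can be overridden at launch, e.g. `mpirun --mca btl_openib_ib_timeout 23 --mca btl_openib_ib_retry_count 7 ...`, which is sometimes used to ride out transient fabric problems while the root cause is investigated.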

Below is some information about the host that raised the error and the peer to which it was connected:

  Local host:   node1
  Local device: mlx5_0
  Peer host:    node2ib

You may need to consult with your system administrator to get this problem fixed.

Primary job terminated normally, but 1 process returned a non-zero exit code… Per user-direction, the job has been aborted.

forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[1590,1],4]
  Exit code:    255

Hi Rasmus,

In general, a retry count error when running MPI jobs may indicate a fabric health issue.

You should check and confirm that the firmware and driver levels on the nodes are the same. I also recommend running the ibdiagnet diagnostic tool and sending the output so that the fabric health can be confirmed, by invoking the following:

    ibdiagnet -r -pc -P all=1 --pm_pause_time 600
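Spelling out that checklist a bit (a sketch; `ofed_info` and `ibstat` ship with MLNX_OFED / infiniband-diags, and the commands below assume you run them on each node, typically as root):

```shell
# Compare driver and firmware levels across nodes -- they should match:
ofed_info -s      # installed MLNX_OFED version
ibstat            # per-HCA state, including the "Firmware version" line

# Then sweep the fabric from one node; -P all=1 flags any port counter
# that increments during the run:
ibdiagnet -r -pc -P all=1 --pm_pause_time 600
```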