mvapich - mpirun timeouts

I’ve compiled the same code using PGI’s CDK v 12.4. One version is compiled, linked and run w/ mpich, the other w/ mvapich.

Running the mpich version (w/ …/mpich/bin/mpirun) is fine, but running the mvapich version (w/ …/mvapich/bin/mpirun) fails for a large set of cases with

Timeout during client startup.
ERROR: Reached mpirun timeout.  Attempting to cleanup job.
If this job is not an MPI application, you may want to run it
directly (without mpirun) or via "srun --mpi=none", if available.
Killing remote processes...Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.

The same case, re-queued, will eventually run, but the timeouts account for 28 of 72 cases (more than 1 in 3).

Any idea what may cause this? The pgi/cdk was installed as provided by PGI, not built from src.

thx, S.

Hi S,

Not being an MVAPICH user myself, I asked some other PGI Application Engineers for ideas. Here’s one response:

I’ve run into many, many Infiniband timeouts that were eventually resolved by tweaking the environment variables in some manner. Without very verbose output from the run, it’s hard to say exactly what to change.

As far as I know - when the user links with MPICH, they end up running over the Ethernet fabric.

When linking with MVAPICH, they will be using the Infiniband fabric. These messages indicate to me that there is possibly an issue with the Infiniband hardware or software. I’d start by trying to diagnose the fabric with some point-to-point pings and small group broadcasts. There are ways to turn on much more verbose messages to get a better idea of what is going on with the job launch and execution.
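A minimal sketch of the point-to-point part, in case it helps: it only prints one 2-process mpirun_rsh command per adjacent pair of hosts, so each pair can be run by hand and a node that times out shows up in isolation. The function name print_pair_pings and the test program ./pingpong are placeholders, not something from the CDK; the hostfile is assumed to hold one plain hostname per line, and mpirun_rsh is assumed to accept hostnames directly on the command line.

```shell
#!/bin/bash
# Sketch: print (do not run) one 2-rank mpirun_rsh command per adjacent
# pair of hosts in a hostfile.
# Assumptions: one plain hostname per line; blank lines and lines starting
# with '#' are skipped; the second argument is any small 2-rank MPI test
# program built against MVAPICH (hypothetical, supply your own).
print_pair_pings() {
    local hostfile="$1" prog="$2" prev="" host
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while read -r host _ || [ -n "$host" ]; do
        case "$host" in ''|'#'*) continue ;; esac
        if [ -n "$prev" ]; then
            echo "mpirun_rsh -np 2 $prev $host $prog"
        fi
        prev="$host"
    done < "$hostfile"
}
```

Called as print_pair_pings mymachine ./pingpong, it lists the commands for review. Add -rsh or -ssh to the printed lines to match however the jobs are normally launched.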

It can also be the case that MVAPICH environment variables can be set to better tune the fabric for the code the user is running. Some of these variables control the buffer space and the protocols used for different message sizes.

There have also been reports that using tcsh rather than bash to launch MVAPICH jobs works better, though it is unknown why that might be the case.

My suggestion would be to post on the MVAPICH website, as they are much better equipped to help track down MVAPICH issues than we are.

  • Mat

Another engineer asked:

Does he use ssh instead of rsh? That might be the problem. I got timeouts too, so I changed to use -rsh:

mpirun_rsh -rsh -np 2 -hostfile mymachine myname.out

mpirun_rsh -show … will show if it uses rsh or ssh.

Also, if this is being run on a larger cluster, there is some chance that a node or two doesn’t have the user’s ssh key installed. When a job lands on those nodes, it will not run because the user can’t log in, but when it lands on nodes that all allow passwordless login, it will work.
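A quick way to check for that is to try a non-interactive login to every node in the hostfile. The following is only a sketch: check_passwordless_ssh is a made-up helper name, and it assumes the hostfile has one plain hostname per line. BatchMode and ConnectTimeout are standard OpenSSH client options; BatchMode=yes makes ssh fail rather than prompt for a password.

```shell
#!/bin/bash
# Sketch: report which hosts in a hostfile refuse passwordless ssh.
# Assumptions: one plain hostname per line; blank lines and lines starting
# with '#' are skipped.
check_passwordless_ssh() {
    local hostfile="$1" host failed=0
    while read -r host _ || [ -n "$host" ]; do
        case "$host" in ''|'#'*) continue ;; esac
        # -n keeps ssh from swallowing the rest of the hostfile on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$hostfile"
    # Non-zero exit status if any host failed.
    return "$failed"
}
```

For example, check_passwordless_ssh mymachine prints one OK or FAIL line per node. Any FAIL node is a candidate for the jobs that time out at startup.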