Mat, Dave, thank you for your kind replies.
I have another question about the OpenMPI 1.10.2 that is bundled with the PGI Community Edition.
The following is a simple MPI test program (testmpi.f90):
program mpitest
   use mpi
   implicit none
   integer :: id, proc, ierr, namelen
   character(len=MPI_MAX_PROCESSOR_NAME) :: processor_name

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, id, ierr)      ! rank of this process
   call mpi_comm_size(mpi_comm_world, proc, ierr)    ! total number of processes
   call mpi_get_processor_name(processor_name, namelen, ierr)
   write(*,*) " Hello ! I am : ", id, " on ", processor_name
   call mpi_finalize(ierr)
end program
Command to compile the code : mpif90 testmpi.f90
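For reference, to confirm which MPI installation the wrapper and launcher pick up, something like the following should work (assuming the PGI-bundled OpenMPI is first in PATH on every node):
Command : mpif90 --showme
Command : mpirun --version
The first prints the underlying compile line, the second the Open MPI version.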
Test 01 : Run the code on 3 nodes (master,slave1,slave2)
Command : mpirun -host master,slave1,slave2 -np 3 ./a.out
The result is as follows:
Hello ! I am : 0 on master
Hello ! I am : 1 on slave1
Hello ! I am : 2 on slave2
Test 02 : Run the code on 3 nodes (master,slave1,slave3)
Command : mpirun -host master,slave1,slave3 -np 3 ./a.out
The result is as follows:
Hello ! I am : 0 on master
Hello ! I am : 1 on slave1
Hello ! I am : 2 on slave3
Test 03 : Run the code on 3 nodes (master,slave2,slave3)
Command : mpirun -host master,slave2,slave3 -np 3 ./a.out
The result is as follows:
Hello ! I am : 0 on master
Hello ! I am : 1 on slave2
Hello ! I am : 2 on slave3
Test 04 : Run the code on 4 nodes (master,slave1,slave2,slave3)
Command : mpirun -host master,slave1,slave2,slave3 -np 4 ./a.out
The run fails with the following error message:
[slave3:15722] [[7177,0],3] tcp_peer_send_blocking: send() to socket 10 failed: Broken pipe (32)
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
- not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
- lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
- the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
- compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
- an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
The error message suggests that the communication between the master and the slave3 node has a problem. However, Tests 01-03 show that there is no communication problem between the master and slave3.
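To rule out the first cause listed in the error message (PATH or LD_LIBRARY_PATH not set for the remote node), a quick check along these lines should also work, assuming OpenMPI is installed under the same path on every node and passwordless ssh is set up (which mpirun already requires):
Command : ssh slave3 which orted
Command : ssh slave3 'echo $PATH; echo $LD_LIBRARY_PATH'
If orted resolves to the PGI-bundled 1.10.2 installation in a non-interactive shell, an environment problem on slave3 alone seems unlikely.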
I then changed the OpenMPI version from 1.10.2 to 2.0.2 and ran Test 04 again.
The result is now correct:
Hello ! I am : 0 on master
Hello ! I am : 1 on slave1
Hello ! I am : 2 on slave2
Hello ! I am : 3 on slave3
Can someone answer this question? Is this a bug in OpenMPI 1.10.2?
Thanks
Neo