Hello, I have three compute nodes, each equipped with a dual-port Mellanox ConnectX-4 card.
Each node is directly connected to the other two nodes (a sort of hypercube).
If I start a subnet on two nodes, I am able to run an MPI (RDMA) job across those two nodes. If I start two subnets and try to execute my application on all three nodes, the MPI processes start on every compute node, but the job fails after a few seconds.
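For context, each subnet is brought up by running a subnet manager (opensm in my case) bound to the local port GUID, roughly along these lines (the GUIDs below are just placeholders, not the real ones):
# one subnet manager per point-to-point link, bound to the local port GUID
opensm -B -g 0x0002c90300aabb01    # SM for the subnet on port 1
opensm -B -g 0x0002c90300aabb02    # SM for the subnet on port 2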
I tried to follow this suggestion:
https://community.mellanox.com/s/feed/0D51T00006Sn2QqSAJ
but it doesn’t seem to work in my case.
Can anyone help me understand how to configure this kind of setup?
Thank you !
Emanuele
chenh1
April 21, 2019, 8:36am
#2
Hi Emanuele,
Can you please provide the error that you are seeing when the job fails?
Regards,
Chen
Hi Chen,
sorry for the late reply.
I have 3 nodes called DUMBO, TIMOTEO and JIMCORVO.
This is the relevant part of /etc/hosts on Timoteo:
10.10.3.2 TIMOTEO21 TIMOTEO tim-ib
10.10.5.2 TIMOTEO23
10.10.3.1 jimcorvo12 JIMCORVO jim-ib
10.10.4.1 jimcorvo13
10.10.5.3 DUMBO32 DUMBO dumbo-ib
10.10.4.3 DUMBO31
I try to execute the job with the following command:
mpirun -genvall -genv I_MPI_HYDRA_DEBUG 1 -genv I_MPI_FABRICS=shm:ofi -n 24 -ppn 8 -hostfile hostfile ./wrf.exe
The file “hostfile” contains the names of the 3 nodes.
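It is just a plain list of the hostnames, one per line, something like this:
TIMOTEO
JIMCORVO
DUMBO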
The application reports errors like:
Abort(1014056975) on node 7 (rank 7 in comm 0): Fatal error in PMPI_Comm_dup: Other MPI error, error stack:
PMPI_Comm_dup(179)…: MPI_Comm_dup(MPI_COMM_WORLD, new_comm=0x7ffe1f481868) failed
PMPI_Comm_dup(164)…:
MPIR_Comm_dup_impl(57)…:
MPII_Comm_copy_with_info(702)…:
MPIR_Get_contextid_sparse_group(498): Failure during collective
I’m also linking the console output from mpiexec and the pcap file captured by ibdump on one of the two ports on Timoteo:
https://www.dropbox.com/s/z3e1wm9r4njyqfl/mpiexec_output.txt?dl=0
https://www.dropbox.com/s/gmgsjcjdfugx896/sniffer.pcap?dl=0
Please let me know if further details are required
Thanks in advance!
Emanuele