Problem running an application on multiple nodes in an SR-IOV environment using Open MPI [built from source]
I’m using Mellanox 56G FDR with SR-IOV under KVM virtualization, and I want to use RDMA to communicate between VMs over FDR Virtual Functions.
- Operating system/version: CentOS 7.3
- Computer hardware: KVM Virtualization
- Network type: 56G FDR – Virtual Function
- Open MPI version: 3.0.0
Building Open MPI
wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz
tar -zxf openmpi-3.0.0.tar.gz
mv openmpi-3.0.0 openmpi-3.0.0-src
mkdir openmpi-3.0.0
cd openmpi-3.0.0-src
./configure --prefix=/mnt/lustre_client/pasokan/openmpi-3.0.0/openmpi-3.0.0
make all install
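To double-check that the build picked up InfiniBand support, something like the following could be run on each node. This is only a sketch. It assumes `ompi_info` was installed under the prefix used above and that it lists components as `MCA btl: <name>` lines. The `has_openib_btl` helper is a name made up for illustration.

```shell
# Sketch: check whether an Open MPI install includes the openib BTL.
# PREFIX mirrors the --prefix passed to ./configure above; adjust if yours differs.
PREFIX=/mnt/lustre_client/pasokan/openmpi-3.0.0/openmpi-3.0.0

# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# if a line for the openib BTL component is present.
has_openib_btl() {
    grep -q 'MCA btl: openib'
}

if [ -x "$PREFIX/bin/ompi_info" ]; then
    if "$PREFIX/bin/ompi_info" | has_openib_btl; then
        echo "openib BTL: present"
    else
        echo "openib BTL: missing"
    fi
else
    echo "ompi_info not found under $PREFIX/bin"
fi
```

Running the same check on every node also helps confirm that all of them see the same install.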
On one node, ./IOR runs with Open MPI, but on two nodes it fails with “[connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2”.
One-node run
[root@vcn03 C]# mpirun --allow-run-as-root -np 1 -host vcn03 ./IOR
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: vcn03
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4114
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
[vcn03][[33605,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
IOR-2.10.3: MPI Coordinated Test of Parallel I/O
Run began: Tue Mar 13 11:50:15 2018
Command line used: ./IOR
Machine: Linux vcn03
Summary:
api = POSIX
test filename = testFile
access = single-shared-file
ordering in a file = sequential offsets
ordering inter file= no tasks offsets
clients = 1 (1 per node)
repetitions = 1
xfersize = 262144 bytes
blocksize = 1 MiB
aggregate filesize = 1 MiB
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)
write 312.36 312.36 312.36 0.00 1249.44 1249.44 1249.44 0.00 0.00320 EXCEL
read 996.42 996.42 996.42 0.00 3985.69 3985.69 3985.69 0.00 0.00100 EXCEL
Max Write: 312.36 MiB/sec (327.53 MB/sec)
Max Read: 996.42 MiB/sec (1044.82 MB/sec)
Run finished: Tue Mar 13 11:50:15 2018
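The "No preset parameters" warning in the output above names the MCA parameter that silences it. A minimal sketch of two ways to set it, assuming the same host name and `./IOR` binary as in the run above. It only hides the warning and does not address the QP error.

```shell
# Option 1: set the parameter through the environment.
# Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
export OMPI_MCA_btl_openib_warn_no_device_params_found=0

# Option 2: pass the parameter on the mpirun command line.
# Guarded so the snippet does nothing where mpirun or ./IOR is absent.
if command -v mpirun >/dev/null 2>&1 && [ -x ./IOR ]; then
    mpirun --allow-run-as-root \
        --mca btl_openib_warn_no_device_params_found 0 \
        -np 1 -host vcn03 ./IOR
fi
```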
Two-node run
[root@vcn03 C]# mpirun --allow-run-as-root -np 2 -host vcn03,vcn04 ./IOR
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: vcn04
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4114
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
[vcn03][[33640,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
[vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument
mlx5: vcn04: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 78006802 0a00016f 00005bd2
[vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2
[vcn04:28705] *** An error occurred in MPI_Send
[vcn04:28705] *** reported by process [2204631041,1]
[vcn04:28705] *** on communicator MPI_COMM_WORLD
[vcn04:28705] *** MPI_ERR_OTHER: known error not in list
[vcn04:28705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[vcn04:28705] *** and potentially your MPI job)
[vcn03:05349] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[vcn03:05349] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@vcn03 C]#
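For reference, the "verbs status 2" in the failing send appears to be a libibverbs work-completion status. As far as I can tell from the `ibv_wc_status` enum in verbs.h, 2 corresponds to `IBV_WC_LOC_QP_OP_ERR` (local QP operation error). The sketch below decodes the code from a log line. The function names `wc_status_name` and `status_from_log` are made up for illustration, and only the first few enum values are listed.

```shell
# Hypothetical helper: map a work-completion status code to its enum name.
# Only the first few values of enum ibv_wc_status are covered here.
wc_status_name() {
    case "$1" in
        0) echo IBV_WC_SUCCESS ;;
        1) echo IBV_WC_LOC_LEN_ERR ;;
        2) echo IBV_WC_LOC_QP_OP_ERR ;;
        3) echo IBV_WC_LOC_EEC_OP_ERR ;;
        4) echo IBV_WC_LOC_PROT_ERR ;;
        5) echo IBV_WC_WR_FLUSH_ERR ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Hypothetical helper: extract the numeric status from an openib BTL log line on stdin.
status_from_log() {
    sed -n 's/.*send failed with verbs status \([0-9][0-9]*\).*/\1/p'
}

# The failing line from the two-node run above.
line='[vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2'
code=$(printf '%s\n' "$line" | status_from_log)
wc_status_name "$code"
```

If that reading is right, the send fails locally on the queue pair, which would be consistent with the earlier "error modifing QP to RTR" messages on both hosts.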