Problem running an application on multiple nodes in an SR-IOV environment using Open MPI [built from source]

I’m using a Mellanox 56G FDR adapter with SR-IOV under KVM virtualization, and I want to use RDMA to communicate between VMs over the FDR Virtual Functions.

  • Operating system/version: CentOS 7.3
  • Computer hardware: KVM Virtualization
  • Network type: 56G FDR – Virtual Function
  • Open MPI version: 3.0.0 (built from source, see below)

Building Open MPI

wget https://www.open-mpi.org/software/ompi/v3.0/downloads/openmpi-3.0.0.tar.gz

tar -zxf openmpi-3.0.0.tar.gz

mv openmpi-3.0.0 openmpi-3.0.0-src

mkdir openmpi-3.0.0

cd openmpi-3.0.0-src

./configure --prefix=/mnt/lustre_client/pasokan/openmpi-3.0.0/openmpi-3.0.0

make all install
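For anyone reproducing this, a minimal sketch of how the freshly built install could be put on the path and checked for its BTL components. The prefix is the one passed to ./configure above; everything else (the variable name OMPI_PREFIX, checking with ompi_info) is only an illustration, not something that was part of the original steps.

```shell
#!/bin/sh
# Assumption: same prefix as in the ./configure line above.
OMPI_PREFIX=/mnt/lustre_client/pasokan/openmpi-3.0.0/openmpi-3.0.0

# Put the new build first so mpirun and the libraries come from this install.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# List the BTL components that were built (openib should appear if
# the verbs headers were found at configure time).
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i "MCA btl" || true
else
    echo "ompi_info not found under $OMPI_PREFIX/bin"
fi
```

The same two exports would need to be in effect on every node (vcn03 and vcn04 here), otherwise the remote ranks may pick up a different MPI.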

On one node, ./IOR runs with Open MPI, but with two nodes it fails with “[connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2”.
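As a reading aid for the "verbs status 2" part of that message: assuming the number is the raw ibv_wc_status value from the completion (which I believe is what the udcm code prints, but I have not verified it), it can be looked up against the enum in <infiniband/verbs.h>. The helper below, wc_status_name, is a made-up name and only covers a few common values.

```shell
#!/bin/sh
# Hypothetical helper: map a numeric verbs completion status to its
# enum ibv_wc_status name. Only a handful of values are listed.
wc_status_name() {
    case "$1" in
        0)  echo IBV_WC_SUCCESS ;;
        1)  echo IBV_WC_LOC_LEN_ERR ;;
        2)  echo IBV_WC_LOC_QP_OP_ERR ;;
        4)  echo IBV_WC_LOC_PROT_ERR ;;
        5)  echo IBV_WC_WR_FLUSH_ERR ;;
        12) echo IBV_WC_RETRY_EXC_ERR ;;
        *)  echo "unknown ($1), see enum ibv_wc_status in infiniband/verbs.h" ;;
    esac
}

wc_status_name 2   # prints IBV_WC_LOC_QP_OP_ERR
```

Under that assumption, status 2 is a local QP operation error, which would be consistent with the "error modifing QP to RTR" lines in the output below.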

One Node

[root@vcn03 C]# mpirun --allow-run-as-root -np 1 -host vcn03 ./IOR


WARNING: No preset parameters were found for the device that Open MPI

detected:

Local host: vcn03

Device name: mlx5_0

Device vendor ID: 0x02c9

Device vendor part ID: 4114

Default device parameters will be used, which may result in lower

performance. You can edit any of the files specified by the

btl_openib_device_param_files MCA parameter to set values for your

device.

NOTE: You can turn off this warning by setting the MCA parameter

btl_openib_warn_no_device_params_found to 0.


[vcn03][[33605,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

IOR-2.10.3: MPI Coordinated Test of Parallel I/O

Run began: Tue Mar 13 11:50:15 2018

Command line used: ./IOR

Machine: Linux vcn03

Summary:

api = POSIX

test filename = testFile

access = single-shared-file

ordering in a file = sequential offsets

ordering inter file= no tasks offsets

clients = 1 (1 per node)

repetitions = 1

xfersize = 262144 bytes

blocksize = 1 MiB

aggregate filesize = 1 MiB

Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)


write 312.36 312.36 312.36 0.00 1249.44 1249.44 1249.44 0.00 0.00320 EXCEL

read 996.42 996.42 996.42 0.00 3985.69 3985.69 3985.69 0.00 0.00100 EXCEL

Max Write: 312.36 MiB/sec (327.53 MB/sec)

Max Read: 996.42 MiB/sec (1044.82 MB/sec)

Run finished: Tue Mar 13 11:50:15 2018

Two Nodes

[root@vcn03 C]# mpirun --allow-run-as-root -np 2 -host vcn03,vcn04 ./IOR


WARNING: No preset parameters were found for the device that Open MPI

detected:

Local host: vcn04

Device name: mlx5_0

Device vendor ID: 0x02c9

Device vendor part ID: 4114

Default device parameters will be used, which may result in lower

performance. You can edit any of the files specified by the

btl_openib_device_param_files MCA parameter to set values for your

device.

NOTE: You can turn off this warning by setting the MCA parameter

btl_openib_warn_no_device_params_found to 0.


[vcn03][[33640,1],0][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

[vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1235:udcm_rc_qp_to_rtr] error modifing QP to RTR errno says Invalid argument

mlx5: vcn04: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000000 78006802 0a00016f 00005bd2

[vcn04][[33640,1],1][connect/btl_openib_connect_udcm.c:1575:udcm_wait_for_send_completion] send failed with verbs status 2

[vcn04:28705] *** An error occurred in MPI_Send

[vcn04:28705] *** reported by process [2204631041,1]

[vcn04:28705] *** on communicator MPI_COMM_WORLD

[vcn04:28705] *** MPI_ERR_OTHER: known error not in list

[vcn04:28705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

[vcn04:28705] *** and potentially your MPI job)

[vcn03:05349] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found

[vcn03:05349] Set MCA parameter “orte_base_help_aggregate” to 0 to see all help / error messages

[root@vcn03 C]#
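To narrow down whether the failure is specific to the openib BTL's udcm connection setup, two variations of the two-node run could be tried. This is only a diagnostic sketch, not a known fix: the run wrapper is a made-up dry-run helper, and the rdmacm variant assumes IPoIB addresses are configured on the Virtual Functions, which is not stated above.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: prints the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# 1. Bypass openib entirely (TCP plus shared memory). If this works,
#    the job launch and IOR itself are fine and the problem is in the
#    verbs path.
run mpirun --allow-run-as-root -np 2 -host vcn03,vcn04 \
    --mca btl self,vader,tcp ./IOR

# 2. Keep openib but ask for the rdmacm connection method instead of
#    udcm (assumption: IPoIB is up on the VFs).
run mpirun --allow-run-as-root -np 2 -host vcn03,vcn04 \
    --mca btl self,vader,openib \
    --mca btl_openib_cpc_include rdmacm ./IOR
```

Run it as-is to see the commands, or with RUN=1 in the environment to execute them from the directory that contains ./IOR.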