I’m not sure why OpenMPI cannot find my HCAs, given that the system reports the following state (see the verbs-level checks sketched after the listing):
PCI devices:
DEVICE_TYPE        MST   PCI       RDMA     NET              NUMA
ConnectX5(rev:0)   NA    5e:00.0   mlx5_0   net-enp94s0f0    0
ConnectX5(rev:0)   NA    5e:00.1   mlx5_1   net-enp94s0f1    0
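For reference, this is a rough sketch of how I would confirm that the verbs layer actually sees those ports on the node (command names assume a standard MOFED/libibverbs install; output omitted here):

ibdev2netdev                 # should map mlx5_0/mlx5_1 to enp94s0f0/enp94s0f1
ibv_devinfo -d mlx5_0        # port state, link layer and firmware of the first HCA
ibstat mlx5_0                # same information via infiniband-diags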
Calling the app with either {mpirun -np 36 --mca btl_openib_if_include mlx5_0:1 -x UCX_NET_DEVICES=mlx5_0:1 -x HCOLL_MAIN_IB=mlx5_0:1 app.exe}, {mpirun -np 36 --mca btl_openib_if_include mlx5_0 -x UCX_NET_DEVICES=mlx5_0 -x HCOLL_MAIN_IB=mlx5_0 app.exe}, or {mpirun -np 36 --mca btl_openib_if_include mlx5_1 -x UCX_NET_DEVICES=mlx5_1 -x HCOLL_MAIN_IB=mlx5_1 app.exe} always returns an error message (UCX-side checks are sketched after the warnings):
[1588295158.027413] [baseHPCbench:26725:0] ucp_context.c:690 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'eno2'(tcp)
[1588295187.676307] [baseHPCbench:27101:0] ucp_context.c:690 UCX WARN network device 'mlx5_0' is not available, please use one or more of: 'eno2'(tcp)
[1588295315.261353] [baseHPCbench:28270:0] ucp_context.c:690 UCX WARN network device 'mlx5_1' is not available, please use one or more of: 'eno2'(tcp)
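Since UCX only offers 'eno2'(tcp), I assume the UCX library picked up by OpenMPI was built or loaded without the InfiniBand transports. A rough sketch of the checks that suggests (assuming MOFED's ucx_info and OpenMPI's ompi_info are on the PATH):

ucx_info -v                 # UCX version and build configuration
ucx_info -d | grep -i mlx   # transports/devices UCX can actually open
ompi_info | grep -i ucx     # confirm OpenMPI was built with the UCX PML
mpirun -np 36 --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 app.exe   # force the UCX PML once the device is visible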
I upgraded to CentOS 7.8 with MOFED 5.0-2.1.8.0-rhel7.8-x86_64 and everything seems to start OK, but I’m still seeing the same issue. The status command returns:
I apologize for the long delay; several other matters required my immediate attention. Excluding 'btl_openib_if_include' didn’t make any difference, as OpenMPI-UCX is still unable to find the device (same error messages). The device (at least mlx5_0) is there: